Linked Data 101

What is the point of using Linked Data?

The point of Linked Data is for webmasters to publish data in open, standardized formats that facilitate reuse by others. Others may include customers, suppliers and partners. If you’re a government authority, publishing Linked Data allows your data to be more readily used by other government agencies, the research community, and the general public.

Organizations employing Linked Data are improving data quality, shortening development cycles, and significantly reducing maintenance costs. Enterprises typically realize a return on investment on Linked Data projects within 6-12 months.

With traditional 3 tier data architectures, 60% of the full cost of an application is in application and data maintenance. Linked Data based solutions cost a small fraction of traditional applications due to efficiencies in data re-use, data exchange standards and cloud computing becoming more common. Unlike proprietary approaches, there is no vendor lock-in nor vendor dominance.

3-Minute Briefing: Semantic Web Elevator Pitch

Where can I see a demonstration of Linked Data?

3 Round Stones operates an online Linked Data demonstration site at demo.3roundstones.net. This demonstration provides information regarding nuclear power plants located in the United States. It contains data gleaned from DBpedia, OpenStreetMap, SEC Info, the U.S. Environmental Protection Agency’s Facilities Registry System, Substance Registry System and Toxic Release Inventory, and an Abt Associates report on corporate ownership.

The purpose of this demonstration is to show the benefits of combining data from multiple sources and the ease and speed of creating Web applications using Callimachus.

Why should we publish Linked Data when we already publish our data in a variety of formats on the Web?

Publishing Linked Data provides data in open, standardized formats that facilitate reuse by other government agencies and departments and by third parties such as journalists, academic, non-profit and corporate researchers, and the general public. Generalized data sharing is made possible by the use of an international data exchange standard, the Resource Description Framework (RDF). Data in RDF allows for rapid combination of information from multiple data sources, including DBpedia (the RDF version of Wikipedia) and literally thousands of Linked Open Data sets available on the Web.
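
To make the idea of combining sources concrete, here is a minimal sketch in Python that models RDF statements as plain (subject, predicate, object) tuples. All URIs and values are hypothetical and exist only for illustration; a real project would use an RDF library and published vocabularies. Because both sources use the same URI for the county, the two data sets combine with a simple set union.

```python
# Hypothetical namespace and facts, for illustration only.
EX = "http://example.org/"

# Source A: facility records. Each triple is (subject, predicate, object).
facilities = {
    (EX + "plant/1", EX + "name", "Example Plant"),
    (EX + "plant/1", EX + "locatedIn", EX + "county/42"),
}

# Source B: published independently, but reusing the same county URI.
counties = {
    (EX + "county/42", EX + "name", "Example County"),
    (EX + "county/42", EX + "population", 125000),
}

# Combining two sets of triples is a set union; the shared URI is the join key.
merged = facilities | counties


def objects(graph, subject, predicate):
    """Return all objects recorded for a given subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def county_population(graph, plant):
    """Follow plant -> county -> population across the merged sources."""
    return [
        population
        for county in objects(graph, plant, EX + "locatedIn")
        for population in objects(graph, county, EX + "population")
    ]
```

Calling `county_population(merged, EX + "plant/1")` answers a question that neither source could answer alone, which is the practical payoff of shared identifiers.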

Publishing as RDF allows people to rapidly visualize ad hoc queries on maps, in tables, bar charts and many other common business views. Publishing Linked RDF is what Tim Berners-Lee, the inventor of the Web, calls “5 star” Linked Data. Linked Data publishing and use by enterprises has resulted in cost reductions in development time, deployment cycle, and maintenance compared with traditional data sharing mechanisms.

How do I know that Linked Data is not a passing fad?

Commercial companies recognize the market opportunity and are investing millions of dollars of R&D budget to create and support production tools for improved creation, discovery and visualization of data on the Web. These companies include Google, Oracle, IBM, Microsoft and Facebook. The UK Parliament pioneered publishing Linked Data with the backing and support of Sir Tim Berners-Lee, the inventor of the World Wide Web. Additionally, governments such as those of the US, Sweden, Germany, France, Spain, New Zealand and Australia are adopting Linked Data as a data publication and consumption model for Open Government Initiatives. The BBC is using Linked Data to operate large sections of its Web site and also used it to report on the last Olympics.

While the term “Linked Data” is relatively new (circa 2007), it is based on international standards and technologies that have been formally and comprehensively presented, discussed and peer reviewed by literally hundreds of academic institutions, technology companies, and government agencies from around the world through the World Wide Web Consortium (W3C) for well in excess of a decade.

How long do projects typically take to implement?

A Linked Data approach implies “cooperation without coordination.” This means that members of an organization are not required to agree on a schema, either in advance or at any point in the development effort. Instead, a Linked Data approach recognizes that there is no one way to describe an organization, its products or services. It accepts that individuals possess knowledge within their area of expertise and that they should be able to describe business processes, rules and their data with both flexibility and standards.

Are there any gotchas?

There are several issues when adopting Linked Data that could become ‘gotchas’:

a) Care must be taken to avoid biasing the value of high quality datasets by tightly coupling them to specific high profile applications. In short, do not do to your Linked Data what MDM did to your relational databases. Recognize that the core benefits of Linked Data involve the combination of data with data from other sources. Successful Linked Data projects produce generic, reusable data that may be combined with data from other sources to allow applications not yet conceived. Think reuse, not specific uses.

b) Openly publishing data, whether Linked Data or not, must be undertaken under appropriate licensing that is as unambiguous, unrestrictive, and realistic as possible. We offer more detail below under “Risks”.

c) Avoid “triplifying” data by automatic script. Triplifying data by script is not the same as creating well-structured Linked Data suitable for building applications. Proper data modeling is an essential first step. Automatically generating billions of RDF “triples” and publishing them on the Web is not the same as producing high quality data sets of properly modeled data.

d) People and organizations experienced with data modeling in RDF are still relatively rare.

As one embarks on an effort to convert a dataset, what are the factors that determine conversion cost?

a) What makes a dataset complex or simple

Decisions on exposing a data set should be based on usefulness to others. Usefulness is a measure of the data set’s ability to be used by others: within an agency, between agencies, and by the public. Only data of general usefulness should generally be published as Linked Data; agency- or application-specific data is not always useful to the rest of the world.

The following are indicators of, but not hard and fast rules about, what factors impact the time required to expose data that is useful to others.

  1. More complex relational data models (say 60 or more tables) are more time consuming to model, so the time required is slightly longer (though measured in weeks, not months or years);
  2. Data sets that require prior knowledge of the agency organization, e.g., structure, regulations, workflow, internal vocabularies (e.g., for naming conventions), require meetings between data modelers and internal specialists;
  3. Domain specific data sets (e.g., geography, chemistry, physics or complex regulation) may require specialized domain expertise that may be harder to find or schedule.

b) What does it cost to host RDF?

There is no one-size-fits-all answer on pricing; however, the factors are all familiar to IT managers and procurement departments. The cost of modeling Linked Data and hosting it is based on several components including:

  1. Time required to remodel data, typically measured in several weeks;
  2. Frequency and size of updates;
  3. Access, including query volume;
  4. Applications (if applicable) based on Linked Data sets.

Technology teams accustomed to managing hardware, networking, and service level agreements for traditional 3 tier applications will find similar components in hosting a Linked Data service.

Hosting is quickly becoming commoditized in terms of pricing. The value proposition should focus more on the service level agreement, patches and upgrades, security and other features that are vital to any production data or application service.

Data consisting of millions of rows in a relational database is typically easy and inexpensive to host. There are economies of scale: hosting more data sets is not necessarily proportionally more expensive.

The value proposition of using Linked Data lies not in a lower cost to host the data (RDF triples) but in the ability to provide high availability, production managed services using Linked Data.

There are software-as-a-service options which provide an easy scalable option in the early stages while enabling analysis of medium to long-term possibilities as the profile of the data and its use is established.

c) What is the cost of putting up a new version of the data?

This is dependent on the quantity of data, quality of modeling, frequency and size of updates, plus the ability of the chosen store to take live updates. The cost is often negligible and included in the cost of a hosting contract.

Once the data is properly modeled, scripts are run to automatically convert data to Linked Data (as RDF triples) on a routine basis (e.g., hourly, daily or weekly).

What about costs with a Linked Data approach?

Traditional data warehouse projects require significant upfront coordination. The cost of vocabulary creation and/or schema alignment, creating data dictionaries and building applications involves teams of typically 6-12 analysts, data modelers, programmers and security specialists. By comparison, Linked Data applications are typically modeled within a 30 day sprint and applications can be created in hours. With emerging tools that are commercially supported, developers can host Linked Data applications on the cloud. Thus, a reasonably complex data set can be modeled, converted to Linked Data, and made available with powerful navigation and visualization features in less than two months on average.

I’ve heard that in the future, graph data may make relational data bases obsolete. Is that true? Please explain.

No, that is not true. They are different tools for different jobs. Both are very necessary to managing data in the modern information technology landscape.

Relational databases are excellent at providing highly tuned access to structured, pre-defined data for typically pre-defined queries.  There will be a significant need for this for a long time to come; they will not be obsolete in the foreseeable future.  RDBMS are well suited for what they do well.

Relational systems, and applications built on relational databases, do not excel in handling data that is pre-defined in terms of neither model nor relationship. You should consider a graph or Linked Data view of data when you are exploring how data is inter-related in order to learn about trends, patterns or things implicit in the data. This is particularly relevant to intelligence applications, scientific research and many other types of applications where you don’t exactly know in advance what you are looking for.

Relational databases are difficult and expensive to combine. Linked Data approaches are making rapid inroads in areas where data must be combined from multiple relational databases. Again, relational databases and Linked Data complement each other in such scenarios.

We believe there will continue to be a place for both relational- and graph-based data, each supporting the other both within the enterprise and externally via the World Wide Web, as Linked Data.

As a manager, should I consider developing in-house expertise to assist program SMEs with producing Linked Data? If I do, how can my organization assess the quality of what is being produced?

It is likely that your organization already has contractors and in-house staff familiar with the organization’s important data assets.  Familiarity with Linked Data tools and techniques will come with time and should not be considered daunting.

Once data is converted to RDF, there are Web based tools and interfaces to explore and view the data. One of the important features of Linked Data is that developers can programmatically query the data through a SPARQL endpoint, allowing them to view the content. This is similar to, but more flexible than, a “view” in SQL. SPARQL query capability can be locked down for use by only authenticated personnel, or can be made available more widely, depending upon the use case. There are techniques that are identical to the validation process performed on relational data by developers and data curators.
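
As a rough illustration of querying a SPARQL endpoint programmatically, the sketch below builds an HTTP GET request in the style of the W3C SPARQL Protocol using only the Python standard library. The endpoint address is a placeholder, not a real service, and the query simply lists ten arbitrary statements; both are assumptions made for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; substitute the address of a real SPARQL service.
ENDPOINT = "http://example.org/sparql"

# A generic query that returns any ten statements held by the store.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)

# To run against a live endpoint (requires network access):
#   import json
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       results = json.load(response)
#   for row in results["results"]["bindings"]:
#       print(row)
```

An endpoint restricted to authenticated personnel would additionally need credentials on the request; how those are supplied depends on the service and is not shown here.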

The tools are different for Linked Data; however, the concepts for data validation will be familiar to anyone who is a relational database professional. Your data experts will recognize the data in the RDF format. There are Linked Data tools, such as Callimachus, an Open Source Linked Data management system, that let you “follow your nose” and explore the data, which is very powerful. Callimachus has a wiki-like interface and a class-based template engine that allows you to visualize and create Linked Data easily and quickly. With proper authentication, a user can also update the underlying data in the graph database, which is very useful.

Access to a small agile team who can guide and work within the agency on Linked Data issues is advisable. Practically speaking, not every agency will be able to have an in-house expert on Linked Data.  However, if the agency has an office of information access and/or management (such as EPA’s OEI), it is logical that this team’s responsibilities would include participation with other agencies, standards groups (W3C, others), and at conferences discussing best practices with agency and other teams supporting the agency’s mission.

What should the next steps be? What (if any) training would our staff need?

Although Linked Data is no more complex than traditional data modeling, it does require a different way of thinking focused on expressing relationships through URIs. Just as an agency works with an in-house or contractor data modeling expert, the same would be true with Linked Data. Both subject domain expertise and specialization in data modeling strategy and tactics are required.

Introductory training, best applied to small groups over a few days typically scheduled over 6-8 calendar weeks, is needed to discuss the differences between traditional vocabulary development and modeling approach for Linked Data. Experience has shown that these new generic techniques are then best supported and developed with situation specific workshops and/or mentoring as confidence grows.

What hardware/software purchases are necessary?

Many of the tools used for Linked Data are open source, including simple scripts and operating system commands, the use of which is openly shared within a community on the Web. There is no cost associated with such tools. The storing and publishing of Linked Data can be handled by a simple Web server. However, many more benefits flow from being able to query that data, which requires it to be held in a Linked Data store, or RDF database. These are available as open source or proprietary services that you can host yourself or as a platform-as-a-service (PaaS) managed service.

Correct configuration of Web servers to publish Linked Data (e.g. using correct Content-Type information) is essential to reuse.  Failure to understand Web standards can compromise an otherwise useful implementation.  Therefore, care should be taken to have Linked Data reviewed by someone with relevant experience.
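
The Content-Type point can be sketched as follows. This minimal Python example picks which RDF serialization to serve based on a client’s Accept header. The four media types are the registered ones for Turtle, RDF/XML, N-Triples and JSON-LD; the function names and the order of preference are assumptions for illustration, and a production server would normally rely on its Web framework’s own content negotiation.

```python
# Registered media types for common RDF serializations, in an assumed
# order of server preference (Turtle first).
RDF_MEDIA_TYPES = (
    "text/turtle",
    "application/rdf+xml",
    "application/n-triples",
    "application/ld+json",
)


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        entries.append((media_type, q))
    # sorted() is stable, so ties keep the client's original order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def choose_content_type(accept_header):
    """Return the Content-Type to serve, or None if nothing is acceptable."""
    preferences = parse_accept(accept_header or "*/*")
    refused = {media_type for media_type, q in preferences if q <= 0}
    for media_type, q in preferences:
        if q <= 0:
            continue
        if media_type in RDF_MEDIA_TYPES:
            return media_type
        if media_type == "*/*":
            for candidate in RDF_MEDIA_TYPES:
                if candidate not in refused:
                    return candidate
    return None
```

A result of `None` corresponds to answering with HTTP status 406 (Not Acceptable). Partial wildcards such as `text/*` are deliberately not handled in this sketch.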

What human capital / infrastructure needs are required to support this kind of work?

This depends on the size of organization, the amount of data, and the rate of change to data.  Experience has shown that having a small group/team of Linked Data aware people who can evangelize, help, support, guide and monitor a wider organization works well.

What are the steps to creating a data-driven application?

In practice, this approach requires speaking with one group at a time and exposing each RDBMS via either a real-time SPARQL query or a periodic dump that is converted to an RDF format. Next, applications are rapidly created by Web developers using data-driven application tools and one or more Linked Data sets.
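
The periodic dump route might look like the following minimal sketch, which reads rows from a relational table and writes them out as N-Triples. It uses an in-memory SQLite database so that it is self-contained. The table, its columns, the base URI and the column-to-predicate mapping are all hypothetical; in a real project the mapping is the product of the data modeling step, not something a script should guess.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical URI namespace

# Hypothetical outcome of data modeling: which column maps to which predicate.
MAPPING = {
    "name": "http://www.w3.org/2000/01/rdf-schema#label",
    "state": BASE + "vocab/state",
}


def escape_literal(text):
    """Escape a string for use inside an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(connection, table, key_column, mapping):
    """Yield one N-Triples line per mapped, non-NULL cell of a table."""
    # The table name comes from trusted configuration, never from user input.
    cursor = connection.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        # The primary key becomes part of the subject URI.
        subject = f"<{BASE}{table}/{record[key_column]}>"
        for column, predicate in mapping.items():
            value = record[column]
            if value is None:
                continue
            yield f'{subject} <{predicate}> "{escape_literal(str(value))}" .'


# Demonstration data standing in for one group's relational database.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE facility (id INTEGER PRIMARY KEY, name TEXT, state TEXT)"
)
connection.execute(
    "INSERT INTO facility VALUES (?, ?, ?)", (1, 'Example "Unit 1" Plant', "VA")
)

lines = list(rows_to_ntriples(connection, "facility", "id", MAPPING))
```

Each entry in `lines` is a complete statement that can be appended to a dump file and loaded into an RDF store. All values are emitted as plain string literals here; typed literals and links between resources are left out to keep the sketch short.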