Planet RDF

It's triples all the way down
Updated: 17 weeks 5 days ago

NYMUG: New York Mark Logic Users Group!

Wed, 2009-11-04 20:32
The inaugural meeting of the New York Mark Logic User Group will take place on Wednesday evening, 11 November 2009.
Categories: Semantic Web

DocBook V5.0

Wed, 2009-11-04 19:55
DocBook V5.0 is an OASIS Standard!
Categories: Semantic Web

Stuart Harrison Talks with Talis about Lichfield and Public Data

Wed, 2009-11-04 15:14

In a first for the Platform, we’ve done a video podcast. In it, I talk with Stuart Harrison (@pezholio), webmaster at Lichfield District Council.

We talk about what local authorities here in the UK can do with open data, how Linked Data plays an important role, and what more is needed for local government to provide better web-based services.

This conversation was recorded on 27th October.
For other Talis podcasts in this Nodalities series, see here

Categories: Semantic Web

DC-2009 papers and presentations published

Tue, 2009-11-03 00:59
2009-11-02, The DC-2009 conference, held in Seoul, Korea, 12-16 October 2009, featured high quality tutorials, keynotes, conference papers and workshop sessions. The event was attended by around 100 participants from 18 countries and territories who engaged in lively discussions around the theme of Semantic Interoperability of Linked Data. The conference proceedings are available in the DCMI Conference Paper Repository and many of the presentations are linked from the program page on the conference Web site. DCMI wishes to thank the National Library of Korea and the Korean Library Association for the splendid organization of the event.
Categories: Semantic Web

Semantics for the Rest of Us

Mon, 2009-11-02 17:25

We had a successful workshop at ISWC 2009 last week, titled "Semantics for the Rest of Us -- Variants of Semantic Web Languages in the Real World" (proceedings). Sandro Hawke from W3C gave the keynote talk, and we also had a panel discussion, chaired by Jim Hendler (my position statement slides are here).

Many thanks to everyone who participated (it was standing room only!). My particular thanks go to Lalana Kagal of MIT who did most of the work to get this workshop together.

Categories: Semantic Web

Promise hold (NYT and the LOD)

Mon, 2009-11-02 11:41

I was at the

Categories: Semantic Web

Evernote

Sun, 2009-11-01 22:51
With a scanner and some Python, I'm an enthusiastic convert to Evernote.
Categories: Semantic Web

RDF 2 Wishlist

Sat, 2009-10-31 01:28

Here’s what I think should be standardized at some point, soon, in the Semantic Web infrastructure. These items are at various levels of maturity; some are probably ready for a W3C Working Group right now, while others are in need of research. They are mostly orthogonal and most can be handled in independent efforts. (I would lean against forming a single RDF Working Group to handle all of this; that would be slower, I think.)

To be clear, when I say “RDF 2” I mean it like

Categories: Semantic Web

Linked data at the New York Times: Exciting, but buggy

Fri, 2009-10-30 22:19

Yesterday at the International Semantic Web Conference, Evan Sandhaus of the New York Times unveiled data.nytimes.com, a site that publishes linked data for some parts of the Times’ index. To me, this was one of the most exciting announcements at the conference, and it caused quite a tweetstorm during and after Evan’s talk.

A bit of background: Every article published in the newspaper or on the website is tagged, classified and categorized in many ways by skilled editors. This metadata allows the creation of topic pages that automatically collect relevant articles for notable people, organisations, and events. Examples include Michelle Obama, Swine Flu (H1N1 Virus) and Wrestling.

What’s in the data? The dataset published yesterday contains information on each of the concepts that have a topic page. For now, it is limited to topic pages about people. The concepts are modelled in SKOS. The information attached to each concept consists mostly of links: to DBpedia, to Freebase, into the Times API (which is not available as RDF at this point), and of course to the corresponding topic page. This means that if you have a DBpedia URI for an especially notable entity, a high-quality New York Times topic page with the latest news about the topic is only two RDF links away. A notable feature of the links is that every single one has been manually reviewed, making this perhaps the highest-quality linkset in the LOD cloud.

How to get the data? This being linked data, every concept has a dereferenceable URI. Examples:

The site’s URI scheme follows one of the Cool URIs recipes: The identifiers above are resolvable, and by using content negotiation, web browsers are redirected to

http://data.nytimes.com/N13941567618952269073.html

which has a nicely formatted summary of the data available about Michelle Obama. Data browsers and other RDF-enabled clients, on the other hand, are redirected to

http://data.nytimes.com/N13941567618952269073.rdf

which has all the data goodness in RDF/XML.

There is also a dump: people.rdf. You can browse the data starting from the data.nytimes.com page. Everything is available under a CC-BY license.

Bugs and problems

This being a new dataset and the Times’ first foray into linked data, it turns out that the Beta label on the site is quite warranted. I will highlight four issues.

Data and metadata are mixed together. Let’s look at the data about Michelle Obama, available at the N13941567618952269073.rdf URI above. I’m reformatting the data into Turtle for legibility.

<http://data.nytimes.com/N13941567618952269073> a skos:Concept;
    skos:prefLabel "Obama, Michelle";
    skos:definition "Michelle Obama is the first …";
    skos:inScheme nyt:nytd_per;
    nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html>;
    owl:sameAs <http://rdf.freebase.com/rdf/en.michelle_obama>;
    owl:sameAs <http://data.nytimes.com/obama_michelle_per>;
    owl:sameAs <http://dbpedia.org/resource/Michelle_Obama>;

This makes perfect sense: it’s data about a person, modelled as a SKOS concept. But then it goes on:

<http://data.nytimes.com/N13941567618952269073> dc:creator "The New York Times Company";
    time:start "2007-05-18"^^xsd:date;
    time:end "2009-10-08"^^xsd:date;
    dcterms:rightsHolder "The New York Times Company"^^xsd:string;
    cc:license "http://creativecommons.org/licenses/by/3.0/us/" .

This is not data about Michelle Obama the person; it’s metadata about the data published by the NYT. It’s certainly not true that Michelle Obama was created by the New York Times, or that she “started” in 2007 (whatever that’s supposed to mean), and don’t even get me started on asserting rights or a license over a person.

Note that the NYT team actually went through the effort of setting up separate URIs for Michelle the person (http://data.nytimes.com/N13941567618952269073), and for the HTML and RDF documents describing the concepts (http://data.nytimes.com/N13941567618952269073.html and http://data.nytimes.com/N13941567618952269073.rdf). The reason why linked data experts advocate this practice of having separate URIs is exactly because it enables separation of data and metadata: It lets you state some facts about the concepts, and other things about the documents that describe the concepts. This is what should be done in the data above: The metadata should not be asserted about the URI identifying Michelle, but about the URI identifying the document published by the NYT: N13941567618952269073.rdf. So we would get:

<http://data.nytimes.com/N13941567618952269073> a skos:Concept;
    skos:prefLabel "Obama, Michelle";
    skos:definition "Michelle Obama is the first …";
    skos:inScheme nyt:nytd_per;
    nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html>;
    owl:sameAs <http://rdf.freebase.com/rdf/en.michelle_obama>;
    owl:sameAs <http://data.nytimes.com/obama_michelle_per>;
    owl:sameAs <http://dbpedia.org/resource/Michelle_Obama> .

<http://data.nytimes.com/N13941567618952269073.rdf> dc:creator "The New York Times Company";
    time:start "2007-05-18"^^xsd:date;
    time:end "2009-10-08"^^xsd:date;
    dcterms:rightsHolder "The New York Times Company"^^xsd:string;
    cc:license "http://creativecommons.org/licenses/by/3.0/us/" .

Eric Hellman has a post about this issue, calling it “a potential legal disaster” because a license is attached to a resource that’s said to be the same as a resource on a different site (DBpedia and Freebase). He’s a bit alarmist, but this example highlights why the separation of data and metadata, of concept URIs and document URIs, is critically important in a general-purpose data model.
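The same fix can be sketched in a few lines of Python. This is only an illustration of the idea, not anything the Times runs: the triples are plain tuples with prefixed names, and the set of "metadata" predicates is an assumption taken from the example above.

```python
# Concept URI (the person) and document URI (the RDF file describing her).
CONCEPT = "http://data.nytimes.com/N13941567618952269073"
DOCUMENT = CONCEPT + ".rdf"

# Assumption for this sketch: these predicates describe the published
# document, not the person.
METADATA_PREDICATES = {
    "dc:creator", "time:start", "time:end",
    "dcterms:rightsHolder", "cc:license",
}

# A few of the triples from the example, all asserted about the concept URI.
TRIPLES = [
    (CONCEPT, "skos:prefLabel", "Obama, Michelle"),
    (CONCEPT, "owl:sameAs", "http://dbpedia.org/resource/Michelle_Obama"),
    (CONCEPT, "dc:creator", "The New York Times Company"),
    (CONCEPT, "cc:license", "http://creativecommons.org/licenses/by/3.0/us/"),
]


def separate_metadata(triples, concept, document, metadata_predicates):
    """Move metadata statements from the concept URI to the document URI."""
    fixed = []
    for subject, predicate, obj in triples:
        if subject == concept and predicate in metadata_predicates:
            subject = document
        fixed.append((subject, predicate, obj))
    return fixed


if __name__ == "__main__":
    for triple in separate_metadata(TRIPLES, CONCEPT, DOCUMENT, METADATA_PREDICATES):
        print(triple)
```

After the rewrite, the license and creator are stated about the .rdf document, so the owl:sameAs links no longer drag them over to DBpedia or Freebase.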

Distinguishing URIs and literals. Here are some selected snippets from the RDF/XML output:

<nyt:topicPage>http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html</nyt:topicPage>
<cc:License>http://creativecommons.org/licenses/by/3.0/us/</cc:License>
<cc:Attribution>http://data.nytimes.com/N13941567618952269073</cc:Attribution>

The values of all three properties are URIs. In the RDF data model, URIs are of such central importance that they are treated differently from any other kind of value (strings, integers, dates). But not so in the code example above. There, the three URIs are encoded as simple strings. This should be:

<nyt:topicPage rdf:resource="http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html" />
<cc:License rdf:resource="http://creativecommons.org/licenses/by/3.0/us/" />
<cc:Attribution rdf:resource="http://data.nytimes.com/N13941567618952269073" />

Why does this matter? It’s basically like making links “clickable” in HTML by putting them into a <a href="…"> tag: RDF clients will not recognize URIs if they are encoded as literals, and will not know that they can treat them as links that can be followed.
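To see the difference from a client's point of view, here is a minimal sketch using only Python's standard XML parser (a real RDF library would do the same job properly). The sample document is made up for illustration, and the nyt: namespace URI in it is an assumption.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Illustrative document: the same property written once as a literal and
# once as a resource. The nyt: namespace URI is an assumption for this sketch.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:nyt="http://data.nytimes.com/elements/">
  <rdf:Description rdf:about="http://data.nytimes.com/N13941567618952269073">
    <nyt:topicPage>http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html</nyt:topicPage>
    <nyt:topicPage rdf:resource="http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html" />
  </rdf:Description>
</rdf:RDF>"""


def classify_values(rdf_xml):
    """Return (property tag, kind, value) per property; kind is 'uri' or 'literal'."""
    root = ET.fromstring(rdf_xml)
    results = []
    for description in root.findall("{%s}Description" % RDF_NS):
        for prop in description:
            resource = prop.get("{%s}resource" % RDF_NS)
            if resource is not None:
                # rdf:resource marks the value as a URI that can be followed.
                results.append((prop.tag, "uri", resource))
            else:
                # Element text is just a string, even if it looks like a URI.
                results.append((prop.tag, "literal", (prop.text or "").strip()))
    return results


if __name__ == "__main__":
    for tag, kind, value in classify_values(SAMPLE):
        print(kind, value)
```

Both values contain the same characters, but only the second is reported as a URI; a linked data client has no reason to follow the first.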

Content negotiation for hybrid clients. As usual for linked data emitting sites, there is content negotiation on the concept URIs: They redirect either to RDF or HTML, based on the Accept header sent by the client when resolving the URI via the HTTP protocol. Also as usual for first-time linked data producers, the content negotiation is a bit broken.

Here is what happens when I ask for HTML (using cURL, which is a handy tool for debugging the HTTP behaviour of linked data sites):

$ curl -I -H "Accept: text/html" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.html

Next I will ask for RDF:

$ curl -I -H "Accept: application/rdf+xml" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.rdf

So far, so good. But many clients are “hybrid”: they can consume both RDF and HTML. This includes many tools that can consume RDFa (RDF embedded in HTML pages). So it’s not uncommon to find tools that combine multiple media types in the Accept header. The Times server should also redirect those tools to the RDF, because any RDF-consuming client can probably handle the raw RDF data better than the (not overly useful) HTML pages. But let’s see what happens:

$ curl -I -H "Accept: text/html,application/rdf+xml" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.rdf.html

The server redirects to a file that doesn’t exist, ending in .rdf.html. This is pretty funny to me as a programmer, because the bug gives me a glimpse into the Times codebase, where obviously a programmer didn’t consider that the two alternatives (sending HTML or sending RDF) are mutually exclusive.

Update: Someone at the Times seems to be working on the server as I’m writing this; the latest behaviour is even worse; it redirects to .rdf.html even if I request only RDF, and uses 301 redirects instead of 303.
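For illustration, here is a minimal Python sketch of negotiation logic that always picks exactly one representation. It is not the Times' code, which I haven't seen; the function names are made up, and the policy (send RDF whenever the client accepts it at all, otherwise HTML) is simply the behaviour argued for above, not full HTTP content negotiation.

```python
def parse_accept(header):
    """Parse an Accept header into a {media_type: q} dictionary."""
    prefs = {}
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in pieces[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        prefs[media_type] = q
    return prefs


def choose_suffix(accept_header):
    """Return exactly one suffix, '.rdf' or '.html', never both."""
    prefs = parse_accept(accept_header)
    if prefs.get("application/rdf+xml", 0.0) > 0.0:
        return ".rdf"
    return ".html"


def redirect_location(concept_uri, accept_header):
    """Build the Location header value for the 303 redirect."""
    return concept_uri + choose_suffix(accept_header)


if __name__ == "__main__":
    uri = "http://data.nytimes.com/N13941567618952269073"
    for accept in ("text/html", "application/rdf+xml", "text/html,application/rdf+xml"):
        print(accept, "->", redirect_location(uri, accept))
```

Because the choice is made in a single if/else, a hybrid Accept header ends up at the .rdf document rather than at a concatenation of both suffixes.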

Using the Creative Commons schema. The NYT data uses the Creative Commons schema to license the data under CC-BY. Here’s the relevant RDF, in Turtle (I fixed the subject URI and turned literals into URIs where appropriate):

<http://data.nytimes.com/N13941567618952269073.rdf>
    cc:License <http://creativecommons.org/licenses/by/3.0/us/>;
    cc:Attribution <http://data.nytimes.com/N13941567618952269073>;
    cc:attributionName "The New York Times Company" .

This uses three properties: cc:License, cc:Attribution and cc:attributionName. But according to the schema, cc:License and cc:Attribution are classes, not properties. This should be:

<http://data.nytimes.com/N13941567618952269073.rdf>
    cc:license <http://creativecommons.org/licenses/by/3.0/us/>;
    cc:attributionURL <http://data.nytimes.com/N13941567618952269073>;
    cc:attributionName "The New York Times Company" .

Summary. The Times’ foray into linked data is an exciting new development, but it also shows how hard it is to get linked data right. This is a weakness of the linked data approach.

Can we do anything about this? Better tutorials and education can probably help. Another activity that is trying to address the issue is the Pedantic Web Group, a loose collection of people like me who obsess about the technical details of publishing data on the web and work with data publishers to get issues like the above fixed. We might even give you a hand with reviewing your stuff before you go live with it.
Categories: Semantic Web

New York Times publishes Linked Open Data

Fri, 2009-10-30 19:00

Like many newspapers, the New York Times links the first mention of well-known entities in its articles to a reference page. For example, a mention of Barack Obama links to a page which is a collection of basic information on President Obama and links to relevant stories and other resources that the Times has created.

Now the Times is also using RDF to publish some of this information as linked open data. Yesterday the Times announced the publication of an LOD collection covering about 5,000 people at http://data.nytimes.com/ under a Creative Commons 3.0 Attribution License, and it plans to put its full collection of 30K topics online soon.

“Over the last several months we have manually mapped more than 5,000 person name subject headings onto Freebase and DBPedia. And today we are pleased to announce the launch of http://data.nytimes.com and the release of these 5,000 person name subject headings as Linked Open Data.

Over the next several months, we plan to expand http://data.nytimes.com to include each of the nearly 30,000 subject headings we use to power Times Topics pages, a collection that includes locations, organizations and descriptors in addition to person names.”

Categories: Semantic Web

DERI success at ISWC

Fri, 2009-10-30 17:20

DERI secured a number of awards at this year's International Semantic Web Conference. The prizes are:

  • Winner, Best In-Use paper: “Produce and Consume Linked Data with Drupal!” Stephane Corlosquet, Renaud Delbru, Tim Clark, Axel Polleres, Stefan Decker
  • 3rd Place, Semantic Web Challenge: “Sig.ma: live views on the Web of Data?” Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk and Stefan Decker (presented by Richard Cyganiak)
  • 2nd Place, Semantic Web Challenge: “Interactive Exploration of Web Datasets with VisiNav”, Andreas Harth (ex-DERI member who recently joined AIFB)
  • Best paper award at the 5th International Workshop on SW-Enabled Software Engineering (http://www.abdn.ac.uk/~r01srt7/swese2009/): “Implementing Semantic Web applications: reference architecture and challenges”, Benjamin Heitmann, Sheila Kinsella, Conor Hayes and Stefan Decker
Categories: Semantic Web

ISWC2009 4-5

Fri, 2009-10-30 00:59

Fourth day

Shame on me, but I missed the morning keynote… I was a bit late arriving to the conference site and I got stuck in a conversation at breakfast. Things happen…

The most notable event in the morning, at least for me, was the SPARQL WG panel. All members of the Working Group (me included) were on the panel and the room was full. I mean full: people were standing in the back. I regard that as a success by itself; it shows not only the overall importance of SPARQL, but the real interest around the new version, i.e., SPARQL 1.1 (in case you have missed it, the

Categories: Semantic Web

Public Library of Science deploys RDFa on all articles

Thu, 2009-10-29 21:11

Check out the PLoS Journals, like PLoS Medicine or PLoS Genetics, and you’ll find RDFa for all bibliographic information, including authors, categories, etc.

Categories: Semantic Web

Diane Mueller talks about financial data, XBRL and the Semantic Web

Thu, 2009-10-29 18:37

In my latest podcast I talk with Diane Mueller of JustSystems.

We discuss the evolution of the eXtensible Business Reporting Language (XBRL), consider some of the use cases for it, and look at its relevance to the Semantic Web community.

Towards the end of the conversation we refer to a recent workshop organised by XBRL International and the World Wide Web Consortium (W3C).

This conversation was recorded on Tuesday 22 September, 2009.

For other Talis podcasts in this Nodalities series, see here

Categories: Semantic Web

ISWC2009 2-3

Wed, 2009-10-28 14:50

Second day

In fact, there is much less to say… In the morning I was on two workshops; I was at the

Categories: Semantic Web

Some of my (very) preliminary opinions on Google Wave

Wed, 2009-10-28 14:33

I was interviewed by Marie Boran from Silicon Republic recently for an interesting article she was writing entitled “Will Google Wave topple the e-mail status quo and change the way we work?”. I thought my longer answers might be of interest, so I am pasting them below.

Disclaimer: My knowledge of Google Wave is second hand through various videos and demonstrations I’ve seen… Also, my answers were written pretty quickly!

As someone who is both behind Ireland’s biggest online community boards.ie and a researcher at DERI on the Semantic Web, are you excited about Google Wave?

Technically, I think it’s an exciting development – commercially, it obviously provides potential for others (Google included) to set up a competing service to us (!), but I think what is good is the way it has been shown that Google Wave can integrate with existing platforms. For example, there’s a nice demo showing how Google Wave plus MediaWiki (the software that powers the Wikipedia) can be used to help editors who are simultaneously editing a wiki page. If it can be done for wikis, it could aid with lots of things relevant to online communities like boards.ie. For example, moderators could see what other moderators are online at the same time, communicate on issues such as troublesome users, posts with questionable content, and then avoid stepping on each other’s toes when dealing with issues.

Does it have potential for collaborative research projects? Or is it heavyweight/serious enough?

I think it has some potential when combined with other tools that people are using already. There’s an example from SAP of Google Wave being integrated with a business process modelling application. People always seem to step back to e-mail for doing various research actions. While wikis and the like can be useful tools for quickly drafting research ideas, papers, projects, etc., there is that element of not knowing who is doing stuff at the same time as you. Just as people are using Gtalk to augment Gmail by being able to communicate with contacts in real time when browsing e-mails, Google Wave could potentially be integrated with other platforms such as collaborative work environments, document sharing systems, etc. It may not be heavyweight enough on its own but at least it can augment what we already use.

Where does Google Wave sit in terms of the development of the Semantic Web?

I think it could be a huge source of data for the Semantic Web. What we find with various social and collaborative platforms is that people are voluntarily creating lots of useful related data about various objects (people, events, hobbies, organisations) and having a more real-time approach to creating content collaboratively will only make that source of data bigger and hopefully more interlinked. I’d hope that data from Google Wave can be made available using technologies such as SIOC from DERI, NUI Galway and the Online Presence Ontology (something we are also working on).

If we are to use Google Wave to pull in feeds from all over the Web will both RSS and widgets become sexy again?

I haven’t seen the example of Wave pulling in feeds, but in theory, what I could imagine is that real-time updating of information from various sources could allow that stream of current information to be updated, commented upon and forwarded to various other Waves in a very dynamic way. We’ve seen how Twitter has already provided some new life for RSS feeds in terms of services like Twitterfeed automatically pushing RSS updates to Twitter, and this results in some significant amounts of rebroadcasting of that content via retweets etc.

Certainly, one of the big things about Wave is its integration of various third-party widgets, and I think once it is fully launched we will see lots of cool applications building on the APIs that they provide. There have been a few basic demonstrator gadgets shown already like polls, board games and event planning, but it’ll be the third-party ones that make good use of the real-time collaboration that will probably be the most interesting, as there’ll be many more people with ideas compared to some internal developers.

Is Wave the first serious example of a communications platform that will only be as good as the third-party developers that contribute to it?

Not really. I think that title applies to many of the communications platforms we use on the Web. Facebook was a busy service but really took off once the user-contributable applications layer was added. Drupal was obviously the work of a core group of people but again the third-party contributions outweigh those of the few that made it.

We already have e-mail and IM combined in Gmail and Google Docs covers the collaborative element so people might be thinking ‘what is so new, groundbreaking or beneficial about Wave?’ What’s your opinion on this?

Perhaps the real-time editing and updating process. Often, it’s difficult to go back in a conversation and add to or fix something you’ve said earlier. But it’s not just a matter of rewriting the past: you can also go back and see what people said before they made an update (“rewind the Wave”).

Is Google heading towards unified communications with Wave, and is it possible that it will combine Gmail, Wave and Google Voice in the future?

I guess Wave could be one portion of a UC suite but I think the Wave idea doesn’t encompass all of the parts…

Do you think Google is looking to pull in conversations the way FriendFeed, Facebook and Twitter do? If so, will it succeed?

Yes, certainly Google have had interests in this area with their acquisition of Jaiku some time back (everyone assumed this would lead to a competitor to Twitter; most recently they made the Jaiku engine available as open source). I am not sure if Google intends to make available a single entry point to all public waves that would rival Twitter or Facebook status updates, but if so, it could be a very powerful competitor.

Is it possible that Wave will become as widely used and ubiquitous as Gmail?

It will take some critical mass to get it going; integrating it into Gmail could be a good first step.

And finally – is the game changing in your opinion?

Certainly, we’ve moved from frequently updated blogs (every few hours/days) to more frequently updated microblogs (every few minutes/seconds) to being able to not just update in real-time but go back and easily add to / update what’s been said any time in the past. People want the freshest content, and this is another step towards not just providing content that is fresh now but a way of freshening the content we’ve made in the past.

Categories: Semantic Web

Linked Data Flows: A new picture to illustrate the “openness” we mean

Wed, 2009-10-28 12:47

(Original post taken from “About the Social Semantic Web”)

A lot of activities around Linking Open Data (“LOD”) and the associated data sets, which are nicely visualised as a “cloud”, have been going on for quite a while now. It is exciting to see how the rather academic “Semantic Web” and all the work associated with this disruptive technology can now be transformed into real business use cases.

What I have observed in the last few months, especially in business communities, is the following:

  • “Linked Data” sounds interesting for the business people because the phrase creates a lot of associations in a second or two; also the database crowd seems to be attracted by this web-based approach of data integration
  • “Web of Data” is somehow misleading because many people think that this will be a new web which replaces something else. Same story with the “Semantic Web”
  • “Linking Open Data” sounds dangerous and not trustworthy to many companies

For insiders it is clear that the “openness” of data, especially in commercial settings, can be controlled and has to be controlled in many cases, i.e. by defining the right licensing models. But here we are still at the beginning, as a workshop at ISWC 2009 has illustrated.

Anyway, looking at the characteristics of Linked Data Flows, they can be one-way or mutual. In some cases data from companies will be put into the cloud and can be opened up for many purposes; in other use cases it will stay inside the boundaries. In other scenarios only (open) data from the web will be consumed and linked with corporate data, but no data will be exposed to the world (except the fact that data was consumed by an entity).

And of course: On many other occasions datasets and repositories will be opened up partly depending on the CCs (or similar, not yet defined attributes) and the underlying privacy regulations one wants to use.

This makes clear that LOD / Linking Open Data is just one detail of a bigger picture. Since companies (and governments) play a crucial role to develop the whole infrastructure, we need to draw a new picture that illustrates the various Linked Data Flows in a better way:

Concluding from this, the best thing would be to talk about Linked Data in general and refer to Linking Open Data only in the right context. Despite better knowledge, business people still associate the term “open” with “free” and “dubious provenance”. And given the fact that hardly anybody has given hard evidence on the ROI of open business models, the “open argument” counts for little in a time of decreasing economic prosperity.

So what would be critical to get the Linked Data thing running is to provide the corresponding business and licensing models for your Linked Data strategy. But this includes having a good understanding of the assets you want to capitalize. Given the fact that metadata assets are still a novel and vastly unexplored business field which so far lacks a regulated supply and demand structure, there are still lots of structural obstacles that hinder the uptake of Linked Data. Providing more of the same in a laissez-faire mode, as TimBL criticized at this year’s Web 2.0 Summit, might be inspiring for the in-crowd, but it might not be sufficient to build a linked data business.

Categories: Semantic Web

OWL 2 becomes a W3C recommendation

Wed, 2009-10-28 05:05

OWL 2, the new version of the Web Ontology Language, officially became a W3C standard yesterday. From the W3C press release:

“Today W3C announces a new version of a standard for representing knowledge on the Web. OWL 2, part of W3C’s Semantic Web toolkit, allows people to capture their knowledge about a particular domain (say, energy or medicine) and then use tools to manage information, search through it, and learn more from it. Furthermore, as an open standard based on Web technology, it lowers the cost of merging knowledge from multiple domains.”

Categories: Semantic Web

European Commission and the Data Overflow

Tue, 2009-10-27 19:29

The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big data.

Since the questionnaire is public, I am publishing my answers below.

  1. Data and data types

    1. What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015?

      Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of RDF and linked data principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around DBpedia identifiers for real-world entities, e.g., most things that would be in the news.

      This also means that internal data processes and silos may be enriched with this content. There is consequent pressure for accommodating more diversity of data, with more flexible schema.

      Ultimately, all content presently stored in RDBs and presented in publicly accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like.

      The volume of the well-known linked data sets is around 10 billion statements. With the above-mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable. This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.

      Relevant sections of this mass of data are a potential addition to any present or future analytics application.

      Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the next years even more than the continued classical warehouse growth.

      Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. Data and metadata should travel together but may have different data models.

      By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas.

    2. What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or knowledge graphs, 3D, sensor streams...)?

      All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.

      Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.
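
      As a minimal sketch of what such a combined query involves, the following Python filters an in-memory list of photos on social, temporal, geospatial, numeric, and content conditions at once. Everything here is illustrative: the `Photo` record, the rough Ibiza bounding box, and the `is_sunset` flag (standing in for an image test run next to the data) are assumptions, not part of any particular DBMS.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Photo:
    owner: str
    taken: date
    lat: float
    lon: float
    megapixels: float
    is_sunset: bool  # stand-in for an image test evaluated where the blob lives


# Rough bounding box around Ibiza (illustrative values only).
IBIZA = (38.8, 39.2, 1.1, 1.7)  # min_lat, max_lat, min_lon, max_lon


def in_box(photo, box):
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= photo.lat <= max_lat and min_lon <= photo.lon <= max_lon


def sunsets_by_friends(photos, friends, start, end, min_megapixels=20):
    """Apply all conditions in one pass instead of one system per data type."""
    return [
        p for p in photos
        if p.owner in friends              # social graph condition
        and start <= p.taken <= end        # temporal condition
        and in_box(p, IBIZA)               # geospatial condition
        and p.megapixels > min_megapixels  # plain numeric condition
        and p.is_sunset                    # content-based condition
    ]


photos = [
    Photo("alice", date(2009, 7, 14), 38.98, 1.43, 21.0, True),
    Photo("bob", date(2009, 7, 15), 38.98, 1.43, 24.0, True),    # not a friend
    Photo("alice", date(2009, 1, 3), 38.98, 1.43, 21.0, True),   # wrong season
    Photo("carol", date(2009, 8, 2), 48.85, 2.35, 22.0, True),   # not in Ibiza
    Photo("carol", date(2009, 8, 3), 39.0, 1.5, 12.0, True),     # too few pixels
]
result = sunsets_by_friends(
    photos, {"alice", "carol"}, date(2009, 6, 1), date(2009, 8, 31)
)
```

      In a real system each condition would be backed by its own index type, and the point of a single DBMS is that the optimizer can order them by selectivity.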

      Interleaving of all database functions and types becomes increasingly important.

  2. Industries, communities

    1. Who is producing these data and why? Could they do it better? How?

      Right now, projects such as Bio2RDF, Neurocommons, and DBPedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. These processes, along with the linked data meme generally taking off, drive demand for better NLP (Natural Language Processing), e.g., entity and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBPedia URIs).

      Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for information integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a URI. Some of this information may even be published on an extranet for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.
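
      The idea of giving everything a URI can be sketched in a few lines of Python. This is a hedged illustration only, loosely in the spirit of a direct row-to-triples mapping; it is not the working group's standard, and the `http://example.org/crm/` namespace, the table and column names, and the helper functions are all hypothetical.

```python
from urllib.parse import quote

BASE = "http://example.org/crm/"  # hypothetical namespace, not from the text
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def uri(table, key):
    """Mint a URI for a row from its table name and primary key."""
    return f"<{BASE}{table}/{quote(str(key), safe='')}>"


def literal(value):
    """Render a value as a quoted literal, escaping backslashes and quotes."""
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"'


def row_to_triples(table, key_column, row, foreign_keys=None):
    """Turn one relational row (a dict) into (subject, predicate, object) triples.

    foreign_keys maps a column name to the table it references, so that the
    column value becomes a link to another entity instead of a literal.
    """
    foreign_keys = foreign_keys or {}
    subject = uri(table, row[key_column])
    triples = [(subject, RDF_TYPE, f"<{BASE}schema/{table}>")]
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # the key is in the URI already; NULLs yield no triple
        predicate = f"<{BASE}schema/{table}#{column}>"
        if column in foreign_keys:
            obj = uri(foreign_keys[column], value)
        else:
            obj = literal(value)
        triples.append((subject, predicate, obj))
    return triples


customer = {"id": 7, "name": "Acme", "account_manager": 3, "fax": None}
triples = row_to_triples(
    "customer", "id", customer, {"account_manager": "employee"}
)
```

      Because the foreign key becomes a URI rather than a number, an email or support ticket that mentions the same employee URI joins with the CRM data without any further schema work.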

    2. Who is consuming these data and why? Could they do it better? How?

      Consumers are various. The greatest need is for tools that summarize complex data and allow getting a bird's eye view of what data is in the first instance available. Consuming the data is hindered by the user not even necessarily knowing what data there is. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with SQL report generators and statistics packages.

      Where Web 2.0 made the citizen journalist, the web of linked data will make the citizen analyst. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. We may envision a "meshup economy" where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.

    3. What industrial sectors in Europe could become more competitive if they became much better at managing data?

      Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media.

    4. Is the regulation landscape imposing constraints (privacy, compliance ...) that don't have today good tool support?

      The regulation landscape drives database demand through data retention requirements and the like.

      With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. Regulation is needed to protect individuals, but integration should still be possible for science.

      For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help.
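
      A rough sketch of what a policy-based, on-the-fly-anonymizing view over triples could look like follows. The roles, the `ex:` predicates, and the salted-hash pseudonym are assumptions made for illustration; a production system would enforce this inside the database and would also need to pseudonymize objects that refer to people.

```python
import hashlib

IDENTIFYING = {"ex:name"}     # hypothetical predicates that reveal identity
SENSITIVE = {"ex:diagnosis"}  # hypothetical predicates restricted by policy


def pseudonym(subject, salt="demo-salt"):
    """Replace a subject with a stable, salted pseudonym."""
    digest = hashlib.sha256((salt + subject).encode("utf-8")).hexdigest()
    return "anon:" + digest[:12]


def view(triples, role):
    """Yield only the triples the given role may see, anonymized as needed."""
    for s, p, o in triples:
        if role == "clinician":
            yield (s, p, o)  # full access
        elif role == "researcher":
            if p in IDENTIFYING:
                continue  # drop names, keep medical facts
            yield (pseudonym(s), p, o)
        else:
            if p in IDENTIFYING or p in SENSITIVE:
                continue  # anyone else sees only harmless facts
            yield (pseudonym(s), p, o)


data = [
    ("ex:patient1", "ex:name", "Jane Doe"),
    ("ex:patient1", "ex:diagnosis", "J45"),
    ("ex:patient1", "ex:birthYear", "1970"),
]
for_research = list(view(data, "researcher"))
for_public = list(view(data, "public"))
```

      The same pseudonym is produced for every triple about one subject, so a researcher can still join facts about a patient without learning who the patient is.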

      More research is needed for reconciling the need for security with the advantages of broad-based ad hoc integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user's profile. This is a tall order and implementing something of the sort is an open question.

    5. What are the main practical problems identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.

      We have come across the following:

      • Knowing that the data exists in the first place.
      • If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like.
      • Compatible subject matter but incompatible representation: For example, one source has numbers on a map, with different maps for different points in time; another has time series of instrument data with the geo-location of the instrument. It is only to be expected that the time interval between measurements is not the same. So there is a need for a lot of one-off programming to align data.
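
      The kind of one-off alignment code meant in the last point can be sketched as follows, assuming two series of (time, value) pairs sampled at different intervals. Linear interpolation is only one possible choice and is an assumption here; the function names and sample numbers are illustrative.

```python
from bisect import bisect_left


def interpolate(series, t):
    """Linearly interpolate a time-sorted list of (time, value) pairs at time t.

    Returns None when t lies outside the range covered by the series.
    """
    if not series:
        return None
    times = [ts for ts, _ in series]
    if t < times[0] or t > times[-1]:
        return None
    i = bisect_left(times, t)
    if times[i] == t:
        return series[i][1]  # exact hit, no interpolation needed
    (t0, v0), (t1, v1) = series[i - 1], series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def align(reference, other):
    """Pair each reference sample with the other series' value at that time."""
    rows = []
    for t, v in reference:
        w = interpolate(other, t)
        if w is not None:
            rows.append((t, v, w))
    return rows


# Map snapshots every 6 hours; instrument readings every 4 hours.
snapshots = [(0, 1.0), (6, 2.0), (12, 3.0)]
instrument = [(0, 10.0), (4, 14.0), (8, 18.0), (12, 22.0)]
aligned = align(snapshots, instrument)
```

      Even this toy version has to decide what to do outside the overlapping time range and which series sets the clock, which is why such code tends to be rewritten for every pair of data sets.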

      Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. Computation needs to go to the data, and databases need to support this.

  3. Services, software stacks, protocols, standards, benchmarks

    1. What combinations of components are needed to deal with these problems?

      Recent times have seen a proliferation of special-purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is a need to gather the currently separate functionality into an integrated system with sufficient flexibility. We see some of this in the integration of map-reduce and scale-out databases. The former antagonists have become partners. Vertica, Greenplum, and OpenLink Virtuoso are examples of DBMSs featuring work in this direction.

      Interoperability and at least de facto standards in ways of doing this will emerge.

    2. What data exchange and processing mechanisms will be needed to work across platforms and programming languages?

      HTTP, XML, and RDF are in fact very verbose, yet these are the formats and models that have uptake. Thus, these will continue to be used even though one might think binary formats to be more efficient.

      There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.

      For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.

    3. What data environments are today so wastefully messy that they would benefit from the development of standards?

      RDF and OWL are not messy but they could use some more performance; we are working on this. SPARQL is finally acquiring the capabilities of a serious query language, so things are slowly coming together.

      Community process for developing application domain specific vocabularies works quite well, even though one could argue it is ad hoc and not up to what a modeling purist might wish.

      Top-down imposition of standards has a mixed history, with long and expensive development and sometimes little or no uptake; consider some WS-* standards, for example.

    4. What kind of performance is expected or required of these systems? Who will measure it reliably? How?

      Relational databases have a history of substantial investment in optimization and some of them are very good for what they do, e.g., the newer generation of analytics databases.

      The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.

      These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.

      We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today's best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with MonetDB gives clues as to how these aims can be reached.

      The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.

      Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like TPC-H, but the TPC-H rules do not allow official reporting of such results.

      Development of benchmarks for RDF, complex queries, and inference is needed. A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs.

      If benchmarks exist and are neither too easy, nor inaccessibly difficult, nor too expensive to run (think of the high-end TPC-C results), then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: Everybody runs the TPC workloads internally but few publish.

      Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.

      Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding.

  4. Usability and training

    1. How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?

      In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. For the linked data web, the same will take place behind SPARQL.

      Beyond these interfaces, things are harder. Programming an arbitrary algorithm with MPI so that it makes good use of a cluster platform, for example, is quite difficult. The casual amateur is hereby warned.

      There is no single solution. Explicit, programmatic parallelization, with MPI for example, scales very poorly in terms of required skill, so we should favor declarative and/or functional approaches that lend themselves to automatic parallelization.
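
      To illustrate the difference in required skill, here is a small word count written in a functional style: the program only states a per-partition function and an associative merge, and the runtime decides how to schedule the work. The thread pool from Python's standard library is used purely as a stand-in for a cluster runtime, and the function names and sample data are made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(partition):
    """Map step: count words in one partition (a list of lines)."""
    return Counter(word for line in partition for word in line.split())


def merge(a, b):
    """Reduce step: combine two partial counts; order does not matter."""
    return a + b


def word_count(partitions, max_workers=4):
    """Run the map step over all partitions, then fold the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_counts = pool.map(count_words, partitions)
        return reduce(merge, partial_counts, Counter())


partitions = [
    ["linked data", "data web"],
    ["web of linked data"],
]
totals = word_count(partitions)
```

      Nothing in `count_words` or `merge` mentions workers, messages, or locations, so the same two functions could be handed to a different executor without change. With explicit message passing, the partitioning and communication would be the programmer's job.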

      Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.

      For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.

      For shipping functions in a cluster or cloud, the BOOM (Berkeley Orders Of Magnitude) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a PHP developer can be made to do logic programming.

      This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding MySQL, inconvenient as this may be, for an example. There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, post hoc, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so.

      One part of the answer is an infinitely scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency, and built-in map-reduce. If such a thing is inexpensive enough and syntax-level-compatible with the present installed base, many developers do not have to learn very much more.

      This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.

    2. How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?

      For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.

      RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, as far as possible, be deployed in the usual technology stack, LAMP or Java. Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.

      A lot of the semantic web potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.

      For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM as the programming model would be easy enough to learn and utilize.

      The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.

      As the data overflow proceeds, its victims will multiply and create demand for solutions. The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.

      If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination.

      The problem is creating the pull.

  5. Challenges

    1. What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, Google Lunar X Prize, etc. ... ?

      The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better — reality is consistently more complex and surprising than imagination. Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.

      Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, especially if the race is tight.

      The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.

      The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work closer. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain.

      If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.

    2. What should one do to set up such a challenge, administer, and monitor it?

      The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed.

      The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&D. The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.

      There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases, or in TREC for example, has taken some years to reach a point of slowdown.

      Administering such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.

Categories: Semantic Web