Planet RDF
Putting open Facebook data into Linked Data Cloud
I recently built a proof-of-concept demo for getting Facebook data (public data only) into LOD via their recently announced Graph API. The demo is available at http://sam.tw.rpi.edu/ws/face_lod.html.
It is fairly straightforward to convert the JSON object into RDF and make the URI dereferenceable. Now the data are linkable, but not yet linked to other LOD data.
I did see some issues when I was assigning RDF properties. Here is an example JSON from http://graph.facebook.com/cocacola:
{
  "id": "40796308305",
  "name": "Coca-Cola",
  "picture": "http://profile.ak.fbcdn.net/object3/1853/100/s40796308305_2334.jpg",
  "link": "http://www.facebook.com/coca-cola",
  "category": "Consumer_products",
  "username": "coca-cola",
  "products": "Coca-Cola is the most popular and biggest-selling soft drink in history, as well as the best-known product in the world.\n\nCreated in Atlanta, Georgia, by Dr. John S. Pemberton, Coca-Cola was first offered as a fountain beverage by mixing Coca-Cola syrup with carbonated water. Coca-Cola was introduced in 1886, patented in 1887, registered as a trademark in 1893 and by 1895 it was being sold in every state and territory in the United States. In 1899, The Coca-Cola Company began franchised bottling operations in the United States.\n\nCoca-Cola might owe its origins to the United States, but its popularity has made it truly universal. Today, you can find Coca-Cola in virtually every part of the world.",
  "fan_count": 5425800
}
1. The JSON file from Facebook is not using the exact Open Graph Protocol terms - below is the mapping:
name => og:title
category => og:type
picture => og:image
link => og:url
2. We can reuse FOAF and DCTerms to cover some terms used in Facebook data - below is the mapping
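As a rough illustration of the first mapping (Facebook JSON keys to Open Graph Protocol terms), here is a minimal Scala sketch. It assumes the JSON has already been parsed into a flat map of strings; the `OpenGraphMapping` object, the `toTriples` helper, the simplified triple type and the choice of the Graph API URL as subject are all hypothetical and are not taken from the demo.

```scala
object OpenGraphMapping {
  // Mapping from the post: Facebook Graph API JSON key -> Open Graph Protocol term.
  val ogTerm: Map[String, String] = Map(
    "name"     -> "og:title",
    "category" -> "og:type",
    "picture"  -> "og:image",
    "link"     -> "og:url"
  )

  // (subject URI, property, literal value); a simplification for illustration only.
  type Triple = (String, String, String)

  // Hypothetical helper: keep only the fields that have an Open Graph equivalent
  // and emit one triple per field, ordered by JSON key for a stable result.
  def toTriples(subject: String, fields: Map[String, String]): List[Triple] =
    fields.toList.sortBy(_._1).collect {
      case (key, value) if ogTerm.contains(key) => (subject, ogTerm(key), value)
    }
}
```

Fields without a mapped term, such as `username`, are simply dropped by this sketch; a fuller conversion would presumably map them to FOAF or DCTerms properties as item 2 suggests.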
Li Ding@RPI April 28, 2010
Map and Territory in RDF APIs
RDF specs and APIs have made a bit of a mess out of a couple of pretty basic tools of math and computing: graphs and logic formulas. With the RDF next steps workshop coming up, Pat Hayes re-thinking RDF semantics and Sandro thinking out loud about RDF2, I'd like us to think about RDF in more traditional terms. The Scala programming language seems to be an interesting framework to explore how they relate to RDF.
The Feb 1999 RDF spec wasn't very clear about the map and the territory. It said that statements are made out of parts in the territory, rather than features on the map, which doesn't make very much sense. RDF APIs seem to inherit this confusion; e.g. from an RDF::Value class for Ruby:
Examples: Checking if a value is a resource (blank node or URI reference)
value.resource
Blank nodes and URI references are parts of the map; resources are in the territory.
Likewise in Package org.jrdf.graph:
Resource A resource stands for either a Blank Node or a URI Reference.
The 2004 RDF specs take great pains to clarify these use/mention distinctions, but they also go on at great length.
Let's review Wikipedia on graphs:
In mathematics, a graph is an abstract representation of a set of objects where some pairs of the objects are connected by links. ... The edges may be directed (asymmetric) or undirected (symmetric) ... and the edges are called directed edges or arcs; ... graphs which have labeled edges are called edge-labeled graphs.
With that in mind, in the swap-scala project, we summarize the RDF abstract syntax as an edge-labelled directed graph with just one or two wrinkles:
package org.w3.swap.rdf
trait RDFGraphParts {
  type Arc = (SubjectNode, Label, Node)
  type Node
  type Literal <: Node
  type SubjectNode <: Node
  type BlankNode <: SubjectNode
  type Label <: SubjectNode
}
The wrinkles are:
- Arcs can only start from BlankNodes or Labels, i.e. SubjectNodes
- Arc labels may also appear as Nodes
We use another trait to relate concrete datatypes to these abstract types:
trait RDFNodeBuilder extends RDFGraphParts {
  def uri(i: String): Label
  type LanguageTag = Symbol
  def plain(s: String, lang: Option[LanguageTag]): Literal
  def typed(s: String, dt: String): Literal
  def xmllit(content: scala.xml.NodeSeq): Literal
}
This doesn't pin down what a Label is, but in any concrete implementation, you can build one from a String using the uri method. The RDFNodeBuilder trait is used to implement RDF/XML, RDFa, and Turtle parsers that are agnostic to the concrete implementation of an RDF graph.
Now let's look at terms of first order logic:
The set of terms is inductively defined by the following rules:
- Variables. Any variable is a term.
- Functions. Any expression f(t1,...,tn) of n arguments (where each argument ti is a term and f is a function symbol of valence n) is a term.
This is represented straightforwardly in Scala a la:
/**
 * A Term is either a Variable or a FunctionTerm.
 */
sealed abstract class Term { ... }
class Variable extends Term { ... }
abstract class FunctionTerm() extends Term {
  def fun: Any
  def args: List[Term]
}
The core RDF doesn't cover all of first order logic; it corresponds fairly closely to the conjunctive query fragment:
The conjunctive queries are simply the fragment of first-order logic given by the set of formulae that can be constructed from atomic formulae using conjunction and existential quantification, but not using disjunction, negation, or universal quantification.
We can then excerpt just the relevant parts of the definition of formulas:
The set of formulas is inductively defined by the following rules:
- Predicate symbols. If P is an n-ary predicate symbol and t1, ..., tn are terms then P(t1,...,tn) is a formula.
- Binary connectives. If φ and ψ are formulas, then (φ → ψ) is a formula. Similar rules apply to other binary logical connectives.
- Quantifiers. If φ is a formula and x is a variable, then ∀x φ and ∃x φ are formulas.
Our Scala representation follows straightforwardly:
sealed abstract class ECFormula
case class Exists(vars: Set[Variable], g: And) extends ECFormula
sealed abstract class Ground extends ECFormula
case class And(fmlas: Seq[Atomic]) extends Ground
case class Atomic(rel: Symbol, args: List[Term]) extends Ground
Now that we have Scala representations for RDF graphs and conjunctive query formulas, how do we relate them? This is the fun part:
package org.w3.swap.rdflogic
import swap.rdf.RDFNodeBuilder
import swap.logic1.{Term, FunctionTerm, Variable}
import swap.logic1ec.{Exists, And, Atomic, ECProver, ECFormula}
/**
* RDF has only ground, 0-ary function terms.
*/
abstract class Ground extends FunctionTerm {
override def fun = this
override def args = Nil
}
case class Name(n: String) extends Ground
case class Plain(s: String, lang: Option[Symbol]) extends Ground
case class Data(lex: String, dt: Name) extends Ground
case class XMLLit(content: scala.xml.NodeSeq) extends Ground
/**
* Implement RDF Nodes (except BlankNode) using FOL function terms
*/
trait TermNode extends RDFNodeBuilder {
type Node = Term
type SubjectNode = Term
type Label = Name
def uri(i: String) = Name(i)
type Literal = Term
def plain(s: String, lang: Option[Symbol]) = Plain(s, lang)
def typed(s: String, dt: String): Literal = Data(s, Name(dt))
def xmllit(e: scala.xml.NodeSeq): Literal = XMLLit(e)
}
The abstract RDFGraphParts node types are implemented as first order logic terms. For formulas, we use a "holds" predicate:
object RDFLogic extends ... {
  def atom(s: Term, p: Term, o: Term): Atomic = {
    Atomic('holds, List(s, p, o))
  }
  def atom(arc: (Term, Term, Term)): Atomic = {
    Atomic('holds, List(arc._1, arc._2, arc._3))
  }
}
Then all the semantic machinery up to simple entailment between RDF graphs just falls out of conjunctive query.
I haven't done RDFS Entailment yet; the plan is to do basic rules first (N3rules or RIF BLD) and then use that for RDFS, OWL2-RL, and the like.
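To make the claim about simple entailment a little more concrete, here is a small self-contained sketch. It does not use the swap-scala classes; `Name`, `Var`, `solve` and `entails` are simplified stand-ins invented for this illustration. The idea it shows: graph G simply entails graph H when H, read as a conjunctive query whose blank nodes are existential variables, has at least one solution in G.

```scala
object SimpleEntailment {
  sealed trait Term
  case class Name(uri: String) extends Term // ground term (stand-in for URIs and literals)
  case class Var(id: String) extends Term   // blank node, read as an existential variable

  type Arc = (Term, Term, Term)
  type Bindings = Map[Var, Term]

  // Match one query term against one term of the data graph, extending the bindings.
  private def matchTerm(pattern: Term, target: Term, b: Bindings): Option[Bindings] =
    pattern match {
      case v: Var =>
        b.get(v) match {
          case Some(bound) => if (bound == target) Some(b) else None
          case None        => Some(b + (v -> target))
        }
      case n: Name => if (n == target) Some(b) else None
    }

  private def matchArc(p: Arc, t: Arc, b: Bindings): Option[Bindings] =
    for {
      b1 <- matchTerm(p._1, t._1, b)
      b2 <- matchTerm(p._2, t._2, b1)
      b3 <- matchTerm(p._3, t._3, b2)
    } yield b3

  // All variable assignments under which every query arc occurs in `graph`.
  def solve(query: List[Arc], graph: Set[Arc], b: Bindings = Map.empty): List[Bindings] =
    query match {
      case Nil => List(b)
      case arc :: rest =>
        graph.toList
          .flatMap(t => matchArc(arc, t, b).toList)
          .flatMap(b1 => solve(rest, graph, b1))
    }

  // g simply entails h when h, as an existential conjunctive query, has a solution in g.
  def entails(g: Set[Arc], h: List[Arc]): Boolean = solve(h, g).nonEmpty
}
```

This is a naive backtracking search, fine for small graphs; a real prover would presumably index the arcs rather than scan the whole graph for each query arc.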
Sameas Network
A sameas network is a network of URIs that are interconnected by the owl:sameAs relation. It is an interesting network because it is not a conventional social network, but rather a socially contributed directed graph connecting “equivalent” identities.
Our recent study [1] crawls sameas networks following linked data principles: starting from a given seed URI, we dereference the URI and recursively fetch URIs linked by owl:sameAs. We used a fairly small seed set of New York Times URIs (100 people, 100 locations and 100 organizations) and got 300 sameas networks. Please come to the WebSci Poster Session today (April 26, 2010) to see more discussions.
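The crawl described above can be sketched as a breadth-first traversal. This is only an illustration, not the code used in the study: `sameAsLinks` is a hypothetical function standing in for "dereference the URI and extract the owl:sameAs targets", and `maxUris` is an assumed safety limit.

```scala
object SameAsCrawler {
  // Collect every URI reachable from `seed` by repeatedly following owl:sameAs links.
  // `sameAsLinks` abstracts the HTTP dereferencing step so the sketch stays testable.
  def crawl(
      seed: String,
      sameAsLinks: String => Set[String],
      maxUris: Int = 1000
  ): Set[String] = {
    @annotation.tailrec
    def loop(frontier: List[String], seen: Set[String]): Set[String] =
      frontier match {
        case Nil                       => seen
        case _ if seen.size >= maxUris => seen // stop once the assumed limit is reached
        case uri :: rest =>
          // Only URIs not seen before are queued, so each URI is dereferenced once.
          val fresh = sameAsLinks(uri).diff(seen)
          loop(rest ++ fresh, seen ++ fresh)
      }
    loop(List(seed), Set(seed))
  }
}
```

Passing the link lookup in as a function means a non-dereferenceable URI can simply return an empty set, which matches the observation that not every URI in a network can be fetched.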
Results
The average size of a sameas network is 22, and one of the largest networks has 58 URIs with 1249 sameas arcs. Not all URIs are dereferenceable, and the dereferenceable ones may be described by anywhere from 1 to over a thousand triples.
The following are some interesting breaking observations, confirmed in several plotted sample sameas networks (they are breaking because they have not even been printed in our poster yet).
- New York Times (NYT) and DBpedia have different preferences on mutual sameas relations. It is interesting to see that NYT connects its numerical URI to a non-numeric URI in Freebase.
- Many DBpedia URIs were connected not within DBpedia, but by Freebase. In DBpedia, the “dbpprop:redirect” property was used to connect equivalent URIs.
- Wrong links were introduced by Freebase: dbpedia:Paul_Allen was linked to dbpedia:Paul_Allen’s_House.
Paul Allen and his House (People NYT)
Arctic (Location NYT)
Discussion
- A lot of URIs do not carry information or just redirect (see my paper), so it would be useful to skip these URIs to reduce the cost of linked data exploration. We can further reduce the cost of loading sameAs URIs.
- The quality of sameas links is a big concern; the legitimate use of Freebase sameas relations is debatable.
Comments from Tim Berners-Lee: let’s leverage semantics - we can look into the semantic annotations (e.g. rdf:type) of the URI being described to automatically infer potential bad data integration. Paul Allen’s House will then be knocked out, its type being “house”.
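One way to read this suggestion is as a type-compatibility check over sameas links. The sketch below is only an assumption about how such a check might look: the `typesOf` lookup, the table of disjoint type pairs and the type names used with it are hypothetical, and in practice disjointness would have to come from an ontology.

```scala
object SameAsTypeCheck {
  type Link = (String, String) // (URI, URI) connected by owl:sameAs

  // Flag links whose two ends carry types that are declared disjoint.
  def suspicious(
      links: Set[Link],
      typesOf: String => Set[String],
      disjoint: Set[(String, String)]
  ): Set[Link] =
    links.filter { case (a, b) =>
      val typePairs = for (ta <- typesOf(a); tb <- typesOf(b)) yield (ta, tb)
      // Disjointness is symmetric, so check the pair in both orders.
      typePairs.exists { case (ta, tb) =>
        disjoint.contains((ta, tb)) || disjoint.contains((tb, ta))
      }
    }
}
```

A link is only flagged when both ends have type information, so untyped URIs pass through unchecked in this sketch.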
[1] Ding, Li and Shinavier, Joshua and Finin, Tim and McGuinness, Deborah L. (2010) An Empirical Study of owl:sameAs Use in Linked Data. In: Proceedings of the WebSci10: Extending the Frontiers of Society On-Line, April 26-27th, 2010, Raleigh, NC: US. http://journal.webscience.org/403/
Li Ding @ RPI April 26, 2010
A Dynamic Web Of Data
As a matter of fact things change – the Web of Data is no exception in that respect. While some sources, such as Twitter, are intrinsically dynamic, others change every now and then, potentially in unforeseeable intervals. In the recent Talis Nodalities Magazine, we made a case for Keeping up with a LOD of changes; here I’m going to elaborate a bit more on the current state of Dataset Dynamics and its challenges.
Let us first step back a bit and look at what Dataset Dynamics are and why they are important. In the Web of Linked Data we typically deal with datasets, for example, from the biomedical domain or the media industry on the one hand, and entities, such as a certain protein or people, on the other. For the entity-level case, established HTTP caching mechanisms can be leveraged (see the Caching Tutorial and Things Caches Do). Further, with Memento, an HTTP-based versioning mechanism has been proposed as well as implemented, adding a “time dimension” to HTTP (see Fig. 1).
Fig. 1 Memento Framework (Source: "An HTTP-Based Versioning Mechanism for Linked Data" Herbert Van de Sompel, Robert Sanderson, Michael Nelson, Lyudmila Balakireva, Harihar Shankar, Scott Ainsworth, LDOW 2010)
Dataset-level changes
However, tackling dataset-level changes is a rather new field with no agreed-upon, let alone standardised, solution at hand. The main problem is that a dataset typically talks about many thousands to millions of distinct entities, which makes it impractical to apply entity-level solutions for a range of use cases, such as link maintenance or replication (see also Fig. 2).
Fig. 2 Change frequency vs. change volume
I often hear these days: “it seems there is no solution for handling of dataset-level changes”; nevertheless, I think quite the opposite is true. There are plenty of proposed solutions from both academia and practitioners, targeting different challenges in the areas of:
- Change discovery – how do I find out about dataset changes?
- Propagating changes - if there is a change, how is the change communicated to a consumer?
- Change semantics – how do I learn what has changed (has been added, removed, etc.)?
Some proposals on the table are integrated approaches (such as DSNotify, SemanticPingback, Talis Changeset) while others focus on certain aspects (like the dady vocabulary for discovery or the Graph Update Ontology for change semantics) or deal with concrete environments, for example sparqlPuSH for SPARQL endpoints.
A Dataset Dynamics Manifesto
No matter which (set of) solutions the community eventually agrees on to address the handling of dataset-level changes, it should adhere to the following principles:
- light-weight
- distributed and scalable
- standards-based
Obviously, a light-weight (and ideally RESTful) approach lowers the barriers to adoption and enables a quick uptake. When I say light-weight, I mean it both in terms of protocol and code. It should be easy to integrate in RDF stores and libraries and available in all common Web programming languages including but not limited to Java, PHP, .NET family, etc.
Just as the Web of Data is a globally distributed dataspace, handling of changes should be done in a distributed fashion. There will be many different publishers and consumers (such as agents, indexer, consolidator platforms, etc.) of datasets with different requirements and capabilities. A distributed approach can cope with this challenge in a cost- and performance-efficient way. Tightly connected to this: It has to scale. Today, we’re dealing with some hundreds of LOD datasets. In the next couple of years, this will likely explode into the millions and hence one needs to be able to deal with such a growth. The same, just sooner, is true for the number of consumers of the changes.
Last but not least, the Dataset Dynamics solution should be based on standards. It doesn’t necessarily need to be RDF for all of the challenges outlined above. For example, Atom offers a standardised, extensible and widely accepted format to propagate changes; to take this further, Pubsubhubbub can be utilised to enable a standardised, distributed publisher-subscriber scheme (Fig. 3).
Fig. 3 Pubsubhubbub - a standard-based, distributed publisher-subscriber-hub system (Source: http://docs.google.com/present/view?id=ajd8t6gk4mh2_34dvbpchfs)
As I’ve outlined above, it might still be too early for a conclusion on how to deal with dataset-level changes. However, people interested in this area have gathered already in the Dataset Dynamics group where solutions are discussed and implemented, potentially leading to a W3C standardisation work.
As an aside: in case you’re at the WWW2010 in Raleigh (NC, USA) these days, you may want to join the break-out meeting on Dataset Dynamics during the W3C Linked Open Data track on 29 April 2010.
(This blog post was written by Michael Hausenblas)
Three principles for building government dataset catalog vocabulary
There is ongoing interest in vocabularies for government dataset publishing. There are a number of proposals such as DERI dcat, Sunlight Lab’s guidelines and RPI’s proposal on Data-gov Vocabulary. Based on our experience with data.gov catalog data, we found the following principles useful for consolidating the vocabulary building process and potentially bringing consensus:
- 1. modular vocabulary with minimal core
  - keep the core vocabulary small and stable; only include a small set of frequently used (or required) terms
  - allow extensions contributed by anyone. Extensions should be connected to the core ontology, and it should be possible to promote them to core status later.
- 2. choice of term
  - make it easy for curators to produce metadata using the term, e.g. do they need to specify data quality?
  - make the expected range of the term clear, e.g. should they use “New York” or “dbpedia:New_York” for spatial coverage? Does it require a controlled vocabulary? A validator would be very helpful
  - make the expected use of the term clear, e.g. can it be displayed in a rich snippet? Can it be used in SPARQL queries, search or facet browsing?
  - try to reuse a term from an existing popular vocabulary
  - identify the required, recommended, and optional terms
- 3. best practices for actual usage
  - we certainly want the metadata to be part of linked data, but that is not the end. We would like to see the linked data actually being used by users who don’t know much about the semantic web.
  - we should consider making the vocabulary available in different formats for a wider range of users, e.g. RDFa, Microformat, ATOM, JSON, XML Schema, OData
  - we should build use cases, tools and demos that exhibit the use of the vocabulary to promote adoption
Comments are welcome.
Li Ding @ RPI
RDFa 1.1 version of the pyRdfa distiller
W3C Invites Comments on First Drafts of RDFa Core 1.1, XHTML+RDFa 1.1
The RDFa Working Group has published the First Public Working Drafts of RDFa Core 1.1 and XHTML+RDFa 1.1. RDFa Core is a specification for attributes to express structured data in any markup language. XHTML+RDFa 1.1 is an application of RDFa Core 1.1 for XHTML. Both of these documents are expected to supersede the RDFa in XHTML (RDFa 1.0) specification. Together, these specifications enable the human-readable and machine-readable markup of people, places, events, products, recipes, social networks, and many other concepts that are frequently published on the web. These documents improve upon RDFa 1.0 by adding a number of Web community requested features to ease authoring. Comments are welcome and should be sent to the public-rdfa-wg@w3.org mailing list (see also the archives).
Facebook adopts RDFa
Yesterday Facebook announced Opengraph:
The Open Graph protocol enables you to integrate your web pages into the social graph. It is currently designed for web pages representing profiles of real-world things — things like movies, sports teams, celebrities, and restaurants. Once your pages become objects in the graph, users can establish connections to your pages as they do with Facebook Pages. Based on the structured data you provide via the Open Graph protocol, your pages show up richly across Facebook: in user profiles, within search results and in News Feed.
Opengraph uses the Open Graph Protocol, which uses RDFa and “enables any web page to become a rich object in a social graph. For instance, [...] it enables any web page to have the same functionality as a Facebook Page.”
Initial publishers using Open Graph Protocol include IMDb, Microsoft, NHL, Posterous, Rotten Tomatoes, TIME, and Yelp.
Social Semantic Web dawning?
Facebook — Open Graph — Semantic Search
Alex Wilhelm from The Next Web writes:
There is data outside of Facebook that the company wants to be brought in and made relevant inside of the Facebook platform. Enter the Open Graph protocol, Facebook’s way to say, in the common tongue ”all your graph are belong to Zuck.”
The product combines graphs, be they music graphs from Pandora or what have you, into the Facebook wider social graph. You can think of it has a “knit-up” with Facebook for other websites that are not Facebook affiliated.
Nick O’Neill from AllFacebook:
If HTML is the way developers get information into Google’s search engine, meta data is the way developers will get data into Facebook’s semantic search engine which will be based on the company’s “Open Graph”. Through the use of easy to implement plugins, Facebook is rapidly collecting structured data on every user. Facebook has also upgraded their API to make building on top of the Open Graph a much easier process. What’s pretty clear is that it’s an attempt to tackle the residing search giant.
[...] As Mark Zuckerberg said on stage an hour ago, by the end of the day Facebook should have more than 1 billion likes and that data will grow exponentially.
[...] There are a number of standards that have been created in the past as some developers have pointed out, microformats being the most widely accepted version, however the reduction of friction for implementation means that Facebook has a better shot at more quickly collecting the data. The race is on for building the semantic web and now that developers and website owners have the tools to implement this immediately.
Semantic web and Drupal video
Two years ago at DrupalCon Boston, I declared that we should embrace the semantic web, and that, as a first step, we should add RDFa support to Drupal core. Since then, I've written extensively about the importance of semantic technologies in Drupal, and how I believe Drupal can play an important role in helping to bootstrap the semantic web.
Drupal 7, the next major release of Drupal, will ship with RDFa support directly "out of the box." To help people understand what is possible with RDFa support and how it enables us to do cool new things, take a look at the video below. I showed this video in my DrupalCon San Francisco keynote, two years after my initial RDFa challenge at DrupalCon Boston.
There are many other things we can build on top of this core support, but this is a start. I'm personally very excited to see this vision being realized, and I'm very thankful to all the people that helped make this possible. Kudos to Lin Clark from DERI, NUI Galway for her work on building the demo and recording the screencast, and to Stéphane Corlosquet from MIND Informatics at Mass General for leading the RDF in Drupal 7 efforts. The newly launched Semantic Drupal website contains articles, video tutorials and news on building Linked Data sites with Drupal 7.
Talking with Pro Tsiavos
Opening up data in the scientific community is something that’s become increasingly important, and getting the licensing and rights in order is a matter of urgent attention. With more and more researchers needing more and more data, there is an increased need for there to be clear information on what can be done with which data.
I spoke with LSE Research Fellow Dr. Prodromos (Pro) Tsiavos, who is researching open licensing, data sharing and publishing data in the public sector and who is also a Legal Project Lead for Creative Commons England and Wales. Pro’s main focus has been with the Cultural commons, and we discussed culture-shifts in science, and how the expectation that data are shared and licensed changes the way research may be done in future.
SPARQL + pubsubhubbub = sparqlPuSH
There has been lots of discussion recently regarding dynamics and notification in the Semantic Web realm, including various vocabularies for describing changes and approaches for notifying them - as Leigh recently blogged about.
Last month, while visiting Kno.e.sis, Pablo and I worked on an approach using pubsubhubbub for RDF change notification, which I'm happy to announce today.
The result is sparqlPuSH, an interface that can be plugged on any SPARQL endpoint and that broadcasts notifications to clients interested in what's happening in the store using the pubsubhubbub protocol. At a glance, anyone can register a particular query to the RDF store (e.g. list all microblog posts, or list any changes made by X, using the Changesets vocabulary) and results are provided in an RSS / Atom feed that is then sync-ed using pubsubhubbub: each time new data corresponding to the registered query is added into the store, the store itself notifies the interested parties of such updates.
Practically, this means that you can be notified in real-time of
any change happening in a SPARQL endpoint.
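The register-then-notify flow can be illustrated with a small in-memory sketch. This is not sparqlPuSH's implementation and it does not speak the pubsubhubbub protocol (which relies on HTTP callbacks and a hub); the `Store` class is hypothetical, a "query" is reduced to a predicate over triples, and a subscriber is reduced to a callback.

```scala
object PushStoreSketch {
  type Triple = (String, String, String)

  final class Store {
    private var triples = Set.empty[Triple]
    // Each registration pairs a "query" (here just a predicate) with a subscriber callback.
    private var registrations = List.empty[(Triple => Boolean, Set[Triple] => Unit)]

    def register(query: Triple => Boolean)(callback: Set[Triple] => Unit): Unit =
      registrations = (query, callback) :: registrations

    // Add data, then push only the genuinely new matching triples to each subscriber.
    def add(incoming: Set[Triple]): Unit = {
      val fresh = incoming.diff(triples)
      triples = triples ++ fresh
      for ((query, callback) <- registrations) {
        val hits = fresh.filter(query)
        if (hits.nonEmpty) callback(hits)
      }
    }
  }
}
```

The point of the sketch is the direction of control: the store notifies subscribers when matching data arrives, rather than subscribers polling the endpoint with the same query.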
The following video describes how the approach works and shows a related use-case, and you can download its source at http://code.google.com/p/sparqlpush/.
It can be used as an interface on top of any SPARQL endpoint and also comes with an ARC2 interface (if you're using a different endpoint, the interactions happen via HTTP, which requires that your endpoint provide SPARQL query results in JSON).
We believe that a push system like this for RDF notification can change lots of things regarding RDF data management and how to make sense of it, in real-time. In addition, we hope that such an approach could be generalised not only to SPARQL endpoints, but to resources themselves, so that a resource can ping a pubsubhubbub hub when it changes, with the notifications then being broadcast to interested parties.
Data 3.0 (a Manifesto for Platform Agnostic Structured Data) Update 3
After a long period of trying to demystify and unravel the wonders of standards compliant structured data access, combined with protocols (e.g., HTTP) that separate:
- Identity,
- Access,
- Storage,
- Representation, and
- Presentation.
I ended up with what I can best describe as the Data 3.0 Manifesto. A manifesto for standards-compliant access to structured data object (or entity) descriptors.
Some Related Work
Alex James (Program Manager Entity Frameworks at Microsoft) put together something quite similar to this via his Base4 blog (around the Web 2.0 bootstrap time); sadly -- quoting Alex -- that post has gone where discontinued blogs and their host platforms go (deep deep irony here).
It's also important to note that this manifesto is a variant of TimBL's Linked Data Design Issues meme re. Linked Data, but totally decoupled from RDF (data representation formats aspect) and SPARQL which -- in my world view -- remain implementation details.
Anyway, here's the Data 3.0 manifesto:
- An "Entity" is the "Referent" of an "Identifier."
- An "Identifier" SHOULD provide a global, unambiguous, and unchanging (though it MAY be opaque!) "Name" for its "Referent".
- A "Referent" MAY have many "Identifiers" (Names), but each "Identifier" MUST have only one "Referent".
- Structured Entity Descriptions SHOULD be based on the Entity-Attribute-Value (EAV) Data Model, and SHOULD therefore take the form of one or more 3-tuples (triples), each comprised of:
- an "Identifier" that names an "Entity" (i.e., Entity Name),
- an "Identifier" that names an "Attribute" (i.e., Attribute Name), and
- an "Attribute Value", which may be an "Identifier" or a "Literal".
- Structured Descriptions SHOULD be CARRIED by "Descriptor Documents" (i.e., purpose specific documents where Entity Identifiers, Attribute Identifiers, and Attribute Values are clearly discernible by the document's intended consumers, e.g., humans or machines).
- Structured Descriptor Documents can contain (carry) several Structured Entity Descriptions
- Structured Descriptor Documents SHOULD be network accessible via network addresses (e.g., HTTP URLs when dealing with HTTP-based Networks).
- An Identifier SHOULD resolve (de-reference) to a Structured Representation of the Referent's Structured Description.
- Referent, Identifier, and Descriptor/Sense (The Data Perception Trinity) illustration
- Referent, Identifier, and Descriptor/Sense Trinity (as exploited in FOAF+SSL based Secure WebIDs) illustration
- Demystifying Linked Data via EAV Model based Structured Descriptions
- What do people have against URIs and URLs?
- The URI, URL, and Linked Data Meme's Generic HTTP URI
- Simple Explanation of RDF and Linked Data Dynamics
- Linked Data and Identity
- FOAF+SSL FAQ
- LOD Community Thread (showing evolution of this manifesto based on feedback from members such as Richard Cyganiak).
- Googlebase Data API Docs
- Google Data Protocol (GData)
- Microsoft's OData Protocol
- Magic of De-referencable Names and actual Data via Binky Video
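As a small illustration of the Entity-Attribute-Value triples described in the manifesto above, here is a Scala sketch of the data model. The type and function names (`Identifier`, `Literal`, `Triple`, `descriptions`) are invented for this example; the manifesto itself is deliberately agnostic about representation.

```scala
object EavSketch {
  // An attribute value is either an Identifier (a name for some referent) or a Literal.
  sealed trait Value
  case class Identifier(name: String) extends Value
  case class Literal(text: String) extends Value

  // One 3-tuple: entity name, attribute name, attribute value.
  case class Triple(entity: Identifier, attribute: Identifier, value: Value)

  // A descriptor document may carry several entity descriptions;
  // grouping by entity name recovers one description per entity.
  def descriptions(document: List[Triple]): Map[Identifier, List[Triple]] =
    document.groupBy(_.entity)
}
```

Making `Value` a sealed type with exactly two cases mirrors the manifesto's statement that an attribute value may be an identifier or a literal, and nothing else.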
RDF Dataset Notifications
Like many people in the RDF community I’ve been thinking about the issue of syndicating updates to RDF datasets. If we want to support truly distributed aggregation and processing of data then we need an efficient way to share updates.
There’s been a lot of experimentation around different mechanisms, and PubSubHubbub seems to be a current favourite approach. I’ve been playing with it myself recently and have hacked up a basic push mechanism around Talis Platform stores. More on that another time.
But I’ve not yet seen any general discussion about the merits of different approaches, or even discussion about what it is that we really want to syndicate.
So let’s take it from the top.
It seems to me that there’s basically three broad categories of information we want to syndicate:
- Dataset Notifications — has a new dataset been added to a directory? has one been updated in some way, e.g. through the addition or removal of triples?
- Resource Notifications — what resources have been added or modified within a dataset?
- Triple Notifications — what triples have been changed within a dataset?
Each one of these categories is syndicating a different level of detail and may benefit from a different technical approach. For example there’s a different volume of information being exchanged if one is simply notifying dataset changes vs every triple. We’ll also likely need a different format or syntax.
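To show how the three levels of detail relate, here is a minimal sketch that derives each kind of notification from a before/after pair of triple sets. The type names and the simplification of a resource to "the subject of a changed triple" are assumptions made for illustration, not part of any of the technologies discussed here.

```scala
object ChangeNotifications {
  type Triple = (String, String, String)

  sealed trait Notification
  case class DatasetChanged(dataset: String) extends Notification
  case class ResourceChanged(resource: String) extends Notification
  case class TripleAdded(triple: Triple) extends Notification
  case class TripleRemoved(triple: Triple) extends Notification

  // Finest level: one notification per added or removed triple.
  def tripleLevel(before: Set[Triple], after: Set[Triple]): Set[Notification] = {
    val added: Set[Notification]   = after.diff(before).map(t => TripleAdded(t): Notification)
    val removed: Set[Notification] = before.diff(after).map(t => TripleRemoved(t): Notification)
    added ++ removed
  }

  // Middle level: one notification per subject whose triples changed.
  def resourceLevel(before: Set[Triple], after: Set[Triple]): Set[Notification] = {
    val changed = after.diff(before) ++ before.diff(after)
    changed.map(t => ResourceChanged(t._1): Notification)
  }

  // Coarsest level: at most one notification for the whole dataset.
  def datasetLevel(dataset: String, before: Set[Triple], after: Set[Triple]): Option[Notification] =
    if (before == after) None else Some(DatasetChanged(dataset))
}
```

The coarser levels are computed from the same diff but discard detail, which is one way to see why the volume of information exchanged differs so much between them.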
Actually there may be a fourth category: notifications of graph structural changes to a dataset, e.g. adding or removing named graphs. I’ve not yet seen anyone exploring that level of syndication, but suspect it may be very useful.
Now, for each of those different categories, there are two different styles of notifications: push or pull. Pull mechanisms are typified by feed subscriptions, crawlers, or repeated queries of datasets. Push mechanisms are usually based on some form of publish-subscribe system.
Given those different scenarios, we can take a look at some existing technologies and categorise them. I’ve done just that and published a simple Google spreadsheet with my first stab at this analysis. (This probably needs a little more context in places but hopefully the classifications are fairly obvious).
PubSubHubbub seems to offer the most flexibility in that it mixes a standard Pull based Feed architecture with a Push based subscription system. Clearly worthy of the attention it’s getting. Other technologies offer similar features but are optimised for different purposes.
However that doesn’t mean that PubSubHubbub is just perfect out of the box. For example it’s worth noting that consumers aren’t required to use the Push aspects of the system, they can just subscribe to the feeds. So you need to be prepared to scale a PubSubHubbub system just as you would a Pull based Feed.
It may also be sub-optimal for systems which are syndicating out high-volume Triple level updates. The Feeds can potentially get very large and the hub system needs to be prepared to handle large exchanges. It also doesn’t say anything about how to catch-up or recover from missed updates. A hybrid approach may be required to cover for all use cases and scenarios and to produce a robust system.
In order to be able to properly compare different approaches we need to understand their respective trade-offs. I’m hoping this posting contributes to that discussion and can complement the ongoing community experimentation.
Am interested to hear your thoughts.
Abecedary
DC-2010 Call for Papers now closed
Sören Auer: “Establishing a network effect around linked data is the most important R&D goal for the near future.”
Leipzig is one of Germany’s Semantic Web hotspots. From May 5-6, 2010 the annual Semantic Web Day provides the opportunity to catch up with latest developments especially in the domain of Linked Data and the foundation of the German chapter of the Open Knowledge Foundation. Organizer Sören Auer gave us some background information.
From May 5 – 6, 2010 the 3rd Semantic Web Day in Leipzig will take place. What will be this year’s topics? Who should attend?
The Semantic Web Day is targeting IT people, software developers, decision makers and users interested in learning about the potential of semantic technologies. The language during the event is German, so primarily Austrians, Swiss and Germans will attend. Besides semantic technologies, a particular focus of this year’s event is open data in governments, public administrations and science. Although the programme is not yet finalized, we have already compiled an interesting number of talks and presentations including talks about the open biodiversity database Fishbase, the European Digital Library Europeana, a Linked Data project of the German Umweltbundesamt, use case presentations in the pharma, publishing and telecommunication industries and many more (cf. http://aksw.org/LSWT). Also, in addition to AKSW, the Topic Maps Lab and the Web Data Integration Labs from Universität Leipzig will be present at LSWT.
One of the highlights of this year’s Semantic Web Day is the official institutionalization of the German Chapter of the Open Knowledge Foundation. How did this come about? What does this mean for the OKF as a whole?
OKFN started its work in 2006 and has since successfully completed a number of projects facilitating open knowledge. In particular, the Comprehensive Knowledge Archive Network (CKAN), the OKCon conference series, the open knowledge definition and, recently, OKFN’s involvement in the launch of data.gov.uk are prominent examples of OKFN’s successful work. However, many of the OKFN activities were primarily driven by an active group of volunteers in the UK. With the official launch of the German OKFN branch we will strengthen the international dimension of OKFN’s work. Especially for Germany, where data privacy and security are perceived to be most important, raising awareness for enabling open, standards-compliant access to public information will be an important target of OKFN’s activities.
The InFAI has become one of the hotspots in Semantic Web development in Germany over the past few years. What are you working on at the moment? What are the most interesting research and development aspects for the near future?
From our point of view, establishing a network effect around the publishing and use of linked data is the most important research and development goal for the near future. We just completed a first draft and implementations of a semantically enabled pingback method (http://aksw.org/Projects/SemanticPingBack), which applies to linked data endpoints a peer notification mechanism similar to the one widely deployed in the blogosphere. Other important research issues we are tackling with our partners are closing the performance gap between RDF and relational data management, increasing the coherence and quality of linked data, and providing adaptive user interfaces for authoring and maintaining information on the data web.
About Sören Auer
Dr. Sören Auer leads the research group Agile Knowledge Engineering and Semantic Web (AKSW) at University of Leipzig. His research interests include Semantic Web technologies, knowledge representation, engineering and management, agile methodologies as well as databases and information systems. Sören is founder or co-founder of several high-impact research and community projects such as the Wikipedia semantification project DBpedia, the open-source innovation platform Cofundos.org and the social Semantic Web toolkit OntoWiki. Sören is the author of over 50 peer-reviewed scientific publications, co-organiser of several workshops, and chair of the Social Semantic Web conference 2007 and I-Semantics 2008. He serves as an expert for industry, the European Commission and the W3C, and is a member of the advisory board of the Open Knowledge Foundation.
Could having two RDF-in-HTMLs actually be handy?
Another example is something I've been working on during the last few months: I somehow managed to combine essential parts of Paggr (a drag&drop portal system based on RDF- and SPARQL-based widgets) with an RDF CMS (to be open-source, I'm currently looking for pilot projects). And although I decided to switch entirely to Microdata for semantic markup after exploring it during the FanHubz project, I wonder if there might be room for having two separate semantic layers in this sort of widget-based website. Here is why:
As mentioned, I've taken a widget-like approach for the CMS. Each page section is a resource on its own that can be defined and extended by the web developer, styled by themers, and re-arranged and configured by the webmaster. In the RDF CMS context, widgets can easily integrate remote data, and when the integrated information is exposed as machine-readable data in the front-end, we can get beyond the "just-visual" integration of current widget pages and bring truly connectable and reusable information to the user interface.
Ideally, both the widgets' structural data and the content can be re-purposed by other apps. Just like in the early days of the Web, we could re-introduce a copy & paste culture of things for people to include in their own sites. With the difference that RDF simplifies copy-by-reference and source attribution. And both developers and end-users could be part of the game this time.
Anyway, one technical issue I encountered is when you have a page that contains multiple page items, but describes a single resource. With a single markup layer (say Microdata), you get a single tree where the context of the hierarchy is constantly switching between structural elements and content items (page structure -> main content -> page layout -> widget structure -> widget content). If you want to describe a single resource, you have to repeatedly re-introduce the triple subject ("this is about the page structure", "this is about the main page topic"). The first screenshot below shows the different (grey) widget areas in the editing view of the CMS. In the second screenshot, you can see that the displayed information (the marked calendar date, the flyer image, and the description) in the main area and the sidebar is about a single resource (an event).
Trice CMS editing view
Trice CMS page view with inline widgets describing one resource
If I used two separate semantic layers, e.g. RDFa for the content (the event description) and Microdata for the structural elements (column widths, widget template URIs, widget instance URIs), I could describe the resource and the structure without repeating the event subject in each page item.
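To make the subject-repetition point a bit more concrete, here is a minimal Python sketch. It is a toy model only, not the CMS code and not a real Microdata or RDFa parser: the `Element` class, the `extract` and `count_declarations` helpers, the layer labels `"md"` and `"rdfa"`, the example.org URIs, and all property names and values are made up for illustration. The only assumption taken from the text is that each markup layer resolves a property against the nearest enclosing subject declared in that same layer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

@dataclass
class Element:
    """Toy page element (hypothetical model, not a real DOM node)."""
    # layer label -> subject URI declared on this element for that layer
    subjects: Dict[str, str] = field(default_factory=dict)
    # (layer label, property, value) statements made on this element
    props: List[Tuple[str, str, str]] = field(default_factory=list)
    children: List["Element"] = field(default_factory=list)

def extract(el: Element, layer: str, current: Optional[str] = None) -> List[Triple]:
    """Collect triples for one layer; a property attaches to the nearest
    enclosing subject declared in that same layer."""
    subject = el.subjects.get(layer, current)
    triples = [(subject, p, v) for (lyr, p, v) in el.props
               if lyr == layer and subject is not None]
    for child in el.children:
        triples.extend(extract(child, layer, subject))
    return triples

def count_declarations(el: Element, layer: str, uri: str) -> int:
    """Count how often `uri` has to be declared as a subject in `layer`."""
    own = 1 if el.subjects.get(layer) == uri else 0
    return own + sum(count_declarations(c, layer, uri) for c in el.children)

# All URIs and values below are invented for the example.
PAGE = "http://example.org/page"
MAIN = "http://example.org/widget/main"
SIDE = "http://example.org/widget/sidebar"
EVENT = "http://example.org/event/1"

# One layer: widget subjects interrupt the context, so the event subject
# has to be re-introduced inside every widget that shows event data.
single = Element(subjects={"md": PAGE}, children=[
    Element(subjects={"md": MAIN}, props=[("md", "template", "event-detail")],
            children=[Element(subjects={"md": EVENT},
                              props=[("md", "description", "An example event")])]),
    Element(subjects={"md": SIDE}, props=[("md", "template", "calendar")],
            children=[Element(subjects={"md": EVENT},
                              props=[("md", "startDate", "2010-01-01")])]),
])

# Two layers: structure lives in "md", content in "rdfa". The event subject
# is declared once at the page root and is inherited through the widgets.
double = Element(subjects={"md": PAGE, "rdfa": EVENT}, children=[
    Element(subjects={"md": MAIN},
            props=[("md", "template", "event-detail"),
                   ("rdfa", "description", "An example event")]),
    Element(subjects={"md": SIDE},
            props=[("md", "template", "calendar"),
                   ("rdfa", "startDate", "2010-01-01")]),
])

print(count_declarations(single, "md", EVENT))
print(count_declarations(double, "rdfa", EVENT))
```

In this toy model, both variants yield the same two triples about the event. The single-layer page has to declare the event subject twice (once per widget), while the two-layer page declares it once, at the cost of running two extraction passes over the same tree.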
To be honest, I'm not sure yet if this is really a problem, but I thought writing it down could kick off some thought processes (which now tend towards "No"). Keeping triples as stand-alone-ish as possible may actually be an advantage (even if subject URIs have to be repeated). No semantic markup solution so far provides full containment for reliable copy & paste, but explicit subjects (or "itemid"s in Microdata-speak) could bring us a little closer.
Conclusions? Err.., none yet. But hey, did you see the cool CMS screenshots?