Thursday, June 09, 2011

Microdata and RDFa in TopBraid Composer

The next release of TopBraid Composer will include comprehensive support for editing and processing schema.org Microdata, and will also have improved support for RDFa. TopBraid is an extension of Eclipse and thus inherits a lot of goodness from the platform, including a very nice HTML editor. Extending TopBraid with native support for those Web Data formats was both straightforward and highly desirable. Here is a preview of what it will look like.

Working with Microdata and RDFa

When I started exploring Microdata for my own web site, I created a new Eclipse project within TopBraid Composer containing the HTML, CSS and image files for the site.

While I was adding the Microdata tags to the HTML documents, I quickly discovered that RDF-based tooling can be extremely helpful for making sure that the published metadata is consistent and of good quality. For example, data about entities (such as the http://schema.org/Person about myself) is split across multiple HTML pages: the front page contains my address, but my personal page contains information about my children. In such cases it is important that both pages use the same identifiers for the same Linked Data entities. This becomes even more important if we want to link to external standard vocabularies, such as ontologies about units, countries or product categories.
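To make the identifier problem concrete, here is a minimal sketch in plain Python (standard library only, not TopBraid code) that collects the itemid values used on each page and reports which identifiers are shared between pages. The page contents, the file names and the http://example.org/me identifier are hypothetical, and the helper names are made up for this illustration.

```python
from html.parser import HTMLParser


class ItemCollector(HTMLParser):
    """Collects the itemid of every Microdata item (element with itemscope)."""

    def __init__(self):
        super().__init__()
        self.item_ids = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemscope" in attributes:
            # itemid is optional in Microdata; items without one yield None.
            self.item_ids.append(attributes.get("itemid"))


def ids_on_page(html):
    """Return the set of non-empty itemid values found in one HTML document."""
    collector = ItemCollector()
    collector.feed(html)
    return {item_id for item_id in collector.item_ids if item_id}


def shared_ids(pages):
    """Return the identifiers that appear on more than one page."""
    seen, shared = set(), set()
    for html in pages.values():
        ids = ids_on_page(html)
        shared |= ids & seen
        seen |= ids
    return shared


# Hypothetical pages: both describe the same person under the same identifier.
PAGES = {
    "index.html": (
        '<div itemscope itemtype="http://schema.org/Person" '
        'itemid="http://example.org/me"><span itemprop="name">Me</span></div>'
    ),
    "personal.html": (
        '<div itemscope itemtype="http://schema.org/Person" '
        'itemid="http://example.org/me">'
        '<div itemprop="children" itemscope itemtype="http://schema.org/Person">'
        '<span itemprop="name">Child</span></div></div>'
    ),
}

print(shared_ids(PAGES))
```

If one of the two pages used a slightly different identifier (for instance with a trailing slash), the result would be empty, and the two descriptions would no longer refer to the same entity.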

Linked Web Data is much more useful than isolated data snippets on individual pages.

As a result of this, I introduced the notion of Web Data Sites into TopBraid Composer - collections of pages in the same folder and its sub-folders. Right-click on the project above and select New > Microdata Site File (or RDFa Site File). This opens a wizard with an option for default ontologies to include. For Microdata this is obviously the schema.org namespace, but any other RDF vocabulary can be added later:

This creates a site file (*.mds) that acts as a placeholder for all RDF triples on the HTML pages within the same folder and its sub-folders. The site file can be opened like any other RDF data source, imported into other data models, and so on. When opened, it scans the HTML files and then automatically stays up to date whenever the data in the HTML files changes.
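The folder-scanning idea can be illustrated with a short Python sketch. This is not how the site file is implemented in TopBraid (the post does not say); it only assumes that a "site" is every .html file under one root folder, and it counts Microdata items per file instead of extracting triples. The function names and the demo file layout are hypothetical.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class ScopeCounter(HTMLParser):
    """Counts elements that carry the Microdata itemscope attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if any(name == "itemscope" for name, _ in attrs):
            self.count += 1


def scan_site(root):
    """Map each .html file under root (recursively) to its number of items."""
    root = Path(root)
    result = {}
    for path in sorted(root.rglob("*.html")):
        counter = ScopeCounter()
        counter.feed(path.read_text(encoding="utf-8"))
        result[path.relative_to(root).as_posix()] = counter.count
    return result


def demo():
    """Build a throwaway site with one sub-folder and scan it."""
    with tempfile.TemporaryDirectory() as folder:
        root = Path(folder)
        (root / "family").mkdir()
        (root / "index.html").write_text(
            '<div itemscope itemtype="http://schema.org/Person"></div>',
            encoding="utf-8",
        )
        (root / "family" / "kids.html").write_text(
            '<div itemscope><div itemprop="children" itemscope></div></div>',
            encoding="utf-8",
        )
        # Non-HTML files are ignored by the scan.
        (root / "style.css").write_text("body {}", encoding="utf-8")
        return scan_site(root)


print(demo())
```

Staying up to date after edits would additionally require watching the files for changes, which this sketch leaves out.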

The screenshot below shows some of the new TBC capabilities in practice.

You can see that TopBraid has built-in views to browse the class hierarchy, properties and instances. These are powerful mechanisms to navigate through the data space that is encoded in the HTML pages. In the example above, you can see that my current Microdata pages contain information about three Persons, as well as various address and location objects. The class tree shows the number of instances of each class. A double-click on an instance will display it in a form. You can see the form view of the resource http://knublauch.com (representing myself as a schema:Person) on the right. Here is a larger view, with the details of one of the child objects opened up:


Alternative views such as graphs and smart browser displays are also built-in. Here is a TBC graph view of some instances:


Analyzing Web Data with SPARQL and SPIN

You can also run SPARQL queries over this data:

We have a lot of SPARQL-based features built into the TopBraid platform, including the rule and constraint language SPIN (now a W3C Member Submission). SPIN is useful for defining model-based integrity constraints, and I have started to create a SPIN constraints library for the schema.org namespace. Currently this checks that the value type of properties on the HTML pages matches the range defined by the ontology, but more checks will be added, for example regular expressions for email addresses, country abbreviations etc. More on this in a separate entry some day.
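The range check can be sketched in a few lines. Note that this is plain Python standing in for SPIN (which expresses such constraints as SPARQL), so it shows the idea rather than the actual constraints library. The range table is a small hand-written excerpt, and the example.org resources and triples are hypothetical.

```python
from datetime import date

SCHEMA = "http://schema.org/"

# Hand-written excerpt of property ranges, for illustration only.
# A Python type stands for a literal range, a URI string for a class range.
RANGES = {
    SCHEMA + "birthDate": date,
    SCHEMA + "children": SCHEMA + "Person",
}

# Hypothetical type assertions for the resources used below.
TYPES = {
    "http://example.org/me": SCHEMA + "Person",
    "http://example.org/child1": SCHEMA + "Person",
    "http://example.org/home": SCHEMA + "PostalAddress",
}

# Hypothetical triples as (subject, property, value) tuples.
TRIPLES = [
    ("http://example.org/me", SCHEMA + "children", "http://example.org/child1"),
    ("http://example.org/me", SCHEMA + "children", "http://example.org/home"),
    ("http://example.org/me", SCHEMA + "birthDate", date(1970, 1, 1)),
    ("http://example.org/child1", SCHEMA + "birthDate", "sometime in May"),
]


def value_matches(value, expected, types):
    """Check one value against a literal type or a class URI."""
    if isinstance(expected, type):
        return isinstance(value, expected)
    return types.get(value) == expected


def range_violations(triples, ranges, types):
    """Return the triples whose value does not match the property's range."""
    return [
        triple
        for triple in triples
        if triple[1] in ranges
        and not value_matches(triple[2], ranges[triple[1]], types)
    ]


for subject, prop, value in range_violations(TRIPLES, RANGES, TYPES):
    print("Range violation:", subject, prop, value)
```

Here the second and fourth triples are reported: an address used as a child, and free text where a date is expected. Properties without an entry in the range table are simply not checked.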

Editing Microdata and RDFa

Once you have checked constraints and the system reports a violation, you can navigate to the source of the violation on the form of the relevant instance. From those forms, you simply need to double-click on the icon to the left of the value to navigate to the HTML source code:

At this stage, the circle is complete and you are back in the HTML document, where you can fix problems (e.g. a misspelled email address). Save the HTML file, and the RDF triple (on the form and elsewhere) will update automatically.

The HTML editor in TopBraid Composer has been enhanced with syntax highlighting for the Microdata attributes such as itemprop. And more is on its way...

Harvesting Microdata and RDFa from the web

In addition to editing and processing local Web Data files, TopBraid can also be used to work with external markup from existing pages. TBC Version 3.5 already introduced the Web Data Basket, and we have extended this to also support Microdata. The mechanism is simple yet powerful: you install a small Firefox extension that sends the pages you visit to your locally running TopBraid Composer. TBC then collects all RDF metadata contained on the visited pages and makes it available to its RDF, OWL and SPARQL machinery. This means you can simply browse the web and automatically get a stream of RDF triples into your working environment.