I’ve touched on the issues with JSON-LD a little already, so I’m going to spell that out a little more here. If you’re not interested in the technical details, you can probably skip this post - honestly, I’m pretty much just rubber-ducking here.
Here’s the theory. I’m building a RESTful site generator. I settled on schema.org’s JSON-LD as a standard, though it’s kind of a wibbly-wobbly, ill-documented one. Each JSON file has its own url (and can theoretically be a static file). Based on the HTTP Accepts
header, a client can navigate the site based on the corresponding HTML without running a line of Javascript, or based on the JSON without having to parse HTML. Schema.org doesn’t quite allow for this, though.
Here’s a real-world schema.org implementation, at BBC News:
Each of those boxes is an HTML page; the JSON doesn’t stand alone. This isn’t a surprise - as I’ve mentioned, Google doesn’t recognize it outside of an HTML wrapper, and even then only recommends it if it’s physically there and not linked with a src=
. I can handle this behavior pretty simply by embedding the JSON in the corresponding HTML page. Smarter clients can still grab the JSON directly. It bloats the HTML, but what can you do?
However, there are a few structural issues I’d have in parsing this with a client. The two separate JSON structures both have an URL of the underlying HTML page, so it’s clear to my client they’re related. The ItemList
contains a list of urls, which is pretty much the only legal thing they can contain. So far, so good. But navigate to the article page, and there’s a discoverability problem. The Article
contains the article data, but the only links to other things in it are the image urls. If I enter the site on an article, the only navigation I can do is to go to the website root - which in this case doesn’t work, because only the /news hierarchy has microdata.
The Publisher
also exists only as substructures - there is no url or other unique identifier. Now, obviously the Beeb just has it in there because Google demands it and not because this is a RESTful interface, so it’s no big deal. Presumably whatever generates the data has the Publisher
info in a single place, so updating, say, the logo is something that propogates automatically, and changing the name isn’t ambiguous as to whether the Publisher
’s name is changing or the Publisher
entity is being replaced.
The lack of navigation, though, is an underlying problem in the schema. An Article
can have a Section
, but that’s just a text field - it can be isPartOf
a CreativeWork
, but there really isn’t a structure that would be an appropriate section.
Let’s look at the current(ish) Wirebird incarnation: a hyperlocal journalism site. (Yes, yes, “a blog.”) First of all, the current (inherited from its WordPress days) structure, which could be changed, but if Wirebird is supposed to be a generic site generator (yes, yes, “a CMS”) then it should be able to emulate it.
The site is entirely the publication/blog/CMS, so the base page is the base navigation point and we don’t have the BBC News issue. There are three sections - for residents, visitors, and businesses (the latter being B2B). We want site visitors to be able to go to the section appropriate to them and only see those articles, but the base page shows articles from all sections (currently, in straight-up reverse-chron order). Presently the url structure is completely flat, because (like this site) it used Plerd in the meantime… but really, the urls don’t matter for REST. So the base page (“index”) is a WebSite
. Now, the schema does say “Every web page is implicitly assumed to be declared to be of type WebPage, so the various properties about that webpage, such as breadcrumb
may be used. We recommend explicit declaration if these properties are specified, but if they are found outside of an itemscope, they will be assumed to be about the page.” That doesn’t necessarily help if I want my JSON objects to stand alone, but wait: you can also add an alternateType. So let’s do that.
That gives us breadCrumb
, which gives us upward discovery at least. It also gives us relatedLink
and significantLink
, which can be (arrays of) straight URLs, not schema objects, but that’s okay. Mostly, anyway. Tentatively, then, the significantLink
of the base page is a list of the sections, and also of the miscellaneous WebPage
s like “about” and “privacy” and such - basically, everything that gets linked from the menu bar in the html version. The relatedLink
list is all of the posts, even more tentatively. This gets used to build a Daring Fireball-like archive, but if it’s going to be a permanent part of the index page it’s probably going to involve pagination. Maybe it’ll just be the recent posts, and be the json equivalent of the atom and rss feeds? We’ll see.
The sections (not Section
s, because those aren’t an object) are a little uncertain. If the individual news articles slash blog entries are coded as BlogPosting
, then the sections can be Blog
s. But Google seems to consider those a little differently than Article
s and NewsArticle
s, as far as things like Google News is concerned. There’s no direct equivalent of a Blog
for news sites - there are PublicationIssue
s and PublicationVolume
s, but that’s doesn’t seem quite right for a continuously-updated news site. The sections can just be WebPage
alone, with entries as a significantLink
list, but that’s ambiguous: how do we tell the difference between an “about” page and a news section named “about”?
We can define sections as Blog
s and articles as both BlogPosting
and NewsArticle
in alternateType
, but the point of a schema starts to get lost if we layer too many types on (even though the differences between the types are negligible at this point). This might require some more consideration.