I’ve touched on the issues with JSON-LD a little already, so I’m going to spell that out a little more here. If you’re not interested in the technical details, you can probably skip this post - honestly, I’m pretty much just rubber-ducking here.
Here’s the theory. I’m building a RESTful site generator. I settled on schema.org’s JSON-LD as a standard, though it’s kind of a wibbly-wobbly, ill-documented one. Each JSON file has its own url (and can theoretically be a static file). Based on the HTTP
Here’s a real-world schema.org implementation, at BBC News:
Each of those boxes is an HTML page; the JSON doesn’t stand alone. This isn’t a surprise - as I’ve mentioned, Google doesn’t recognize it outside of an HTML wrapper, and even then only recommends it if it’s physically there and not linked with a
src=. I can handle this behavior pretty simply by embedding the JSON in the corresponding HTML page. Smarter clients can still grab the JSON directly. It bloats the HTML, but what can you do?
However, there are a few structural issues I’d have in parsing this with a client. The two separate JSON structures both have an URL of the underlying HTML page, so it’s clear to my client they’re related. The
ItemList contains a list of urls, which is pretty much the only legal thing they can contain. So far, so good. But navigate to the article page, and there’s a discoverability problem. The
Article contains the article data, but the only links to other things in it are the image urls. If I enter the site on an article, the only navigation I can do is to go to the website root - which in this case doesn’t work, because only the /news hierarchy has microdata.
Publisher also exists only as substructures - there is no url or other unique identifier. Now, obviously the Beeb just has it in there because Google demands it and not because this is a RESTful interface, so it’s no big deal. Presumably whatever generates the data has the
Publisher info in a single place, so updating, say, the logo is something that propogates automatically, and changing the name isn’t ambiguous as to whether the
Publisher’s name is changing or the
Publisher entity is being replaced.
The lack of navigation, though, is an underlying problem in the schema. An
Article can have a
Section, but that’s just a text field - it can be
CreativeWork, but there really isn’t a structure that would be an appropriate section.
Let’s look at the current(ish) Wirebird incarnation: a hyperlocal journalism site. (Yes, yes, “a blog.”) First of all, the current (inherited from its WordPress days) structure, which could be changed, but if Wirebird is supposed to be a generic site generator (yes, yes, “a CMS”) then it should be able to emulate it.
The site is entirely the publication/blog/CMS, so the base page is the base navigation point and we don’t have the BBC News issue. There are three sections - for residents, visitors, and businesses (the latter being B2B). We want site visitors to be able to go to the section appropriate to them and only see those articles, but the base page shows articles from all sections (currently, in straight-up reverse-chron order). Presently the url structure is completely flat, because (like this site) it used Plerd in the meantime… but really, the urls don’t matter for REST. So the base page (“index”) is a
WebSite. Now, the schema does say “Every web page is implicitly assumed to be declared to be of type WebPage, so the various properties about that webpage, such as
breadcrumb may be used. We recommend explicit declaration if these properties are specified, but if they are found outside of an itemscope, they will be assumed to be about the page.” That doesn’t necessarily help if I want my JSON objects to stand alone, but wait: you can also add an alternateType. So let’s do that.
That gives us
breadCrumb, which gives us upward discovery at least. It also gives us
significantLink, which can be (arrays of) straight URLs, not schema objects, but that’s okay. Mostly, anyway. Tentatively, then, the
significantLink of the base page is a list of the sections, and also of the miscellaneous
WebPages like “about” and “privacy” and such - basically, everything that gets linked from the menu bar in the html version. The
relatedLink list is all of the posts, even more tentatively. This gets used to build a Daring Fireball-like archive, but if it’s going to be a permanent part of the index page it’s probably going to involve pagination. Maybe it’ll just be the recent posts, and be the json equivalent of the atom and rss feeds? We’ll see.
The sections (not
Sections, because those aren’t an object) are a little uncertain. If the individual news articles slash blog entries are coded as
BlogPosting, then the sections can be
Blogs. But Google seems to consider those a little differently than
NewsArticles, as far as things like Google News is concerned. There’s no direct equivalent of a
Blog for news sites - there are
PublicationVolumes, but that’s doesn’t seem quite right for a continuously-updated news site. The sections can just be
WebPage alone, with entries as a
significantLink list, but that’s ambiguous: how do we tell the difference between an “about” page and a news section named “about”?
We can define sections as
Blogs and articles as both
alternateType, but the point of a schema starts to get lost if we layer too many types on (even though the differences between the types are negligible at this point). This might require some more consideration.