[DISCUSS] Schema.org translator
This schema.org translator was done by @tgr during the Hackday at WikiCite 17.
This PR is to have some place to discuss how to continue from here.
Thanks Philip!
This is very much WIP (only looks for JSON-LD, probably does not recognize all the possible formats, only uses a small subset of attributes, only tested on a single site (bbc.com)). Before going forward with it, there are some general problems which should be considered:
- JSON-LD documents can be very complex (multiple contexts, multiple ways to declare the type, named graphs, aliasing, nesting...) and parsing them by hand is not really maintainable. Something like jsonld.js should be made available to translators.
- microdata and RDFa are more straightforward but would ideally be parsed to the exact same JS object as their JSON-LD equivalent, to keep the business logic readable. I haven't looked closely enough to say whether that's easy or not. RDFaJSON and microdata-node are two tools that claim to do this.
- A generic schema.org parser needs to coexist with page-specific translators. Right now the code in the patch will produce better results for many BBC pages than the dedicated bbc.com translator, but will be ignored due to that translator's lower priority. The current (not so great) workaround for that is that many translators internally invoke the generic translator (Embedded Metadata). Should the Schema.org translator be squashed into that one, or is there a better solution?
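To illustrate the first point, here is a minimal sketch of what hand-rolled JSON-LD handling tends to look like. The helper names (`typesOf`, `collectTypedNodes`) are hypothetical and not part of the patch or of Zotero. It only copes with `@type` given as a string or array, top-level arrays, `@graph`, and nested objects; contexts, aliasing, and named-graph semantics are ignored, which is the part a library such as jsonld.js would be needed for.

```javascript
// Hypothetical helpers for illustration only; not the patch's code.
const SCHEMA_PREFIX = /^https?:\/\/schema\.org\//;

// Normalize "@type" (string, array, or missing) to an array of short type names.
function typesOf(node) {
  const raw = node['@type'];
  if (raw === undefined) return [];
  return (Array.isArray(raw) ? raw : [raw])
    .filter((t) => typeof t === 'string')
    .map((t) => t.replace(SCHEMA_PREFIX, ''));
}

// Walk a parsed JSON-LD document and collect every object that declares a type,
// including nodes in top-level arrays, inside "@graph", and in nested properties.
function collectTypedNodes(value, out = []) {
  if (Array.isArray(value)) {
    value.forEach((v) => collectTypedNodes(v, out));
  } else if (value !== null && typeof value === 'object') {
    if (typesOf(value).length > 0) out.push(value);
    for (const key of Object.keys(value)) {
      // The context is skipped entirely, so aliases and prefixes are NOT resolved.
      if (key !== '@context') collectTypedNodes(value[key], out);
    }
  }
  return out;
}

// Made-up sample document.
const doc = {
  '@context': 'http://schema.org',
  '@graph': [
    {
      '@type': 'NewsArticle',
      headline: 'Example',
      author: { '@type': 'Person', name: 'A. Writer' },
    },
    { '@type': ['http://schema.org/WebPage', 'ItemPage'], name: 'Example page' },
  ],
};

const found = collectTypedNodes(doc).map(typesOf);
console.log(found);
```

Even this small subset needs special cases, and a document that aliases `@type` in its context would silently yield nothing.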
@tgr the plan (#917) was to include it in EM, not a separate translator
Even basic handling would be appreciated in the short run, since it's been so sorely needed for so long!
I remembered that there is also another approach by @dgerber in https://github.com/dgerber/translators/tree/microdata which looks promising.
To keep things DRY and manageable, I think it's important to separate two steps:
- the parsing of specific formats: JSON-LD, Microdata, RDFa, Turtle, HTML headers, RDF/XML (this one is implemented in Zotero.RDF), etc.
- the "semantic" translation, mapping vocabularies / ontologies (Dublin Core, schema.org, etc.) to Zotero items.
Currently, the most complete vocabulary translation is done in RDF.js (and called from EM), which happens to also parse RDF/XML. So that branch puts schema.org-related code (mostly) in RDF.js, and the parsing in Microdata.js. (I remember that I could not find a lean, easily integrable, and correct JS microdata parser at the time, but I don't remember which library had which issue.)
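The two-step split described above could be sketched roughly as follows. This is not how RDF.js or the branch does it; all names (`nodeFromMicrodata`, `itemFromSchemaOrg`, `TYPE_MAP`) are hypothetical, the microdata input is assumed to be in the `type` / `properties` JSON shape, and the output is a simplified Zotero-like object rather than a real `Zotero.Item`.

```javascript
// Step 1 (format-specific): convert a microdata item, assumed to be in the
// { type: [...], properties: { name: [...] } } shape, into a JSON-LD-like node.
function nodeFromMicrodata(item) {
  const node = {
    '@type': item.type.map((t) => t.replace(/^https?:\/\/schema\.org\//, '')),
  };
  for (const [name, values] of Object.entries(item.properties)) {
    const converted = values.map((v) =>
      (v !== null && typeof v === 'object' ? nodeFromMicrodata(v) : v));
    // Unwrap single values so the node looks like typical JSON-LD.
    node[name] = converted.length === 1 ? converted[0] : converted;
  }
  return node;
}

// Step 2 (vocabulary-specific): map a schema.org node to a Zotero-like item.
// The type table is illustrative and far from complete.
const TYPE_MAP = new Map([
  ['NewsArticle', 'newspaperArticle'],
  ['ScholarlyArticle', 'journalArticle'],
  ['Book', 'book'],
]);

function asArray(v) {
  if (v === undefined) return [];
  return Array.isArray(v) ? v : [v];
}

function itemFromSchemaOrg(node) {
  const type = asArray(node['@type']).find((t) => TYPE_MAP.has(t));
  if (!type) return null; // not a type this sketch knows about
  return {
    itemType: TYPE_MAP.get(type),
    title: node.headline || node.name,
    date: node.datePublished,
    creators: asArray(node.author).map((a) => ({
      name: typeof a === 'string' ? a : a.name,
      creatorType: 'author',
    })),
  };
}

// The same made-up article, once as JSON-LD and once as microdata.
const fromJsonLd = {
  '@type': 'NewsArticle',
  headline: 'Example',
  datePublished: '2017-01-01',
  author: { '@type': 'Person', name: 'A. Writer' },
};
const fromMicrodata = nodeFromMicrodata({
  type: ['http://schema.org/NewsArticle'],
  properties: {
    headline: ['Example'],
    datePublished: ['2017-01-01'],
    author: [{ type: ['http://schema.org/Person'], properties: { name: ['A. Writer'] } }],
  },
});

const itemA = itemFromSchemaOrg(fromJsonLd);
const itemB = itemFromSchemaOrg(fromMicrodata);
console.log(itemA, itemB);
```

Because only step 1 knows about the serialization, adding RDFa or Turtle would mean adding another converter, while the schema.org mapping stays in one place.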
It would be nice to make all generic metadata easily available to page-specific translators. However, some web pages have different or conflicting metadata in different formats (e.g. JSON-LD that is repeated, but with errors, as microdata), so simply merging what is parsed from JSON-LD or Microdata into Embedded Metadata.js by default would break a few translators.
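One way to avoid breaking existing translators might be to make the merge opt-in, with a caller-chosen precedence order. The sketch below is only an assumption about how that could look; `mergeItems` and the source names (`metaTags`, `jsonld`, `microdata`) are hypothetical and do not exist in Embedded Metadata.js.

```javascript
// Hypothetical sketch: fill in missing fields from sources in a caller-chosen order.
// With the default order, only the existing meta-tag result is used, so
// translators that do not opt in would see no change.
function mergeItems(sources, order = ['metaTags']) {
  const merged = {};
  for (const name of order) {
    const item = sources[name];
    if (!item) continue;
    for (const [field, value] of Object.entries(item)) {
      const empty = value === undefined || value === null || value === ''
        || (Array.isArray(value) && value.length === 0);
      // Earlier sources win; later ones only fill gaps.
      if (!empty && merged[field] === undefined) merged[field] = value;
    }
  }
  return merged;
}

// Made-up example of a page whose formats disagree.
const sources = {
  metaTags: { title: 'Example', date: '' },
  jsonld: { title: 'Example | Site name', date: '2017-01-01' },
  microdata: { title: 'Exmaple', date: '2017-01-01', publicationTitle: 'Site' },
};

const legacy = mergeItems(sources);                        // default: meta tags only
const optedIn = mergeItems(sources, ['metaTags', 'jsonld']); // translator opts in
console.log(legacy, optedIn);
```

Here the erroneous microdata title never reaches the item unless a translator explicitly lists `microdata` in its order.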
+1 to use a proper JSON-LD lib like jsonld.js.
More examples for schema.org for publications:
- https://www.dukeupress.edu/steeped-in-heritage
- http://nationalpost.com/news/world/from-bigly-to-me-donald-trump-redefines-the-political-lexicon
I have been working on expressing academic CVs in schema.org JSON-LD. Would those examples be helpful to test against? I know there's a difference between what one might build as a test and real-world data, so I'm not sure whether more item types and examples would be helpful.