
[DISCUSS] Schema.org translator

Open zuphilip opened this issue 7 years ago • 6 comments

This schema.org translator was done by @tgr during the Hackday at WikiCite 17.

This PR is to have some place to discuss how to continue from here.

zuphilip avatar Jun 02 '17 17:06 zuphilip

Thanks Philip!

This is very much WIP: it only looks for JSON-LD, probably does not recognize all the possible formats, only uses a small subset of attributes, and has only been tested on a single site (bbc.com). Before going forward with it, there are some general problems which should be considered:

  • JSON-LD documents can be very complex (multiple contexts, multiple ways to declare the type, named graphs, aliasing, nesting...) and parsing them by hand is not really maintainable. Something like jsonld.js should be made available to translators.
  • Microdata and RDFa are more straightforward but would ideally be parsed to the exact same JS object as their JSON-LD equivalent, to keep the business logic readable. I haven't looked closely enough to say whether that's easy or not. RDFaJSON and microdata-node are two tools that claim to do this.
  • A generic schema.org parser needs to coexist with page-specific translators. Right now the code in the patch will produce better results for many BBC pages than the dedicated bbc.com translator, but will be ignored due to that translator's lower priority. The current (not so great) workaround for that is that many translators internally invoke the generic translator (Embedded Metadata). Should the Schema.org translator be squashed into that one, or is there a better solution?
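
To illustrate the first point, here is a minimal sketch of what hand-parsing tends to look like. `findNodesByType` is a hypothetical helper, not code from the patch, and it assumes the JSON-LD has already been parsed with `JSON.parse`. It covers only top-level arrays, `@graph`, and `@type` given as a string or an array; aliasing, prefixed names, multiple contexts and nested nodes are all left unhandled, which is the maintainability problem in a nutshell.

```javascript
// Hypothetical helper (illustration only): collect every node of a given
// schema.org type from already-parsed JSON-LD.
function findNodesByType(doc, wantedType) {
  const found = [];
  const visit = (node) => {
    if (Array.isArray(node)) {
      node.forEach(visit);
      return;
    }
    if (node === null || typeof node !== "object") return;
    // @type may be absent, a single string, or an array of strings
    const types = [].concat(node["@type"] || []);
    // Strip a full schema.org IRI so it compares equal to the short name
    const normalized = types.map((t) =>
      String(t).replace(/^https?:\/\/schema\.org\//, "")
    );
    if (normalized.includes(wantedType)) found.push(node);
    // Descend into a named or default graph if one is present
    if (node["@graph"]) visit(node["@graph"]);
  };
  visit(doc);
  return found;
}
```

Every additional JSON-LD feature adds another branch to `visit`, which is why a real processor such as jsonld.js looks attractive.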

tgr avatar Jun 03 '17 12:06 tgr

@tgr the plan (#917) was to include it in EM, not a separate translator

Even basic handling would be appreciated in the short-run, since it's been so sorely needed for so long!

owcz avatar Jul 09 '17 21:07 owcz

I remembered that there is also another approach by @dgerber in https://github.com/dgerber/translators/tree/microdata which looks promising.

zuphilip avatar Jul 10 '17 10:07 zuphilip

To keep things DRY and manageable, I think it's important to separate two steps:

  1. the parsing of specific formats -- JSON-LD, Microdata, RDFa, Turtle, HTML headers, RDF/XML (this one is implemented in Zotero.RDF), etc.
  2. the "semantic" translation, mapping vocabularies / ontologies -- Dublin core, schema.org, etc. -- to Zotero items.
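
A rough sketch of that separation, under stated assumptions: the `parsers` table, the `{ type, props }` shape, `typeMap` and `schemaOrgToItem` are all hypothetical names made up for illustration, and the toy JSON-LD parser only handles a single flat object. It is not how RDF.js or Microdata.js are actually organized.

```javascript
// Step 1: format-specific parsers. Each one returns the same plain shape,
// { type: string, props: object }. Only a toy JSON-LD parser is shown;
// Microdata or RDFa parsers would be further entries in this table.
const parsers = {
  jsonld: (text) => {
    const raw = JSON.parse(text);
    // Separate the JSON-LD keywords from the ordinary properties
    const { "@type": type, "@context": _context, ...props } = raw;
    return { type, props };
  },
};

// Step 2: vocabulary mapping, independent of the source format.
// The type table is a small illustrative subset, not a complete mapping.
const typeMap = { NewsArticle: "newspaperArticle", Book: "book" };

function schemaOrgToItem(node) {
  return {
    itemType: typeMap[node.type] || "webpage",
    title: node.props.headline || node.props.name || "",
    date: node.props.datePublished || "",
  };
}
```

With this split, supporting a new serialization means adding one parser, and supporting a new vocabulary means adding one mapping function, without either touching the other.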

Currently, the most complete vocabulary translation is done in RDF.js (and called from EM), which happens to also parse RDF/XML. So that branch puts schema.org-related code (mostly) in RDF.js, and the parsing in Microdata.js. (I remember I could not find a lean, easily integrable and correct js microdata parser then, but not which library had which issue.)

It would be nice to make all generic metadata easily available to page specific translators. However some web pages have different/conflicting metadata in different formats (e.g. JSON-LD, repeated but with errors as microdata), so simply merging by default in Embedded Metadata.js what is parsed from JSON-LD or Microdata would break a few translators.
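
One conservative policy, sketched here purely as an assumption and not as what Embedded Metadata.js does, is to let extra sources only fill in fields the primary source left empty. `fillMissing` is a hypothetical helper operating on plain objects.

```javascript
// Hypothetical helper: keep every value from `primary` and use the
// fallbacks, in order, only for fields that are still empty.
function fillMissing(primary, ...fallbacks) {
  const isEmpty = (v) => v === undefined || v === null || v === "";
  const merged = { ...primary };
  for (const source of fallbacks) {
    for (const [key, value] of Object.entries(source)) {
      if (isEmpty(merged[key]) && !isEmpty(value)) {
        merged[key] = value;
      }
    }
  }
  return merged;
}
```

This avoids overwriting data that existing translators already rely on, but it does not help when the primary source itself is the one with errors, so it is only a partial answer to the conflict problem.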

+1 to use a proper JSON-LD lib like jsonld.js.

dgerber avatar Jul 13 '17 13:07 dgerber

More examples for schema.org for publications:

  • https://www.dukeupress.edu/steeped-in-heritage
  • http://nationalpost.com/news/world/from-bigly-to-me-donald-trump-redefines-the-political-lexicon

zuphilip avatar Jan 21 '18 23:01 zuphilip

I have been working on expressing academic CVs in schema.org JSON-LD. Would those examples be helpful to test against? I know there is a difference between what one might build as a test and real-world data, so I'm not sure whether more item types and examples would be helpful.

HughP avatar Mar 15 '21 20:03 HughP