ENH: parse schema.org/NewsArticle RDFa, Microdata, or JSON-LD
Issue by westurner
Wed Sep 27 19:51:34 2017
Originally opened as https://github.com/codelucas/newspaper/issues/448
Schema.org Linked Data:
- http://schema.org/NewsArticle
- http://schema.org/docs/news.html
Parsers:
- http://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.graph.Graph.parse
- https://rdflib.readthedocs.io/en/stable/_modules/rdflib/plugins/parsers/pyRdfa.html#pyRdfa
- https://rdflib.readthedocs.io/en/stable/_modules/rdflib/plugins/parsers/pyMicrodata/microdata.html
- https://github.com/RDFLib/rdflib-jsonld
Tasks:
- [ ] sniff(fileobj, location)
- [ ] parse(fileobj, location)
- for fmt in ('rdfa', 'microdata', 'json-ld'): rdflib.Graph().parse(fileobj, publicID=uri, format=fmt)  # one parse per syntax
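
The sniff(fileobj, location) task could be sketched roughly as follows. This is only an illustration using the standard library, not an existing newspaper or rdflib API: the `_SyntaxSniffer` class and the returned labels are made up for this example, and `location` is accepted but unused. The attribute checks are heuristics; for instance, Open Graph `<meta property="og:...">` tags will also be reported as RDFa.

```python
import io
from html.parser import HTMLParser


class _SyntaxSniffer(HTMLParser):
    """Hypothetical helper: note which structured-data syntaxes a page seems to use."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names, so itemProp arrives as itemprop.
        names = {name for name, _ in attrs}
        attr = dict(attrs)
        script_type = (attr.get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self.found.add("json-ld")
        if names & {"itemscope", "itemtype", "itemprop"}:
            self.found.add("microdata")
        if names & {"vocab", "typeof", "property"}:
            self.found.add("rdfa")


def sniff(fileobj, location=None):
    """Return a sorted list of syntaxes detected in fileobj.

    `location` mirrors the task signature above; this sketch does not use it.
    """
    data = fileobj.read()
    if isinstance(data, bytes):
        # Assumption: UTF-8 input; a real implementation would detect the encoding.
        data = data.decode("utf-8", errors="replace")
    parser = _SyntaxSniffer()
    parser.feed(data)
    parser.close()
    return sorted(parser.found)
```

The result could then decide which parser formats are worth trying, so parse(fileobj, location) does not have to run every parser on every page.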
Comment by westurner
Wed Sep 27 19:55:59 2017
#251 mentions itemProp="datePublished" (Microdata)
The RDFa for this property could be:
- property="http://schema.org/datePublished"
- property="datePublished"
- property="schema:datePublished"
- property="http://schema.org/datePublished" content="ISO8601date+timeZ"
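
To illustrate how the Microdata and RDFa spellings above could be handled in one pass, here is a minimal standard-library sketch. `DatePublishedFinder` and `extract_date_published` are hypothetical names, not part of newspaper. It does not resolve RDFa prefixes or vocab declarations; it only matches the literal spellings listed above, and it assumes the matched element contains no nested tags.

```python
from html.parser import HTMLParser

# The spellings listed above; a real RDFa parser would resolve prefixes instead.
ACCEPTED = {
    "datePublished",
    "schema:datePublished",
    "http://schema.org/datePublished",
}
VOID_TAGS = {"meta", "link"}  # elements that never carry text content


class DatePublishedFinder(HTMLParser):
    """Hypothetical helper: collect datePublished values from itemprop/property."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._capture = False  # True while reading text of a matching element

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        names = (attr.get("itemprop") or "").split()
        names += (attr.get("property") or "").split()
        if not any(name in ACCEPTED for name in names):
            return
        # An explicit content (or datetime on <time>) wins over the element text.
        explicit = attr.get("content") or attr.get("datetime")
        if explicit:
            self.values.append(explicit.strip())
        elif tag not in VOID_TAGS:
            self._capture = True
            self.values.append("")

    def handle_data(self, data):
        if self._capture:
            self.values[-1] += data

    def handle_endtag(self, tag):
        # Simplification: the first end tag closes the capture.
        if self._capture:
            self._capture = False
            self.values[-1] = self.values[-1].strip()


def extract_date_published(html):
    """Return every non-empty datePublished value found in the HTML string."""
    finder = DatePublishedFinder()
    finder.feed(html)
    finder.close()
    return [value for value in finder.values if value]
```

The returned strings are left unparsed; turning them into datetime objects would be a separate step.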
Comment by westurner
Sat Oct 27 21:09:59 2018
Extruct is the best tool for accomplishing this, IMHO: https://github.com/scrapinghub/extruct
extruct.extract(): https://github.com/scrapinghub/extruct/blob/master/extruct/_extruct.py
https://github.com/RDFLib/rdflib/issues/770#issuecomment-433655142
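
A rough sketch of how extruct.extract() output might be filtered for NewsArticle items. `load_structured_data` and `find_news_articles` are hypothetical helpers written for this comment. It assumes extruct returns a dict mapping each syntax name to a list of JSON-LD-like dicts (which is what `uniform=True` is meant to encourage), and it does not descend into `@graph` or nested items.

```python
import re


def load_structured_data(html, url):
    """Run extruct over an HTML string. Requires the third-party extruct package."""
    import extruct  # imported lazily so the filter below works without it

    return extruct.extract(
        html,
        base_url=url,
        syntaxes=["json-ld", "microdata", "rdfa"],
        uniform=True,
    )


def _types(item):
    """Return the item's @type as a list, whether it is a string or a list."""
    value = item.get("@type", [])
    return [value] if isinstance(value, str) else list(value)


def find_news_articles(data):
    """Return (syntax, item) pairs whose @type ends in NewsArticle.

    Assumption: `data` looks like {"json-ld": [...], "microdata": [...], ...}.
    """
    hits = []
    for syntax, items in data.items():
        for item in items:
            if not isinstance(item, dict):
                continue
            # Compare only the local name, so "NewsArticle" and
            # "http://schema.org/NewsArticle" both match.
            local_names = [re.split(r"[/#:]", t)[-1] for t in _types(item)]
            if "NewsArticle" in local_names:
                hits.append((syntax, item))
    return hits
```

Usage would be along the lines of `find_news_articles(load_structured_data(html, url))`, with the article fields then read from the matching items.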
Comment by simonm3
Thu Aug 22 19:27:30 2019
Great package, but I'm wondering why schema.org support is not included, as most newspapers and media sites seem to use it. Are there any plans to add this?