ENH: parse schema.org/NewsArticle RDFa, Microdata, or JSONLD

Open AndyTheFactory opened this issue 2 years ago • 3 comments

Issue by westurner Wed Sep 27 19:51:34 2017 Originally opened as https://github.com/codelucas/newspaper/issues/448


Schema.org Linked Data:

  • http://schema.org/NewsArticle
  • http://schema.org/docs/news.html

Parsers:

  • http://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.graph.Graph.parse
  • https://rdflib.readthedocs.io/en/stable/_modules/rdflib/plugins/parsers/pyRdfa.html#pyRdfa
  • https://rdflib.readthedocs.io/en/stable/_modules/rdflib/plugins/parsers/pyMicrodata/microdata.html
  • https://github.com/RDFLib/rdflib-jsonld

Tasks:

  • [ ] sniff(fileobj, location)
  • [ ] parse(fileobj, location)
    • rdflib.Graph().parse(fileobj, uri, format='rdfa'|'microdata'|'jsonld') # for each
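A minimal sketch of the two tasks above, limited to JSON-LD and using only the standard library so that it runs without rdflib or its plugins. The signatures `sniff(html)` and `parse_jsonld(html)` are assumptions for illustration (the task list names `sniff(fileobj, location)` and `parse(fileobj, location)`), and the substring checks in `sniff` are a rough heuristic, not a real Microdata or RDFa parser.

```python
import json
from html.parser import HTMLParser

# Spellings of the NewsArticle type this sketch accepts (an assumption).
NEWS_TYPES = {
    "NewsArticle",
    "http://schema.org/NewsArticle",
    "https://schema.org/NewsArticle",
}


class _JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # None while outside a JSON-LD script element

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def sniff(html):
    """Guess which syntaxes the markup hints at (cheap substring heuristic)."""
    lowered = html.lower()
    found = set()
    if "application/ld+json" in lowered:
        found.add("jsonld")
    if "itemscope" in lowered or "itemprop=" in lowered:
        found.add("microdata")
    if "typeof=" in lowered or "property=" in lowered or "vocab=" in lowered:
        found.add("rdfa")
    return found


def _iter_nodes(data):
    """Yield every dict node, descending into lists and @graph containers."""
    if isinstance(data, list):
        for item in data:
            yield from _iter_nodes(item)
    elif isinstance(data, dict):
        yield data
        yield from _iter_nodes(data.get("@graph", []))


def parse_jsonld(html):
    """Return every JSON-LD node whose @type is a schema.org NewsArticle."""
    collector = _JsonLdCollector()
    collector.feed(html)
    articles = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        for node in _iter_nodes(data):
            types = node.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if any(isinstance(t, str) and t in NEWS_TYPES for t in types):
                articles.append(node)
    return articles
```

Calling `parse_jsonld(page_html)` returns plain dicts, so fields such as `headline` or `datePublished` can be read directly. Subtypes of NewsArticle and the Microdata and RDFa syntaxes are not covered here; those would still need rdflib or a similar parser.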

AndyTheFactory avatar Oct 24 '23 10:10 AndyTheFactory

Comment by westurner Wed Sep 27 19:55:59 2017


#251 mentions itemProp="datePublished" (Microdata)

The RDFa for this property could be:

  • property="http://schema.org/datePublished"
  • property="datePublished"
  • property="schema:datePublished"
  • property="http://schema.org/datePublished" content="ISO8601date+timeZ"


Comment by westurner Sat Oct 27 21:09:59 2018


Extruct is the best tool for accomplishing this, IMHO https://github.com/scrapinghub/extruct

extruct.extract() https://github.com/scrapinghub/extruct/blob/master/extruct/_extruct.py

https://github.com/RDFLib/rdflib/issues/770#issuecomment-433655142


Comment by simonm3 Thu Aug 22 19:27:30 2019


Great package, but I'm wondering why schema.org support is not included, since most newspapers and media sites seem to use it. Are there any plans to add it?
