Not able to parse publish date.
Issue by kartikparnami
Wed Apr 6 10:35:59 2016
Originally opened as https://github.com/codelucas/newspaper/issues/234
-> Newspaper is unable to parse the publish date for the URL: http://pratyushsharma.blogspot.in/2016/03/jindagi-mauth-na-ban-jaye-samhalo-yaaron.html
-> In the page source, the publishing date appears in the line: <abbr class='published' itemprop='datePublished' title='2016-03-31T03:55:00-07:00'>3:55 AM</abbr>
-> Newspaper gets as far as this DOM element by matching it with
{'attribute': 'itemprop', 'value': 'datePublished', 'content': 'datetime'}
in line 203 of extractors.py.
-> But the attribute named under 'content' (datetime) is not present on this element; the date is in the title attribute instead.
-> We need to add
{'attribute': 'itemprop', 'value': 'datePublished', 'content': 'title'}
to PUBLISH_DATE_TAGS so that this date is read properly.
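
To illustrate the proposal, here is a minimal standalone sketch of the matching logic. It is not Newspaper's actual code: it uses only the Python standard library, the `PUBLISH_DATE_TAGS` list below is a hypothetical two-entry copy of the rules quoted above, and `DateFinder` and `find_publish_date` are made-up names for this example.

```python
from datetime import datetime
from html.parser import HTMLParser

# Hypothetical rule list modeled on the two entries quoted in this issue.
# Each rule says: find an element whose `attribute` equals `value`,
# then read the date string from the attribute named in `content`.
PUBLISH_DATE_TAGS = [
    {'attribute': 'itemprop', 'value': 'datePublished', 'content': 'datetime'},
    {'attribute': 'itemprop', 'value': 'datePublished', 'content': 'title'},
]


class DateFinder(HTMLParser):
    """Collects raw date strings from elements that match any rule."""

    def __init__(self, tags):
        super().__init__()
        self.tags = tags
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for rule in self.tags:
            if attrs.get(rule['attribute']) == rule['value']:
                raw = attrs.get(rule['content'])
                if raw:
                    self.candidates.append(raw)


def find_publish_date(html, tags=PUBLISH_DATE_TAGS):
    """Return the first candidate that parses as an ISO 8601 date, else None."""
    finder = DateFinder(tags)
    finder.feed(html)
    for raw in finder.candidates:
        try:
            return datetime.fromisoformat(raw)
        except ValueError:
            continue
    return None


html = ("<abbr class='published' itemprop='datePublished' "
        "title='2016-03-31T03:55:00-07:00'>3:55 AM</abbr>")

print(find_publish_date(html, PUBLISH_DATE_TAGS[:1]))  # only the 'datetime' rule
print(find_publish_date(html))                         # with the 'title' rule added
```

With only the 'datetime' rule the element is matched but no date string is found; adding the 'title' rule lets the same element yield a parsed date. The sketch assumes the attribute value is ISO 8601, as it is on this page.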
Comment by yprez
Thu May 5 08:38:40 2016
@kartikparnami good find. Are you sure it's supposed to be "title" and it's not a mistake of this specific blog?
https://schema.org/datePublished says the attribute should be "content", other sources refer to "datetime", but I couldn't find any other examples with "title"...
Comment by kartikparnami
Fri May 6 06:30:09 2016
Well, I don't know how widespread this is or whether it's specific to this blog. But I feel the addition simply increases the number of cases we cover. Let me know your thoughts.
Comment by mamoit
Fri Jul 21 23:12:44 2017
#402 changes the behaviour to follow the schema of datePublished. It doesn't solve this problem in particular, but this seems to be an isolated case of out-of-spec metadata.
Comment by saqibaliXIQ
Mon Dec 17 10:16:34 2018
I'm not able to parse the date for articles from many domains, e.g. marketscreener.com, contagionlive, and many others. Diffbot does parse them, but it's paid.