newspaper4k

Not able to parse publish date.

Open AndyTheFactory opened this issue 2 years ago • 5 comments

Issue by kartikparnami Wed Apr 6 10:35:59 2016 Originally opened as https://github.com/codelucas/newspaper/issues/234


-> Newspaper is unable to parse the publish date for the URL: http://pratyushsharma.blogspot.in/2016/03/jindagi-mauth-na-ban-jaye-samhalo-yaaron.html
-> In the page source, the publishing date can be seen in the line: <abbr class='published' itemprop='datePublished' title='2016-03-31T03:55:00-07:00'>3:55 AM</abbr>

-> Newspaper is able to reach this DOM element by matching it with

{'attribute': 'itemprop', 'value': 'datePublished', 'content': 'datetime'}

at line 203 of extractors.py.
-> But the attribute named under 'content' (datetime) is not present on this element; the date is in the title attribute instead.
-> We need to add

{'attribute': 'itemprop', 'value': 'datePublished', 'content': 'title'}

to the PUBLISH_DATE_TAGS to get this read properly.
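
To make the mismatch concrete, here is a minimal standalone sketch of the matching described above. It is not newspaper's actual code: it uses only the standard library, and `PublishDateFinder` and `find_publish_date` are hypothetical names introduced for illustration. Only the two rule dicts and the `<abbr>` snippet come from this issue.

```python
from datetime import datetime
from html.parser import HTMLParser

# Stand-in for the PUBLISH_DATE_TAGS list (an assumption, not the real list).
# The first rule is the existing one quoted above; the second is the proposed one.
PUBLISH_DATE_TAGS = [
    {'attribute': 'itemprop', 'value': 'datePublished', 'content': 'datetime'},
    {'attribute': 'itemprop', 'value': 'datePublished', 'content': 'title'},
]


class PublishDateFinder(HTMLParser):
    """Hypothetical helper: keeps the first date string matched by a rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.date_str = None

    def handle_starttag(self, tag, attrs):
        if self.date_str is not None:
            return
        attr_map = dict(attrs)
        for rule in self.rules:
            # The element matches when e.g. itemprop == 'datePublished' ...
            if attr_map.get(rule['attribute']) == rule['value']:
                # ... but the date is read from the attribute named in 'content'.
                candidate = attr_map.get(rule['content'])
                if candidate:
                    self.date_str = candidate
                    return


def find_publish_date(html, rules=PUBLISH_DATE_TAGS):
    """Return a datetime for the first matching rule, or None."""
    finder = PublishDateFinder(rules)
    finder.feed(html)
    if finder.date_str is None:
        return None
    try:
        return datetime.fromisoformat(finder.date_str)
    except ValueError:
        return None


snippet = ("<abbr class='published' itemprop='datePublished' "
           "title='2016-03-31T03:55:00-07:00'>3:55 AM</abbr>")

print(find_publish_date(snippet, PUBLISH_DATE_TAGS[:1]))  # None
print(find_publish_date(snippet))  # 2016-03-31 03:55:00-07:00
```

With only the existing rule the element is found but no date is read, because it has no datetime attribute; with the extra title rule the timestamp is picked up.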

AndyTheFactory avatar Oct 24 '23 07:10 AndyTheFactory

Comment by yprez Thu May 5 08:38:40 2016


@kartikparnami good find. Are you sure it's supposed to be "title" and it's not a mistake of this specific blog?

https://schema.org/datePublished says the attribute should be "content", other sources refer to "datetime", but I couldn't find any other examples with "title"...


Comment by kartikparnami Fri May 6 06:30:09 2016


Well, I don't know how widespread this issue is or whether it's specific to this blog. But I feel the addition simply increases our coverage of such cases. Let me know your thoughts.


Comment by yprez Fri May 13 19:31:09 2016


Similar to #151


Comment by mamoit Fri Jul 21 23:12:44 2017


#402 changes the behaviour to follow the schema of datePublished. It doesn't solve this problem in particular, but this seems to be an isolated case of out-of-spec metadata.


Comment by saqibaliXIQ Mon Dec 17 10:16:34 2018


Newspaper is not able to parse the date for articles from many domains, such as marketscreener.com, contagionlive, and many others. Diffbot does, but it's paid.
