newspaper
I just want to help with date extraction
These are the names that appear in SCRIPT or META tags to mark dates. Maybe you will find them helpful:
publishdate publish-date prism.publicationDate coverageEndTime uploadDate date published_date published_time pubdate publish_date Date published_at PublishDate dcterms.created rnews:datePublished article:published_time czhdev.publicationDate OriginalPublicationDate og:published_time datePublished article_date_original czhdev.publicationDate article.published published_time_telegram sailthru.date DC.date.issued date parsely-pub-date publishtime publication_date coverageEndTime publishdate publish-date publishedAtDate creationDateTime pub_date updated_time dateModified og:updated_time last-modified Last-Modified DC.date.modified krn:published_time article:modified_time modified_time modifiedDateTime dc.modified
This is the source code that takes care of the publish date tags:
PUBLISH_DATE_TAGS = [
    {'attribute': 'property', 'value': 'rnews:datePublished', 'content': 'content'},
    {'attribute': 'property', 'value': 'article:published_time', 'content': 'content'},
    {'attribute': 'name', 'value': 'OriginalPublicationDate', 'content': 'content'},
    {'attribute': 'itemprop', 'value': 'datePublished', 'content': 'datetime'},
    {'attribute': 'property', 'value': 'og:published_time', 'content': 'content'},
    {'attribute': 'name', 'value': 'article_date_original', 'content': 'content'},
    {'attribute': 'name', 'value': 'publication_date', 'content': 'content'},
    {'attribute': 'name', 'value': 'sailthru.date', 'content': 'content'},
    {'attribute': 'name', 'value': 'PublishDate', 'content': 'content'},
    {'attribute': 'pubdate', 'value': 'pubdate', 'content': 'datetime'},
    {'attribute': 'name', 'value': 'publish_date', 'content': 'content'},
]
See https://github.com/codelucas/newspaper/blob/master/newspaper/extractors.py, lines 198 to 235. You could add your list to that array of dicts and open a pull request.
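To illustrate what adding entries might look like, here is a rough sketch that matches elements against a list shaped like PUBLISH_DATE_TAGS. It is not newspaper's actual extraction code: `DateTagFinder` and `find_date_strings` are hypothetical names, the matching uses the standard library's `html.parser` rather than the library's own parser, and the two new entries assume those names show up as `<meta name=...>` attributes, which should be checked against real pages.

```python
from html.parser import HTMLParser

# Two existing entries plus two hypothetical additions taken from the list above.
PUBLISH_DATE_TAGS = [
    {'attribute': 'property', 'value': 'article:published_time', 'content': 'content'},
    {'attribute': 'itemprop', 'value': 'datePublished', 'content': 'datetime'},
    # Assumption: these appear as <meta name="..." content="...">.
    {'attribute': 'name', 'value': 'parsely-pub-date', 'content': 'content'},
    {'attribute': 'name', 'value': 'DC.date.issued', 'content': 'content'},
]


class DateTagFinder(HTMLParser):
    """Collects raw date strings from elements that match a known entry."""

    def __init__(self, known_tags):
        super().__init__()
        self.known_tags = known_tags
        self.found = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        for known in self.known_tags:
            # An element matches when the named attribute has the expected value.
            if attr_map.get(known['attribute']) == known['value']:
                raw = attr_map.get(known['content'])
                if raw:
                    self.found.append(raw)


def find_date_strings(html, known_tags=PUBLISH_DATE_TAGS):
    """Return matching date strings in document order (unparsed)."""
    finder = DateTagFinder(known_tags)
    finder.feed(html)
    return finder.found
```

The sketch returns raw strings in document order and does not parse them into dates or rank entries by priority, so it only shows the shape of the lookup.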