
I just want to help with date extraction

Open aleksandar-devedzic opened this issue 1 year ago • 1 comments

These are names that appear in SCRIPT or META tags and represent dates. Maybe you will find this helpful:

publishdate publish-date prism.publicationDate coverageEndTime uploadDate date published_date published_time pubdate publish_date Date published_at PublishDate dcterms.created rnews:datePublished article:published_time czhdev.publicationDate OriginalPublicationDate og:published_time datePublished article_date_original czhdev.publicationDate article.published published_time_telegram sailthru.date DC.date.issued date parsely-pub-date publishtime publication_date coverageEndTime publishdate publish-date publishedAtDate creationDateTime pub_date updated_time dateModified og:updated_time last-modified Last-Modified DC.date.modified krn:published_time article:modified_time modified_time modifiedDateTime dc.modified

aleksandar-devedzic • Aug 25 '22 21:08
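
To illustrate how a list like this could be used, here is a minimal sketch that scans META tags for a handful of the names above using only the Python standard library. It is not how newspaper does it; `DATE_KEYS`, `MetaDateFinder`, and `find_meta_dates` are hypothetical names made up for this example, and matching case-insensitively across `property`, `name`, and `itemprop` is an assumption.

```python
from html.parser import HTMLParser

# A small subset of the names listed above, lowercased for matching.
DATE_KEYS = {
    "article:published_time",
    "og:published_time",
    "datepublished",
    "publish_date",
    "dc.date.issued",
    "parsely-pub-date",
}


class MetaDateFinder(HTMLParser):
    """Collect (key, content) pairs from <meta> tags whose key is in DATE_KEYS."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {k: (v or "") for k, v in attrs}
        # Assumption: the date key may sit in any of these three attributes.
        for key_attr in ("property", "name", "itemprop"):
            key = attr_map.get(key_attr, "")
            if key.lower() in DATE_KEYS and attr_map.get("content"):
                self.found.append((key, attr_map["content"]))
                break


def find_meta_dates(html):
    """Return all date-like META entries found in the given HTML string."""
    finder = MetaDateFinder()
    finder.feed(html)
    return finder.found
```

The returned values are raw strings; parsing them into dates is left out because the formats vary from site to site.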

This is the source code that takes care of the publish date tags:

    PUBLISH_DATE_TAGS = [
        {'attribute': 'property', 'value': 'rnews:datePublished', 'content': 'content'},
        {'attribute': 'property', 'value': 'article:published_time', 'content': 'content'},
        {'attribute': 'name', 'value': 'OriginalPublicationDate', 'content': 'content'},
        {'attribute': 'itemprop', 'value': 'datePublished', 'content': 'datetime'},
        {'attribute': 'property', 'value': 'og:published_time', 'content': 'content'},
        {'attribute': 'name', 'value': 'article_date_original', 'content': 'content'},
        {'attribute': 'name', 'value': 'publication_date', 'content': 'content'},
        {'attribute': 'name', 'value': 'sailthru.date', 'content': 'content'},
        {'attribute': 'name', 'value': 'PublishDate', 'content': 'content'},
        {'attribute': 'pubdate', 'value': 'pubdate', 'content': 'datetime'},
        {'attribute': 'name', 'value': 'publish_date', 'content': 'content'},
    ]

See https://github.com/codelucas/newspaper/blob/master/newspaper/extractors.py, lines 198 to 235. You could add your list to that list of dicts and open a pull request.

jumbophp • Sep 30 '22 07:09
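
As a rough sketch of what such a pull request might involve, the extra names could be converted into the same dict shape as the entries in `PUBLISH_DATE_TAGS`. The helpers `to_tag_entry` and `extend_tags` are hypothetical, and the rule for picking the attribute (names containing `:` go under `property`, everything else under `name`) is only a guess based on the existing entries. Each name would need checking against real pages.

```python
def to_tag_entry(value):
    """Build one entry in the PUBLISH_DATE_TAGS dict format."""
    # Assumption: keys with a colon (og:, article:, rnews:) are stored in
    # the `property` attribute, all other keys in `name`.
    attribute = "property" if ":" in value else "name"
    return {"attribute": attribute, "value": value, "content": "content"}


def extend_tags(existing, extra_names):
    """Return a new list with entries for extra_names appended, skipping duplicates."""
    seen = {(tag["attribute"], tag["value"]) for tag in existing}
    merged = list(existing)
    for name in extra_names:
        entry = to_tag_entry(name)
        key = (entry["attribute"], entry["value"])
        if key not in seen:
            seen.add(key)
            merged.append(entry)
    return merged
```

Skipping duplicates matters here because the suggested list repeats some names and overlaps with entries that already exist.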