newspaper get publish date failed

get publish date failed

Open saha65536 opened this issue 3 years ago • 3 comments

I tested 1000 URLs from 100 websites, and for 90% of them the publish date is None.

May 31 '21 03:05 saha65536

This likely happened because the structure of those 90% differs from that of the other 10%. Sometimes you need to configure newspaper to extract the content based on the web page's structure.

Please provide some examples of the ones that failed.

May 31 '21 13:05 johnbumgarner

https://jp.weforum.org/agenda/2021/03/nado-no-ha-ka/
https://jp.weforum.org/agenda/2021/04/kyasshuresu-no-wo-me-ajiawoyori-na-ni-kuniha/

Jun 03 '21 00:06 saha65536

Newspaper uses three strategies to extract publish dates. They are listed below in descending order of accuracy. If a strategy fails, the next one is attempted.

1. Pubdate from URL
2. Pubdate from metadata
3. Raw regex searches in the HTML + added heuristics

The first strategy fails because the URL doesn't contain a complete date (it has a year and month, but no day).
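To illustrate why a URL-based lookup finds nothing here, below is a simplified sketch. The pattern `URL_DATE` and the function `date_from_url` are hypothetical stand-ins, not newspaper's actual regex or API. The sketch assumes that a "complete date" means year, month, and day appearing as consecutive path segments.

```python
import re
from datetime import date

# Hypothetical, simplified pattern: /YYYY/MM/DD as consecutive path segments.
# This is NOT newspaper's actual regex, only an illustration of the idea.
URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_url(url):
    """Return a date if the URL path holds a full year/month/day, else None."""
    match = URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched, but they do not form a valid calendar date.
        return None


# One of the failing URLs from this thread: year and month only, no day.
failing_url = "https://jp.weforum.org/agenda/2021/03/nado-no-ha-ka/"
print(date_from_url(failing_url))  # None
```

With only `/2021/03/` in the path, the pattern has no day segment to match, so the lookup returns `None` and the next strategy has to take over.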

The second strategy fails because the target website stores the publish date in a tag that newspaper does not query.

The third strategy fails because the date string contains Japanese characters: 2021年03月23日
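A date string in this form can still be parsed with a pattern written specifically for it. Here is a minimal sketch using only the standard library. The names `JP_DATE` and `parse_japanese_date` are hypothetical, and the sketch assumes the 年/月/日 layout shown above.

```python
import re
from datetime import date

# Matches dates such as 2021年03月23日 (year 年, month 月, day 日).
JP_DATE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def parse_japanese_date(text):
    """Return the first 年/月/日 date found in text, or None if there is none."""
    match = JP_DATE.search(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # The digits matched but are not a valid calendar date.
        return None


print(parse_japanese_date("2021年03月23日"))  # 2021-03-23
```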

The best option is for you to use BeautifulSoup to extract the date from the target website.
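A rough sketch of that approach follows. The HTML snippet, the `div` tag, and the `article-published` class are all hypothetical, since the thread does not say which tag the site actually uses. Inspect the page source and adjust the selector accordingly.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical markup: the real page's tag and class names are not given in
# this thread, so this snippet only stands in for the downloaded HTML.
html = """
<article>
  <h1>Example title</h1>
  <div class="article-published">2021年03月23日</div>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("div", class_="article-published")

publish_date = None
if tag is not None:
    # The literal 年/月/日 characters in the format string match the date text.
    raw = tag.get_text(strip=True)
    publish_date = datetime.strptime(raw, "%Y年%m月%d日").date()

print(publish_date)  # 2021-03-23
```

In practice the HTML would come from the downloaded page rather than an inline string, for example from the `html` attribute of a newspaper `Article` after calling `download()`.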

Jun 03 '21 18:06 johnbumgarner
