newspaper
get publish date failed
I tested 1,000 URLs from 100 websites, and for 90% of them the publish date is None.
This likely happened because the structure of those 90% of pages differs from that of the other 10%. Sometimes you need to configure newspaper to extract content based on the structure of the web pages.
Please provide some examples of the ones that failed.
https://jp.weforum.org/agenda/2021/03/nado-no-ha-ka/
https://jp.weforum.org/agenda/2021/04/kyasshuresu-no-wo-me-ajiawoyori-na-ni-kuniha/
Newspaper has several strategies for extracting publish dates. The strategies below are listed in descending order of accuracy. If one strategy fails, the next one is attempted.
- Pubdate from URL
- Pubdate from metadata
- Raw regex searches in the HTML + added heuristics
The first strategy fails because the URL doesn't contain a complete date (only the year and month appear in the path).
The second strategy fails because the target website has the published date in a tag not queried by newspaper.
The third strategy fails because the date string contains Japanese characters: 2021年03月23日
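The first failure is easy to check by hand. Below is a minimal sketch of a URL date check, not newspaper's actual implementation: the pattern `FULL_DATE_IN_URL` and the helper `date_from_url` are hypothetical names made up for illustration, and the `example.com` URL is an invented counterexample.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical pattern (not taken from newspaper): a four-digit year,
# then a month and a day, separated by "/", "-", "_" or ".".
FULL_DATE_IN_URL = re.compile(
    r"(?<!\d)((?:19|20)\d{2})[/\-_.](\d{1,2})[/\-_.](\d{1,2})(?!\d)"
)


def date_from_url(url: str) -> Optional[date]:
    """Return the date embedded in the URL, or None if no complete date is found."""
    match = FULL_DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits that look like a date but are not one, e.g. month 13.
        return None


# The URL from this thread carries only year and month, so nothing is found.
print(date_from_url("https://jp.weforum.org/agenda/2021/03/nado-no-ha-ka/"))
# An invented URL with a full year/month/day path does yield a date.
print(date_from_url("https://example.com/2021/03/23/some-story/"))
```

With a check like this, both weforum URLs fall through to the next strategy because the path stops at the month.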
The best option is to use BeautifulSoup to extract the date from the target website yourself.
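As a rough sketch of that approach, the snippet below parses a date in the 2021年03月23日 format and applies it to the text that BeautifulSoup extracts from a page. The function names `parse_japanese_date` and `publish_date_from_html` are hypothetical, and the thread does not say which tag holds the date, so this version simply takes the first date of that shape found anywhere in the page text. That is an assumption and may pick up the wrong date on some pages.

```python
import re
from datetime import date
from typing import Optional

# Matches dates written as <year>年<month>月<day>日, e.g. 2021年03月23日.
JAPANESE_DATE = re.compile(r"(\d{4})\s*年\s*(\d{1,2})\s*月\s*(\d{1,2})\s*日")


def parse_japanese_date(text: str) -> Optional[date]:
    """Return the first 年/月/日 date found in text, or None."""
    match = JAPANESE_DATE.search(text)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # The digits matched the pattern but do not form a valid date.
        return None


def publish_date_from_html(html: str) -> Optional[date]:
    """Assumption: the first 年/月/日 date in the page text is the publish date."""
    # Imported here so parse_japanese_date stays usable without bs4 installed.
    from bs4 import BeautifulSoup

    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    return parse_japanese_date(text)


print(parse_japanese_date("2021年03月23日"))
```

If you inspect the page and find the specific element that holds the date, narrowing the search to that element before parsing should be more reliable than scanning the whole page text.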