get publish date failed

Open saha65536 opened this issue 3 years ago • 3 comments

I tested 1,000 URLs from 100 websites, and for 90% of them the publish date is None.

saha65536, May 31 '21 03:05

This likely happened because the structure of those 90% differs from that of the other 10%. Sometimes you need to configure newspaper to extract content based on a web page's structure.

Please provide some examples of the ones that failed.

johnbumgarner, May 31 '21 13:05

https://jp.weforum.org/agenda/2021/03/nado-no-ha-ka/
https://jp.weforum.org/agenda/2021/04/kyasshuresu-no-wo-me-ajiawoyori-na-ni-kuniha/

saha65536, Jun 03 '21 00:06

Newspaper has three strategies for extracting publish dates, listed below in descending order of accuracy. If one strategy fails, the next one is attempted.

  1. Pubdate from URL
  2. Pubdate from metadata
  3. Raw regex searches in the HTML + added heuristics

The first strategy fails because the URL contains only a year and month (/2021/03/), not a complete date.
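
To illustrate why the URL check comes up empty, here is a minimal sketch. The pattern below is a simplified stand-in written for this example, not newspaper's actual regex, and `date_from_url` is a hypothetical helper name. It assumes a URL only counts as dated when year, month, and day all appear as consecutive path segments.

```python
import re
from typing import Optional, Tuple

# Simplified, hypothetical pattern (not newspaper's own): requires
# /YYYY/MM/DD as three consecutive path segments.
FULL_DATE_IN_URL = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_url(url: str) -> Optional[Tuple[int, int, int]]:
    """Return (year, month, day) if the URL path holds a complete date."""
    match = FULL_DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return (year, month, day)


# The URL from this issue has a year and month but no day segment.
print(date_from_url("https://jp.weforum.org/agenda/2021/03/nado-no-ha-ka/"))
# A made-up URL with a full date, for comparison.
print(date_from_url("https://example.com/news/2021/03/23/some-story/"))
```

The first call returns None because the segment after /03/ is the article slug rather than a day number.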

The second strategy fails because the target website puts the published date in a tag that newspaper does not query.

The third strategy fails because the date string contains Japanese characters: 2021年03月23日
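
As a rough illustration, a date pattern that expects ASCII separators will skip this string, while an explicit format that spells out the 年/月/日 markers parses it with the standard library. The `ASCII_DATE` pattern and the `parse_jp_date` helper are assumptions made for this sketch and are not taken from newspaper.

```python
import re
from datetime import datetime

raw = "2021年03月23日"

# Hypothetical ASCII-only date pattern, standing in for a generic regex search.
ASCII_DATE = re.compile(r"\d{4}[-/.]\d{1,2}[-/.]\d{1,2}")


def parse_jp_date(text: str) -> datetime:
    """Parse a date written with the Japanese year/month/day markers."""
    return datetime.strptime(text, "%Y年%m月%d日")


print(ASCII_DATE.search(raw))             # no match: separators are 年 and 月
print(parse_jp_date(raw).date().isoformat())
```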

The best option is for you to use BeautifulSoup to extract the date from the target website.
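
A minimal sketch of that approach follows, assuming the date appears somewhere in the page text in the 2021年03月23日 form. The sample markup, the class name `article-published`, and the helper `extract_jp_publish_date` are hypothetical; inspect the real page to see which element actually holds the date.

```python
import re
from datetime import date
from typing import Optional

from bs4 import BeautifulSoup

# Matches dates written like 2021年03月23日.
JP_DATE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def extract_jp_publish_date(html: str) -> Optional[date]:
    """Return the first Japanese-formatted date found in the page text, or None."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(string=JP_DATE)  # first text node containing such a date
    if node is None:
        return None
    year, month, day = map(int, JP_DATE.search(node).groups())
    return date(year, month, day)


# Hypothetical markup for demonstration only.
sample_html = """
<html><body>
  <article>
    <div class="article-published">2021年03月23日</div>
    <p>Article text.</p>
  </article>
</body></html>
"""

print(extract_jp_publish_date(sample_html))
```

Searching text nodes rather than one specific tag keeps the sketch independent of the site's layout, at the cost of possibly picking up an unrelated date that appears earlier on the page. If newspaper has already downloaded the page, the HTML held in `article.html` could be passed in instead of the sample string.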

johnbumgarner, Jun 03 '21 18:06