Date regex should not assume day of month from just the first (two) digits after /

Open bact opened this issue 3 years ago • 4 comments

It looks like the current urls.STRICT_DATE_REGEX immediately takes the first one or two digits after a slash as the day of the month.

>>> from newspaper import Article
>>> url = "https://prachatai.com/journal/2021/04/92713"
>>> article = Article(url)
>>> article.download()
>>> article.parse()
>>> article.publish_date
datetime.datetime(2021, 4, 9, 0, 0)

Test list

  • https://prachatai.com/journal/2020/06/88083
    • actual publish date: 2020-06-11
    • date from newspaper: 2020-06-08
  • https://prachatai.com/journal/2021/04/92713
    • actual publish date: 2021-04-25
    • date from newspaper: 2021-04-09
  • https://prachatai.com/journal/2021/04/92735
    • actual publish date: 2021-04-26
    • date from newspaper: 2021-04-09
  • https://prachatai.com/journal/2021/05/92906
    • actual publish date: 2021-05-06
    • date from newspaper: 2021-05-09
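
The mismatches in the test list all follow one pattern: the reported day equals the first digit of the article's reference number. Here is a minimal sketch of how that can happen, assuming a simplified pattern written for illustration. LOOSE_DATE and loose_url_date are hypothetical names, and this is not the library's actual regex.

```python
import re
from datetime import datetime

# Simplified, hypothetical pattern (NOT newspaper's own regex).
# After /YYYY/MM/ it accepts one or two digits as the day and does not
# check whether more digits follow.
LOOSE_DATE = re.compile(r"/((?:19|20)\d{2})/(\d{1,2})/([0-3]?\d)")


def loose_url_date(url):
    """Return the date guessed from the URL path, or None if nothing matches."""
    match = LOOSE_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:
        # The digits matched but do not form a real calendar date.
        return None


if __name__ == "__main__":
    for url in (
        "https://prachatai.com/journal/2020/06/88083",
        "https://prachatai.com/journal/2021/04/92713",
        "https://prachatai.com/journal/2021/05/92906",
    ):
        # The "day" is just the leading digit of the reference number.
        print(url, loose_url_date(url))
```

With this pattern, the leading 8 of 88083 and the leading 9 of 92713 and 92906 are read as the day, which matches the dates reported by newspaper for these URLs.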

Related lines of code

https://github.com/codelucas/newspaper/blob/f622011177f6c2e95e48d6076561e21c016f08c3/newspaper/extractors.py#L191-L196

bact • May 06 '21 17:05

I did some research into this issue. Are the digits 92906 the article's reference number? If this is the article's reference number then Newspaper will always fail to convert this date correctly. I noted that prachatai.com doesn't have its article published date in any other tag that Newspaper extracts from.

I would recommend extracting the published date from prachatai.com's article using BeautifulSoup. Look at my newspaper3 usage overview document for examples on how to do this.

johnbumgarner • May 10 '21 13:05

Yes, that 92906 part is the article's reference number.

Thank you for the pointer, I will take a look at that.

bact • May 11 '21 04:05

You're welcome. Please close this issue, because it isn't really a Newspaper issue.

johnbumgarner • May 11 '21 22:05

If I may suggest: if the year and month are present in the URL but the day is not, should we default to day = 1 instead of picking the first digit after the / ?

bact • May 15 '21 09:05
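
The suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions: URL_DATE and url_date_with_default_day are hypothetical names, the pattern is not newspaper's own, and the example.com URL is made up. The idea is to accept a day only when its one or two digits are not followed by another digit, and otherwise to fall back to day 1.

```python
import re
from datetime import datetime

# Hypothetical pattern, not part of newspaper.
URL_DATE = re.compile(
    r"""
    /((?:19|20)\d{2})                 # year
    /(0?[1-9]|1[0-2])                 # month
    (?:/(0?[1-9]|[12]\d|3[01]))?      # optional day of month
    (?!\d)                            # the date must not run into more digits
    """,
    re.VERBOSE,
)


def url_date_with_default_day(url):
    """Guess a date from the URL; use day 1 when no clean day is present."""
    match = URL_DATE.search(url)
    if match is None:
        return None
    year, month = int(match.group(1)), int(match.group(2))
    # group(3) is None when the segment after the month is not a clean day,
    # for example a five-digit reference number.
    day = int(match.group(3)) if match.group(3) else 1
    try:
        return datetime(year, month, day)
    except ValueError:
        # For example day 31 in a 30-day month.
        return None


if __name__ == "__main__":
    print(url_date_with_default_day("https://prachatai.com/journal/2021/04/92713"))
    print(url_date_with_default_day("https://example.com/2021/04/25/some-title"))
```

Note that defaulting to day 1 does not recover the actual publish date (2021-04-25 for the first URL). It only avoids reporting a day that was never in the URL.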