newspaper Date regex should not assume date of month from just first (two) digits after /

Open bact opened this issue 3 years ago • 4 comments

It looks like the current urls.STRICT_DATE_REGEX immediately takes the first one or two digits after a slash as the day of the month, even when those digits are only the start of a longer number.

>>> from newspaper import Article
>>> url = "https://prachatai.com/journal/2021/04/92713"
>>> article = Article(url)
>>> article.download()
>>> article.parse()
>>> article.publish_date
datetime.datetime(2021, 4, 9, 0, 0)
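To make the behaviour above easier to reproduce without downloading anything, here is a minimal sketch. The pattern `SIMPLIFIED_DATE_REGEX` and the helper `loose_date_parts` are hypothetical stand-ins written for this illustration; they are not the library's actual `STRICT_DATE_REGEX`, only a simplified pattern that shares the same weakness (an optional day group with nothing requiring the day to be a complete path segment).

```python
import re

# Simplified stand-in, NOT newspaper's real pattern: year, month, then an
# optional one- or two-digit "day" with no check on what follows it.
SIMPLIFIED_DATE_REGEX = re.compile(
    r"/((?:19|20)\d{2})/([0-3]?[0-9])/([0-3]?[0-9])?"
)


def loose_date_parts(url):
    """Return (year, month, day) as matched by the loose pattern, or None."""
    match = SIMPLIFIED_DATE_REGEX.search(url)
    if match is None:
        return None
    year, month, day = match.groups()
    return int(year), int(month), int(day) if day else None


# The leading "9" of the article reference number 92713 is taken as the day.
print(loose_date_parts("https://prachatai.com/journal/2021/04/92713"))
# (2021, 4, 9)
```

Because the day group may match a single digit and is not anchored to a separator, the first digit of the reference number is enough to satisfy it.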

Test list

https://prachatai.com/journal/2020/06/88083
- actual publish date: 2020-06-11
- date from newspaper: 2020-06-08
https://prachatai.com/journal/2021/04/92713
- actual publish date: 2021-04-25
- date from newspaper: 2021-04-09
https://prachatai.com/journal/2021/04/92735
- actual publish date: 2021-04-26
- date from newspaper: 2021-04-09
https://prachatai.com/journal/2021/05/92906
- actual publish date: 2021-05-06
- date from newspaper: 2021-05-09

Related lines of code

https://github.com/codelucas/newspaper/blob/f622011177f6c2e95e48d6076561e21c016f08c3/newspaper/extractors.py#L191-L196

May 06 '21 17:05 bact

I did some research into this issue. Are the digits 92906 the article's reference number? If so, Newspaper will always fail to extract the correct date from this URL. I also noted that prachatai.com doesn't expose the article's publish date in any of the other tags that Newspaper extracts from.

I would recommend extracting the published date from prachatai.com's article using BeautifulSoup. Look at my newspaper3 usage overview document for examples on how to do this.

May 10 '21 13:05 johnbumgarner

Yes, that 92906 part is the article's reference number.

Thank you for the pointer, I will take a look at that.

May 11 '21 04:05 bact

You're welcome. Please close this issue, because it wasn't really a Newspaper issue.

May 11 '21 22:05 johnbumgarner

If I may suggest: if the year and month are present in the URL but the day is not, should we default to day = 1 instead of picking the first digit(s) after the / ?
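A rough sketch of what that suggestion could look like, for discussion only. The names `YEAR_MONTH_DAY` and `date_from_url` are hypothetical and the pattern is not taken from newspaper; it assumes `/`-separated date segments and accepts a day only when its digits form a complete segment (followed by a separator or the end of the URL). The example.com URL is made up for illustration.

```python
import re
from datetime import datetime

# Hypothetical pattern: the optional day group matches only if the digits
# are followed by a separator or the end of the string.
YEAR_MONTH_DAY = re.compile(
    r"/((?:19|20)\d{2})"                                   # year
    r"/(0?[1-9]|1[0-2])"                                   # month
    r"/(?:(0?[1-9]|[12][0-9]|3[01])(?=[/._-]|$))?"         # optional day
)


def date_from_url(url):
    """Return a datetime from the URL path, defaulting to day 1 if absent."""
    match = YEAR_MONTH_DAY.search(url)
    if match is None:
        return None
    year, month, day = match.groups()
    try:
        return datetime(int(year), int(month), int(day) if day else 1)
    except ValueError:
        # Impossible calendar dates such as 2021/02/31 are rejected.
        return None


print(date_from_url("https://prachatai.com/journal/2021/04/92713"))
# 2021-04-01 00:00:00
print(date_from_url("https://example.com/news/2021/04/25/some-slug"))
# 2021-04-25 00:00:00
```

With this approach the reference number 92713 no longer contributes a day, so the result is wrong by at most the day of the month rather than pointing at an arbitrary day. Whether a defaulted day is preferable to returning no date at all is a separate design question.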

May 15 '21 09:05 bact
