newspaper
Date regex should not assume date of month from just first (two) digits after /
It looks like the current urls.STRICT_DATE_REGEX immediately takes the first one or two digits after a slash as the day of the month.
>>> from newspaper import Article
>>> url = "https://prachatai.com/journal/2021/04/92713"
>>> article = Article(url)
>>> article.download()
>>> article.parse()
>>> article.publish_date
datetime.datetime(2021, 4, 9, 0, 0)
Test list
- https://prachatai.com/journal/2020/06/88083
  - actual publish date: 2020-06-11
  - date from newspaper: 2020-06-08
- https://prachatai.com/journal/2021/04/92713
  - actual publish date: 2021-04-25
  - date from newspaper: 2021-04-09
- https://prachatai.com/journal/2021/04/92735
  - actual publish date: 2021-04-26
  - date from newspaper: 2021-04-09
- https://prachatai.com/journal/2021/05/92906
  - actual publish date: 2021-05-06
  - date from newspaper: 2021-05-09
Related lines of code
https://github.com/codelucas/newspaper/blob/f622011177f6c2e95e48d6076561e21c016f08c3/newspaper/extractors.py#L191-L196
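The behaviour described above can be reproduced in isolation with a rough sketch. The pattern below is a simplified stand-in that I am assuming for illustration, not the library's actual STRICT_DATE_REGEX, and the names `LOOSE_URL_DATE` and `loose_url_date` are hypothetical. It only shows why a day group with no trailing boundary picks up the first digit of the reference number.

```python
import re
from datetime import datetime

# Hypothetical, simplified pattern: NOT the library's actual STRICT_DATE_REGEX.
# The day group has no trailing boundary, so it matches the leading digit(s)
# of a longer number such as an article reference number.
LOOSE_URL_DATE = re.compile(r"/((?:19|20)\d{2})/(\d{1,2})/([0-3]?\d)")


def loose_url_date(url):
    """Return the date this loose pattern reads from a URL, or None."""
    match = LOOSE_URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:  # e.g. month 13 or day 0
        return None


for url in (
    "https://prachatai.com/journal/2020/06/88083",
    "https://prachatai.com/journal/2021/04/92713",
):
    print(url, "->", loose_url_date(url))
```

With this sketch the two URLs come out as 2020-06-08 and 2021-04-09, the same wrong days as in the test list, because `[0-3]?\d` accepts the single digit `8` or `9` and ignores the digits that follow.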
I did some research into this issue. Are the digits 92906 the article's reference number? If so, Newspaper will always fail to extract the correct date from this URL, because the URL contains no day of the month. I also noted that prachatai.com doesn't put the article's publish date in any other tag that Newspaper extracts from.
I would recommend extracting the publish date from prachatai.com's article pages using BeautifulSoup. Look at my Newspaper3k usage overview document for examples of how to do this.
Yes, that 92906 part is the article's reference number.
Thank you for the pointer, I will take a look at that.
You're welcome. Please close this issue, because it wasn't really a Newspaper issue.
If I can make a suggestion: when the year and month are present in the URL but the day is not, should we default to day = 1 instead of picking the first digit after the / ?
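To make the suggestion concrete, here is a minimal sketch of what that fallback could look like. It is not Newspaper's code: the pattern `URL_DATE`, the function `url_date_with_default_day` and its `default_day` parameter are hypothetical names I made up for illustration. The idea is to accept a day only when it is a complete one- or two-digit path segment start (not the prefix of a longer number), and otherwise fall back to the first of the month.

```python
import re
from datetime import datetime

# Hypothetical pattern for illustration; not part of Newspaper.
# The (?!\d) lookaheads reject a month or day that is only the prefix
# of a longer run of digits (such as an article reference number).
URL_DATE = re.compile(
    r"/((?:19|20)\d{2})"                   # year
    r"/(0?[1-9]|1[0-2])(?!\d)"             # month, 1-12
    r"(?:/(0?[1-9]|[12]\d|3[01])(?!\d))?"  # optional day, 1-31
)


def url_date_with_default_day(url, default_day=1):
    """Read year/month(/day) from a URL; use default_day if no day is present."""
    match = URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = match.groups()
    try:
        return datetime(int(year), int(month), int(day) if day else default_day)
    except ValueError:  # e.g. 31 April
        return None


print(url_date_with_default_day("https://prachatai.com/journal/2021/04/92713"))
print(url_date_with_default_day("https://example.com/2021/04/25/some-title"))
```

Under these assumptions the prachatai.com URL gives 2021-04-01 rather than 2021-04-09, while a URL that really contains a day (the example.com one is made up) still gives 2021-04-25. The result for prachatai.com is still not the actual publish date, only a less misleading guess.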