Adrien Barbaresi
Problems found:
- [x] date present in title, not easily extractable otherwise: https://web.archive.org/web/20210909014928/https://www.dw.com/de/was-vor-corona-sch%C3%BCtzt-wird-f%C3%BCr-die-umwelt-ein-problem/a-53217831
- [x] useless data sent to external parser (legal references): https://web.archive.org/web/20201211092330/https://openjur.de/u/2309866.html
- [x] phone numbers sent to...
The first example is especially tricky: the date in the right column is tagged as a proper date in the HTML, whereas the date in the main content isn't.
Hi @kinoute, `htmldate` considers that the date `1991-01-02` isn't valid. You can try to set the parameter `min_date` in `find_date()` to change this, e.g. `min_date="1990-01-01"`.
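To illustrate why a date such as `1991-01-02` can be rejected, here is a minimal sketch of a lower-bound check using only the standard library. It is not `htmldate`'s actual code: the helper `is_plausible` and the default bound `DEFAULT_MIN_DATE` are hypothetical names and values chosen for illustration. With the library itself, the corresponding call would pass `min_date="1990-01-01"` to `find_date()` as described above.

```python
from datetime import date

# Hypothetical default lower bound, assumed here for illustration only.
DEFAULT_MIN_DATE = date(1995, 1, 1)


def is_plausible(candidate, min_date=None, max_date=None):
    """Return True if an ISO date string falls inside the accepted window."""
    parsed = date.fromisoformat(candidate)
    # Fall back to the assumed default when no explicit bound is given.
    lower = date.fromisoformat(min_date) if min_date else DEFAULT_MIN_DATE
    upper = date.fromisoformat(max_date) if max_date else date.today()
    return lower <= parsed <= upper


# Rejected under the assumed default bound.
print(is_plausible("1991-01-02"))
# Accepted once the lower bound is moved back.
print(is_plausible("1991-01-02", min_date="1990-01-01"))
```

Lowering the bound widens the set of accepted candidates, which is the effect the `min_date` parameter is meant to have.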
@kinoute Thanks for pointing that out, it's a bug.
Hi @kvasilopoulos, I can reproduce your example. The URLs are scanned first because the information they contain is usually more significant than what is in the header. It is a...
I can imagine it would be possible to change the extraction order if a format like `%Y-%m-%d %H:%M:%S` is requested. The to-do list would then look as follows: - [...
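The idea of switching the extraction order when a time-bearing format is requested could be sketched as follows. This is only an assumption about how such a switch might look, not existing `htmldate` behavior: `wants_time`, `extraction_order`, and the source labels `"url"` and `"header"` are hypothetical.

```python
import re

# Directives that carry time-of-day information in strftime-style formats.
TIME_DIRECTIVES = re.compile(r"%[HIMSfp]")


def wants_time(outputformat):
    """Return True if the requested format includes a time component."""
    # Drop escaped percent signs so that "%%H" is not read as "%H".
    cleaned = outputformat.replace("%%", "")
    return bool(TIME_DIRECTIVES.search(cleaned))


def extraction_order(outputformat):
    """Choose which source to scan first (hypothetical labels)."""
    if wants_time(outputformat):
        # URLs rarely contain a time of day, so try the header first.
        return ["header", "url"]
    return ["url", "header"]


print(extraction_order("%Y-%m-%d"))
print(extraction_order("%Y-%m-%d %H:%M:%S"))
```

The default order stays unchanged for date-only formats, so existing results would not be affected by such a switch.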
Hi @rahulbot, it would be OK, but I'd prefer to get the chance to tackle the problem first. There is certainly a field in the HTML where the date can...
@coreydockser Thanks, I'll look at it and see if I can find a solution.
Hi @coreydockser, I checked the cases and I don't agree with you at all: - A few results were different (maybe you didn't try the last version). - Besides, `None`...
Thanks for the explanations, I get your point. Indeed, `htmldate` mostly provides a technology-informed concept of dating. It hopefully intersects with the news-ish definition in most cases; however, the two may...