Adrien Barbaresi comments

Results: 319 comments

Test htmldate on further web pages and report bugs

Problems found:
- [x] date present in title, not easily extractable otherwise: https://web.archive.org/web/20210909014928/https://www.dw.com/de/was-vor-corona-sch%C3%BCtzt-wird-f%C3%BCr-die-umwelt-ein-problem/a-53217831
- [x] useless data sent to external parser (legal references): https://web.archive.org/web/20201211092330/https://openjur.de/u/2309866.html
- [x] phone numbers sent to...

Test htmldate on further web pages and report bugs

The first example is especially tricky: the date in the right column is tagged as a proper date in the HTML, whereas the date in the main content isn't.

Test htmldate on further web pages and report bugs

Hi @kinoute, `htmldate` considers that the date `1991-01-02` isn't valid. You can try to set the parameter `min_date` in `find_date()` to change this, e.g. `min_date="1990-01-01"`.
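To illustrate why a lower bound makes `1991-01-02` disappear, here is a minimal standard-library sketch. It is not `htmldate`'s actual code: `filter_candidates` and the bound values are hypothetical, chosen only to show the effect of lowering `min_date`.

```python
from datetime import date

# Hypothetical helper, not part of htmldate: keep only the candidate
# dates that fall inside the accepted window.
def filter_candidates(candidates, min_date, max_date):
    return [d for d in candidates if min_date <= d <= max_date]

candidates = [date(1991, 1, 2), date(2021, 9, 9)]
max_date = date(2022, 1, 1)  # assumed upper bound for this example

# With an assumed lower bound later than 1991, the older date is rejected.
strict = filter_candidates(candidates, date(1995, 1, 1), max_date)

# Lowering the bound to 1990-01-01 lets it through.
relaxed = filter_candidates(candidates, date(1990, 1, 1), max_date)

print(strict)
print(relaxed)
```

With the library itself, the equivalent change is the call mentioned above, i.e. passing `min_date="1990-01-01"` to `find_date()`.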

Test htmldate on further web pages and report bugs

@kinoute Thanks for pointing that out, it's a bug.

return datetime instead of date

Hi @kvasilopoulos, I can reproduce your example. The URLs are scanned first because the information they contain is usually more significant than that in the header. It is a...
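The ordering described here can be sketched roughly as follows. This is a simplified illustration using regular expressions, not the library's real extraction logic; `sketch_find_date`, both patterns, and the example URLs are assumptions made for the example.

```python
import re

# Assumed patterns for this sketch only.
URL_DATE = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")
META_DATE = re.compile(r'<meta[^>]+content="(\d{4}-\d{2}-\d{2})')

def sketch_find_date(html, url):
    """Hypothetical extractor: look at the URL first, then the header."""
    match = URL_DATE.search(url)
    if match:
        # A date in the URL wins, so any time information in the
        # header is never reached.
        return "-".join(match.groups())
    match = META_DATE.search(html)
    if match:
        return match.group(1)
    return None

html = '<meta property="article:published_time" content="2020-05-18T10:30:00">'

print(sketch_find_date(html, "https://example.org/2020/05/17/post"))
print(sketch_find_date(html, "https://example.org/post"))
```

In this sketch the header is only consulted when the URL carries no date, which is why a more precise timestamp in the header can go unused.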

return datetime instead of date

I can imagine it would be possible to change the extraction order if a format like `%Y-%m-%d %H:%M:%S` is requested. The to-do list would then look as follows: - [...
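A small standard-library sketch of why the order matters for such a format: a source that only carries a day yields midnight once formatted with `%Y-%m-%d %H:%M:%S`, whereas a full timestamp keeps its time. The `wants_time` helper and the sample values are hypothetical, not part of `htmldate`.

```python
from datetime import datetime

FORMAT = "%Y-%m-%d %H:%M:%S"

def wants_time(output_format):
    """Hypothetical check: does the requested format include a time?"""
    return any(code in output_format for code in ("%H", "%M", "%S"))

# A day-only value (e.g. taken from a URL) has no time component.
from_url = datetime.strptime("2020-05-17", "%Y-%m-%d")
# A full timestamp (e.g. taken from a header field) does.
from_header = datetime.fromisoformat("2020-05-17T10:30:00")

url_out = from_url.strftime(FORMAT)        # time defaults to 00:00:00
header_out = from_header.strftime(FORMAT)  # time is preserved

print(wants_time(FORMAT), url_out, header_out)
```

A check like `wants_time` is one conceivable way to decide when sources with time information should be tried before the URL.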

ignore undateable domains more intentionally

Hi @rahulbot, it would be OK but I'd prefer to get the chance to tackle the problem first. There is certainly a field in the HTML where the date can...

ignore undateable domains more intentionally

@coreydockser Thanks, I'll look at it and see if I can find a solution.

ignore undateable domains more intentionally

Hi @coreydockser, I checked the cases and I don't agree with you at all:
- A few results were different (maybe you didn't try the latest version).
- Besides, `None`...

ignore undateable domains more intentionally

Thanks for the explanations, I get your point. Indeed, `htmldate` mostly provides a technology-informed concept of dating. It hopefully overlaps with the news-oriented definition in most cases; however, the two may...