Datetime coming from response headers issue
I noticed that htmldate utilizes the find_date function, which internally relies on examine_header.
Does it make sense to parse the response header from the server? Do servers typically default this to the current date?
Here’s an example where this date is extracted: '2024-12-02'...
from htmldate import find_date
find_date(
    "https://octopus.energy/blog/agile-octopus-bigger-story/",
    original_date=True,
    extensive_search=True,
)
But the actual publication date is...
If I comment out the examine_header lines, the correct date (2022-12-13) is extracted during the # last resort step.
Hi @SamComber, Htmldate does not use the server response. The header in question is the one in the HTML document, where a meta tag is set to a much later date, as I just checked: <meta name="created" content="6th Dec 2024 12:01">.
Sometimes the information on the pages is not reliable and it's hard to discriminate between several fields which are all plausible dates.
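To see which date-like meta tags a page actually declares, something along these lines may help. It is a minimal sketch using only the Python standard library, not htmldate's own logic: the helper list_meta_dates, the DATE_HINTS keywords, and the sample HTML (including the article:published_time tag) are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical keyword list: substrings that suggest a meta tag carries a date.
DATE_HINTS = ("date", "created", "published", "modified", "time")


class MetaDateCollector(HTMLParser):
    """Collects (key, content) pairs from <meta> tags that look date-related."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # The identifying key may live in name, property, or itemprop.
        key = attr.get("name") or attr.get("property") or attr.get("itemprop") or ""
        content = attr.get("content")
        if content and any(hint in key.lower() for hint in DATE_HINTS):
            self.candidates.append((key, content))


def list_meta_dates(html):
    """Return all date-like meta tags found in an HTML string (illustrative helper)."""
    collector = MetaDateCollector()
    collector.feed(html)
    return collector.candidates


# Made-up sample document; only the "created" tag mirrors the one quoted above.
sample = """<html><head>
<meta name="created" content="6th Dec 2024 12:01">
<meta property="article:published_time" content="2022-12-13">
<meta name="viewport" content="width=device-width">
</head><body></body></html>"""

for key, value in list_meta_dates(sample):
    print(key, "->", value)
```

When several plausible fields show up like this, it becomes easier to judge which one the page author meant as the publication date.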