htmldate

Datetime coming from response headers issue

Open SamComber opened this issue 1 year ago • 1 comments

I noticed that htmldate utilizes the find_date function, which internally relies on examine_header.

Does it make sense to parse the response header from the server? Do servers typically default this to the current date?

Here’s an example where this date is extracted: '2024-12-02'...

from htmldate import find_date

find_date(
    "https://octopus.energy/blog/agile-octopus-bigger-story/",
    original_date=True,
    extensive_search=True,
)

But the actual publication date is...

image

If I comment out the examine_header lines, the correct date (2022-12-13) is extracted during the "# last resort" step.

SamComber, Dec 02 '24 22:12

Hi @SamComber, Htmldate does not use the server response. The header in question is the one in the HTML document itself, where a meta tag is set to a much later date, as I just checked: <meta name="created" content="6th Dec 2024 12:01">.

Sometimes the information on the pages is not reliable and it's hard to discriminate between several fields which are all plausible dates.
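To see which meta tags a page exposes before trusting an extracted date, a quick inspection along these lines may help. This is only a sketch using the Python standard library. MetaCollector and list_meta_tags are hypothetical helper names, not part of htmldate, and the sample HTML is made up apart from the meta tag quoted above.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/property -> content pairs from <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.metas = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Meta tags identify themselves with either "name" or "property".
        key = attr.get("name") or attr.get("property")
        if key and attr.get("content"):
            self.metas[key] = attr["content"]


def list_meta_tags(html):
    """Return a dict of the meta tags found in an HTML string."""
    parser = MetaCollector()
    parser.feed(html)
    return parser.metas


# Made-up document; only the "created" tag comes from the page discussed here.
sample = """<html><head>
<meta name="created" content="6th Dec 2024 12:01">
<meta property="og:title" content="Example">
</head><body><p>Post body</p></body></html>"""

metas = list_meta_tags(sample)
for key, value in metas.items():
    print(key, "->", value)
```

If a field such as "created" shows a date later than the visible publication date, that field is a likely source of the mismatch.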

adbar, Dec 06 '24 12:12