htmldate

missing timezone

Open • TheCutestCat opened this issue 1 year ago • 1 comment

The relevant HTML file: htmldate_debug_no_timezone.html.zip

Thanks for all your hard work! htmldate is very useful to me. However, when I use it, there is no timezone in the result. I tried the following code:

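from pathlib import Path

from htmldate import find_date
from lxml import html

# Placeholder path to the unzipped attachment; adjust to the local file.
input_path = "htmldate_debug_no_timezone.html"
content = Path(input_path).read_text(encoding="utf-8")
mytree = html.fromstring(content)

# Request an output format that includes the UTC offset (%z).
publish_time = find_date(mytree, outputformat="%Y-%m-%d %H:%M:%S%z")
print(publish_time)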

The HTML contains:

"datePublished": "2024-11-06T08:37:00+05:30",

The result I expected:

2024-11-06T08:37:00+05:30

The result from htmldate:

2024-11-06 00:00:00

There is no timezone. Could you please look into this? I would be glad to help fix it.

TheCutestCat, Nov 28 '24 06:11
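
For context, the shape of the reported output matches what Python's standard library produces for a date that carries neither a time of day nor an offset: `strftime` renders `%H:%M:%S` as midnight and `%z` as an empty string. The sketch below uses only `datetime` to show this. It is an illustration of the formatting behavior, not htmldate's internal code.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M:%S%z"

# A naive, date-only value: no time of day and no tzinfo attached.
naive = datetime(2024, 11, 6)
naive_out = naive.strftime(FMT)

# The value from the page's metadata, including its +05:30 offset.
ist = timezone(timedelta(hours=5, minutes=30))
aware = datetime(2024, 11, 6, 8, 37, tzinfo=ist)
aware_out = aware.strftime(FMT)

print(naive_out)  # 2024-11-06 00:00:00
print(aware_out)  # 2024-11-06 08:37:00+0530
```

Note that even with the offset preserved, this format string would yield `2024-11-06 08:37:00+0530` rather than the ISO form `2024-11-06T08:37:00+05:30` given as the expected result.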

Hi @TheCutestCat, when a date is found in HTML markup you get the time zone, but when it is extracted from free text, regular expressions are applied, and these don't include time zones for now. Feel free to have a look and draft a pull request. Your case is handled here (along with others above and below):

https://github.com/adbar/htmldate/blob/9c5f619db70fd6e32ceab6ebd63af60ff1f6b166/htmldate/extractors.py#L135

adbar, Nov 29 '24 10:11
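
As a starting point for such a pull request, here is a minimal sketch of a regular expression that also captures an optional UTC offset in a `datePublished` field. The pattern and the function name `extract_date_published` are hypothetical and written for illustration only; they are not taken from htmldate's `extractors.py`.

```python
import re
from datetime import datetime

# Hypothetical pattern: ISO date and time, then an optional "Z" or +HH:MM / +HHMM offset.
DATE_PUBLISHED = re.compile(
    r'"datePublished":\s*"'
    r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})'  # group 1: date and time
    r'(Z|[+-]\d{2}:?\d{2})?"'                 # group 2: optional offset
)


def extract_date_published(text):
    """Return a datetime (aware if an offset was found), or None if no match."""
    match = DATE_PUBLISHED.search(text)
    if match is None:
        return None
    stamp, offset = match.group(1), match.group(2)
    if offset is None:
        return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
    # %z accepts "Z", "+0530" and "+05:30" on Python 3.7 and later.
    return datetime.strptime(stamp + offset, "%Y-%m-%dT%H:%M:%S%z")


sample = '"datePublished": "2024-11-06T08:37:00+05:30",'
found = extract_date_published(sample)
print(found.isoformat())  # 2024-11-06T08:37:00+05:30
```

Keeping the offset in its own optional group means strings without a time zone still match and simply yield a naive `datetime`.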