The script does not find the date (Russian)

Open PetroffSky opened this issue 1 year ago • 3 comments

The script does not find the date (Russian):

from htmldate import find_date

url = "https://kamaz.ru/press/releases/kamaz_i_skolkovo_sozdadut_ekologicheski_chistyy_gruzovik/"

print(find_date(url, extensive_search=True))  # Returns None
print(find_date(url, extensive_search=False))  # Returns None

XPath selector for the date on the page: //div[contains(text(), 'July 30, 2015')]

PetroffSky, Jul 03 '24 05:07

Something has to be added to the extractors, otherwise the div element will not be processed (for example, a condition that its class contains "news" or "detail").

adbar, Jul 16 '24 15:07

Hello! I'm sorry, that was my mistake. Here is the correct XPath: //div[contains(text(), '30 Июля 2015')]

Also, here are the Russian month names in order, in two forms (nominative and genitive):

months = ['январь', 'февраль', 'март', 'апрель', 'май', 'июнь', 'июль', 'август', 'сентябрь', 'октябрь', 'ноябрь', 'декабрь']

or

months = ['января', 'февраля', 'марта', 'апреля', 'мая', 'июня', 'июля', 'августа', 'сентября', 'октября', 'ноября', 'декабря']
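To illustrate how month lists like these could be used, here is a minimal sketch that maps both forms to month numbers and parses a "day month year" string such as the one on the page. It is not htmldate's code: the names parse_russian_date, MONTH_NUMBER and DATE_PATTERN are hypothetical, and the regular expression is only an assumption about how such dates are written.

```python
import re
from datetime import date

# Nominative and genitive month names, in calendar order.
MONTHS_NOMINATIVE = ['январь', 'февраль', 'март', 'апрель', 'май', 'июнь',
                     'июль', 'август', 'сентябрь', 'октябрь', 'ноябрь', 'декабрь']
MONTHS_GENITIVE = ['января', 'февраля', 'марта', 'апреля', 'мая', 'июня',
                   'июля', 'августа', 'сентября', 'октября', 'ноября', 'декабря']

# Hypothetical lookup table: lowercase month name -> month number (1-12).
MONTH_NUMBER = {}
for names in (MONTHS_NOMINATIVE, MONTHS_GENITIVE):
    for number, name in enumerate(names, start=1):
        MONTH_NUMBER[name] = number

# Assumed date shape: day, Cyrillic month word, four-digit year.
DATE_PATTERN = re.compile(r"(\d{1,2})\s+([а-яё]+)\s+(\d{4})", re.IGNORECASE)


def parse_russian_date(text):
    """Return the first valid 'day month year' date in text, or None."""
    for match in DATE_PATTERN.finditer(text):
        day, month_name, year = match.groups()
        month = MONTH_NUMBER.get(month_name.lower())
        if month is None:
            continue  # the word between the numbers is not a month name
        try:
            return date(int(year), month, int(day))
        except ValueError:
            continue  # impossible calendar date, e.g. day 31 in February
    return None


print(parse_russian_date("30 Июля 2015"))
```

The lookup is case-insensitive because the page capitalizes the month name ("Июля"), while the lists above are lowercase.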

PetroffSky, Jul 17 '24 06:07

I meant that someone needs to add a precise XPath target. Using //div[contains(text())] or simply //div//text() would be bad for accuracy, because random dates in a text are often irrelevant.
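The difference between a broad and a precise target can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample markup and its class names are invented (they are not the real kamaz.ru page), candidate_texts and CLASS_HINTS are hypothetical names, and the sketch uses the standard library's ElementTree with a class check in Python instead of an XPath expression such as //div[contains(@class, 'news') or contains(@class, 'detail')].

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for a page; not the real kamaz.ru markup.
SAMPLE = """
<html><body>
  <div class="footer">Copyright notice, 1 Января 2010</div>
  <div class="news-detail">
    <div class="news-date">30 Июля 2015</div>
    <p>Article text mentioning 12 Мая 2014 in passing.</p>
  </div>
</body></html>
"""

# Class fragments mentioned in the thread as possible hints.
CLASS_HINTS = ("news", "detail")


def candidate_texts(root, hints=CLASS_HINTS):
    """Collect the direct text of div elements whose class contains a hint."""
    found = []
    for div in root.iter("div"):
        css_class = div.get("class", "")
        text = (div.text or "").strip()
        if text and any(hint in css_class for hint in hints):
            found.append(text)
    return found


root = ET.fromstring(SAMPLE)

# Broad selection: every text node, including the footer and body text.
broad = [t.strip() for t in root.itertext() if t.strip()]

# Narrow selection: only divs whose class suggests a news or detail block.
narrow = candidate_texts(root)

print(len(broad), "broad candidates:", broad)
print(len(narrow), "narrow candidates:", narrow)
```

In this sample the broad selection also picks up the footer date and the date mentioned in passing, while the narrow one keeps only the text of the div whose class contains "news".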

As for the months, if you're interested, you could add them to the extractor in a pull request.

adbar, Jul 19 '24 09:07