Test htmldate on further web pages and report bugs

Open adbar opened this issue 4 years ago • 15 comments

I have mostly tested htmldate on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction of a date doesn't work so far.

Please install the dateparser library beforehand, as it significantly extends linguistic coverage: pip (or pip3) install -U dateparser, or pip install -U htmldate[all].

Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in core.py (see DATE_EXPRESSIONS and ADDITIONAL_EXPRESSIONS).

Thanks!

Jan 03 '20 16:01 adbar

Problems found:

[x] date present in title, not easily extractable otherwise: https://web.archive.org/web/20210909014928/https://www.dw.com/de/was-vor-corona-sch%C3%BCtzt-wird-f%C3%BCr-die-umwelt-ein-problem/a-53217831
[x] useless data sent to external parser (legal references): https://web.archive.org/web/20201211092330/https://openjur.de/u/2309866.html
[x] phone numbers sent to parser: https://www.kath.ch/die-insel-der-klosterzoeglinge/
[x] archive.org date interfering: http://web.archive.org/web/20210916140120/https://www.kath.ch/die-insel-der-klosterzoeglinge/
[x] wrong date found with original=False: https://web.archive.org/web/20210713181611/https://www.hsozkult.de/event/id/event-98675
[ ] wrong date: https://web.archive.org/web/20210713091918/https://www.titanic-magazin.de/news/cdu-verkleidet-eigene-angestellte-12180/
[ ] wrong date: https://web.archive.org/web/20210304072303/https://che2001.blogger.de/STORIES/2789989/
[ ] wrong date: https://web.archive.org/web/20210119052155/https://lundi.am/Alors-a-Nantes-la-police-a-jete-a-l-eau-ceux-qui-dansaient
[ ] 2017-05-10 → 2017-05-03: https://web.archive.org/web/20210921163708/https://plentylife.blogspot.com/2017/05/strong-beautiful-pamela-reif-rezension.html
[ ] no plausible date found: https://web.archive.org/web/20210921163943/https://www.stadionwelt.de/news/22764/umbaumassnahmen-einfluss-der-coronakrise-spuerbar
[x] modified date missed: https://web.archive.org/web/20210125072413/http://www.digitalwiki.de/omr-online-marketing-rockstars/

Sep 16 '21 14:09 adbar

[ ] wrong date found: https://web.archive.org/web/20220721013749/https://www.ksal.com/high-wheat-quality-expected-despite-yield-drop/ (it is missing the string "June 15, 2022" in an <h5> → <span> and instead picking "July 20, 2022" from a footer)
[ ] Russian language date missed: https://web.archive.org/web/20220721014119/https://www.inopressa.ru/article/09Mar2017/welt/deutschland.html (it is missing the Russian language date "9 марта 2017")
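
For reference, the kind of pattern needed for the Russian example can be sketched with the standard library alone. This is only an illustration and not htmldate's code: the names RU_MONTHS, RU_DATE_RE and parse_russian_date are hypothetical, and the sketch assumes the "day month-in-genitive year" layout seen in "9 марта 2017".

```python
import re
from datetime import date

# Russian month names in the genitive case, as used in written dates.
RU_MONTHS = {
    "января": 1, "февраля": 2, "марта": 3, "апреля": 4,
    "мая": 5, "июня": 6, "июля": 7, "августа": 8,
    "сентября": 9, "октября": 10, "ноября": 11, "декабря": 12,
}

# Hypothetical pattern: day, month name, four-digit year.
RU_DATE_RE = re.compile(
    r"\b(\d{1,2})\s+(" + "|".join(RU_MONTHS) + r")\s+(\d{4})\b",
    re.IGNORECASE,
)

def parse_russian_date(text):
    """Return the first 'day month year' date found in text, or None."""
    match = RU_DATE_RE.search(text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    try:
        return date(int(year), RU_MONTHS[month_name.lower()], int(day))
    except ValueError:
        # Impossible calendar date, e.g. day 31 in a 30-day month.
        return None

print(parse_russian_date("9 марта 2017"))
```

A general-purpose parser such as dateparser covers far more layouts; the sketch only shows the shape of the missing case.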

Jul 21 '22 02:07 rahulbot

The first example is especially tricky: the date in the right column is tagged as a proper date in the HTML, whereas the date in the main content isn't.

Jul 21 '22 12:07 adbar

URL: https://www.lesechos.fr/1991/01/saddam-hussein-menace-larabie-saoudite-939083 (from 1991)
Code:

from htmldate import find_date
import requests
resp = requests.get('https://www.lesechos.fr/1991/01/saddam-hussein-menace-larabie-saoudite-939083')

find_date(
  resp.content.decode(errors='ignore'),
  extensive_search=True,
  outputformat='%Y-%m-%d %H:%M:%S',
)

Result: 2022-07-26 00:00:00

But in the HTML source code there is a meta entry with the correct date:

<meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00"/>

I thought htmldate would look at this first. Am I missing something?

Aug 03 '22 13:08 kinoute

Hi @kinoute, htmldate considers that the date 1991-01-02 isn't valid. You can try to set the parameter min_date in find_date() to change this, e.g. min_date="1990-01-01".
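
To make the idea of a lower bound concrete, here is a minimal standard-library sketch of what such a check is meant to do: read the article:published_time meta tag and accept the date only if it falls within a given range. It is not htmldate's implementation; PUBLISHED_RE and published_date are hypothetical names, and the regex assumes the property attribute precedes content, as in the tag quoted above.

```python
import re
from datetime import date, datetime

# The meta tag quoted earlier in this thread.
META = ('<meta data-rh="true" property="article:published_time" '
        'content="1991-01-02T01:01:00+01:00"/>')

# Hypothetical pattern; assumes property="..." comes before content="...".
PUBLISHED_RE = re.compile(
    r'<meta[^>]*property="article:published_time"[^>]*content="([^"]+)"'
)

def published_date(html, earliest, latest):
    """Return the published date if present and within [earliest, latest]."""
    match = PUBLISHED_RE.search(html)
    if match is None:
        return None
    try:
        found = datetime.fromisoformat(match.group(1)).date()
    except ValueError:
        return None
    # The bounds play the role of a minimum and maximum accepted date.
    return found if earliest <= found <= latest else None

print(published_date(META, date(1990, 1, 1), date(2022, 12, 31)))
print(published_date(META, date(1995, 1, 1), date(2022, 12, 31)))
```

With the lower bound at 1990-01-01 the date from 1991 is accepted; with a later bound it is rejected and the function returns None.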

Aug 03 '22 15:08 adbar

@adbar It still doesn't work with your min_date

Here is the debugging without the min_date:

DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'

With min_date at "1990-01-01":

DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.extractors:custom parse test: 1991-01-02T01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 00:00:00
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.validators:date not valid: 1991-01-02
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:custom parse test: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:send to external parser: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.extractors:custom parse test: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.extractors:send to external parser: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'

Aug 03 '22 16:08 kinoute

@kinoute Thanks for pointing that out, it's a bug.

Aug 04 '22 10:08 adbar

htmldate==1.2.3 used in https://github.com/ofou/graham-essays is incorrectly extracting dates. See output. The essays have MONTH YEAR below the title but that's not being picked up. Example: http://www.paulgraham.com/greatwork.html

In a fork I tried updating to the latest version and it has the same issue.

Aug 22 '23 12:08 dideler

@dideler Thanks. The year is detected correctly, but not the month, which is contained in a <font> tag. I'll see what I can do.
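
A rough sketch of the missing case, using only the standard library: strip the tags, then look for an English month name followed by a four-digit year. This is an illustration under stated assumptions, not htmldate's logic; find_month_year is a hypothetical helper and the sample markup is made up to resemble a date inside a <font> tag.

```python
import re

MONTHS = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]

# Month name followed by a year between 1900 and 2099.
MONTH_YEAR_RE = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+((?:19|20)\d{2})\b",
    re.IGNORECASE,
)

def find_month_year(html):
    """Return (year, month) for the first 'Month YEAR' string, or None."""
    # Crude tag removal; enough for a sketch, not for arbitrary HTML.
    text = re.sub(r"<[^>]+>", " ", html)
    match = MONTH_YEAR_RE.search(text)
    if match is None:
        return None
    month = MONTHS.index(match.group(1).lower()) + 1
    return int(match.group(2)), month

# Made-up sample markup, not copied from the page in question.
sample = '<font size="2" face="verdana">July 2023<br><br>Some essay text</font>'
print(find_month_year(sample))
```

Note that "may" is also an ordinary English word, so a pattern like this can produce false positives in running text.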

Aug 30 '23 15:08 adbar

Thank you for this wonderful tool! It would be great to see this news source added.

Capacity Media e.g. https://www.capacitymedia.com/article/2b9xz33jdzmp2r77gr280/news/tizeti-expansion-to-boost-nigerian-connectivity

    <div class="ArticlePage-datePublished">
            February 13, 2023 11:42 AM
    </div>

Nov 21 '23 16:11 stevesong

@stevesong It already works:

$ htmldate -u "https://web.archive.org/web/20240111084001/https://www.capacitymedia.com/article/2b9xz33jdzmp2r77gr280/news/tizeti-expansion-to-boost-nigerian-connectivity"
2023-02-13

Jan 16 '24 17:01 adbar

Wow, thanks! I must have been using an older version. Does passing URLs through archive.org have a normalising effect on some websites? htmldate seems to work on the archive.org versions but not on the originals.

Jan 16 '24 17:01 stevesong

It's not supposed to normalize anything, I'm just using archived versions to be able to replicate issues at some point in the future.

Jan 17 '24 14:01 adbar

Ok, understood, but there does appear to be something interesting happening there.

$ htmldate -u https://www.datacenterdynamics.com/en/news/africa-data-centres-breaks-ground-on-nairobi-expansion-in-kenya/
# ERROR no valid result for url: https://www.datacenterdynamics.com/en/news/africa-data-centres-breaks-ground-on-nairobi-expansion-in-kenya/

$ htmldate -u https://web.archive.org/web/https://www.datacenterdynamics.com/en/news/africa-data-centres-breaks-ground-on-nairobi-expansion-in-kenya/
2023-01-20

Jan 17 '24 15:01 stevesong

I guess it's because the download fails: some websites restrict access for the download utility, see https://trafilatura.readthedocs.io/en/latest/troubleshooting.html
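
One way to check whether the download is the problem is to fetch the page separately and hand the HTML string to find_date(), as in the requests example earlier in this thread. The sketch below uses only the standard library; the user-agent string and the helper names build_request and fetch_html are hypothetical, and whether a given site accepts such a request is an assumption that has to be tested per site.

```python
from urllib.request import Request, urlopen

# Hypothetical user-agent string; some sites reject default client identifiers.
USER_AGENT = "Mozilla/5.0 (compatible; example-fetcher/0.1)"

def build_request(url):
    """Prepare a GET request carrying an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})

def fetch_html(url, timeout=30):
    """Download a page and return its body as text (raises on HTTP errors)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

If fetch_html() raises an error or returns a block page, the missing date is a download issue rather than an extraction issue.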

Jan 17 '24 16:01 adbar
