htmldate
Test htmldate on further web pages and report bugs
I have mostly tested `htmldate` on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction of a date doesn't work so far.
Please install the `dateparser` library beforehand, as it significantly extends linguistic coverage: `pip install -U dateparser` (or `pip3 install -U dateparser`), or alternatively `pip install -U htmldate[all]`.
Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in `core.py` (see `DATE_EXPRESSIONS` and `ADDITIONAL_EXPRESSIONS`).
Thanks!
Problems found:
- [x] date present in title, not easily extractable otherwise: https://web.archive.org/web/20210909014928/https://www.dw.com/de/was-vor-corona-sch%C3%BCtzt-wird-f%C3%BCr-die-umwelt-ein-problem/a-53217831
- [x] useless data sent to external parser (legal references): https://web.archive.org/web/20201211092330/https://openjur.de/u/2309866.html
- [x] phone numbers sent to parser: https://www.kath.ch/die-insel-der-klosterzoeglinge/
- [x] archive.org date interfering: http://web.archive.org/web/20210916140120/https://www.kath.ch/die-insel-der-klosterzoeglinge/
- [x] wrong date found with `original=False`: https://web.archive.org/web/20210713181611/https://www.hsozkult.de/event/id/event-98675
- [ ] wrong date: https://web.archive.org/web/20210713091918/https://www.titanic-magazin.de/news/cdu-verkleidet-eigene-angestellte-12180/
- [ ] wrong date: https://web.archive.org/web/20210304072303/https://che2001.blogger.de/STORIES/2789989/
- [ ] wrong date: https://web.archive.org/web/20210119052155/https://lundi.am/Alors-a-Nantes-la-police-a-jete-a-l-eau-ceux-qui-dansaient
- [ ] 2017-05-10 → 2017-05-03: https://web.archive.org/web/20210921163708/https://plentylife.blogspot.com/2017/05/strong-beautiful-pamela-reif-rezension.html
- [ ] no plausible date found: https://web.archive.org/web/20210921163943/https://www.stadionwelt.de/news/22764/umbaumassnahmen-einfluss-der-coronakrise-spuerbar
- [x] modified date missed: https://web.archive.org/web/20210125072413/http://www.digitalwiki.de/omr-online-marketing-rockstars/
- [ ] wrong date found: https://web.archive.org/web/20220721013749/https://www.ksal.com/high-wheat-quality-expected-despite-yield-drop/ (it is missing the string "June 15, 2022" in an `<h5> → <span>` and instead picking "July 20, 2022" from a footer)
- [ ] Russian language date missed: https://web.archive.org/web/20220721014119/https://www.inopressa.ru/article/09Mar2017/welt/deutschland.html (it is missing the Russian language date "9 марта 2017")
The first example is especially tricky: the date in the right column is tagged as a proper date in the HTML, whereas the date in the main content isn't.
- URL: https://www.lesechos.fr/1991/01/saddam-hussein-menace-larabie-saoudite-939083 (from 1991)
- Code:

from htmldate import find_date
import requests

resp = requests.get('https://www.lesechos.fr/1991/01/saddam-hussein-menace-larabie-saoudite-939083')
find_date(
    resp.content.decode(errors='ignore'),
    extensive_search=True,
    outputformat='%Y-%m-%d %H:%M:%S',
)

- Result: 2022-07-26 00:00:00
But in the HTML source code there is a `meta` entry with the correct date:
<meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00"/>
I thought `htmldate` would look at this first, or am I missing something?
Hi @kinoute, `htmldate` considers that the date `1991-01-02` isn't valid. You can try setting the parameter `min_date` in `find_date()` to change this, e.g. `min_date="1990-01-01"`.
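To illustrate why a date from 1991 can be rejected by a lower bound, here is a minimal sketch of such a check. It is not htmldate's actual validator: the function name `is_plausible` and the default lower bound of 1995-01-01 are assumptions made for illustration only.

```python
from datetime import date, datetime
from typing import Optional

# Assumed default lower bound, for illustration only
DEFAULT_MIN_DATE = date(1995, 1, 1)


def is_plausible(candidate: date, min_date: date = DEFAULT_MIN_DATE,
                 max_date: Optional[date] = None) -> bool:
    """Return True if candidate lies within [min_date, max_date]."""
    upper = max_date or date.today()
    return min_date <= candidate <= upper


# Value of the article:published_time meta tag quoted earlier in this thread
published = datetime.fromisoformat("1991-01-02T01:01:00+01:00").date()

print(is_plausible(published))                             # False
print(is_plausible(published, min_date=date(1990, 1, 1)))  # True
```

With a bound like this, lowering `min_date` below the article date is what lets the meta tag value through.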
@adbar It still doesn't work with your `min_date`. Here is the debugging output without `min_date`:
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'
With `min_date` set to "1990-01-01":
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.extractors:custom parse test: 1991-01-02T01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 00:00:00
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.validators:date not valid: 1991-01-02
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:custom parse test: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:send to external parser: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.extractors:custom parse test: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.extractors:send to external parser: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'
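For anyone who wants to reproduce this kind of output: the lines above follow the default format of Python's `logging` module (`LEVEL:logger:message`), so a setup along these lines should produce similar logs. This is a sketch, not an official recipe; the helper name `debug_find_date` is hypothetical, while `find_date`, `extensive_search` and `min_date` are used as they appear in this thread.

```python
import logging

# Install a root handler using logging's default "LEVEL:name:message" format
logging.basicConfig(level=logging.DEBUG)
# Make sure htmldate's loggers (htmldate.core, htmldate.validators, ...) emit DEBUG records
logging.getLogger("htmldate").setLevel(logging.DEBUG)


def debug_find_date(html: str, min_date: str = "1990-01-01"):
    """Hypothetical helper: run find_date with debug logging switched on."""
    from htmldate import find_date  # imported lazily; requires htmldate to be installed
    return find_date(html, extensive_search=True, min_date=min_date)
```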
@kinoute Thanks for pointing that out, it's a bug.
htmldate==1.2.3 used in https://github.com/ofou/graham-essays is incorrectly extracting dates. See output. The essays have MONTH YEAR below the title but that's not being picked up. Example: http://www.paulgraham.com/greatwork.html
In a fork I tried updating to the latest version and it has the same issue.
@dideler Thanks, the year is detected correctly but not the month, which is contained in a `<font>` tag. I'll see what I can do.
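As a rough idea of what catching a "MONTH YEAR" string could look like, here is a minimal regex-based sketch. It is not how htmldate works internally: the helper `find_month_year` is hypothetical, it only knows English month names, and the sample markup is illustrative rather than copied from the page.

```python
import re
from datetime import date
from typing import Optional

MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Matches e.g. "July 2023"; word boundaries avoid partial matches
MONTH_YEAR = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+((?:19|20)[0-9]{2})\b")


def find_month_year(text: str) -> Optional[date]:
    """Return the first 'Month YYYY' match as a date set to the 1st of the month."""
    match = MONTH_YEAR.search(text)
    if match is None:
        return None
    return date(int(match.group(2)), MONTHS[match.group(1)], 1)


# Illustrative markup only, not copied from the page
sample = '<font face="verdana" size="2">July 2023<br><br>Some essay text</font>'
print(find_month_year(sample))  # 2023-07-01
```

Since no day is given, the sketch falls back to the first of the month; a real fix would have to decide how to report that missing precision.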
Thank you for this wonderful tool! It would be great to see this news source added.
Capacity Media e.g. https://www.capacitymedia.com/article/2b9xz33jdzmp2r77gr280/news/tizeti-expansion-to-boost-nigerian-connectivity
<div class="ArticlePage-datePublished">
February 13, 2023 11:42 AM
</div>
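To make the request concrete, here is a small sketch of the kind of XPath expression that would target this element, followed by parsing of its text. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath; the expression and the format string are my own assumptions, not entries from htmldate's `DATE_EXPRESSIONS`.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# The snippet quoted above, wrapped in a root element so it parses as XML
snippet = """
<body>
  <div class="ArticlePage-datePublished">
    February 13, 2023 11:42 AM
  </div>
</body>
"""

root = ET.fromstring(snippet.strip())
# ElementTree supports only a subset of XPath; lxml would accept richer expressions
node = root.find(".//div[@class='ArticlePage-datePublished']")
raw = node.text.strip()

# %B and %p depend on the locale; this assumes English month names and AM/PM markers
parsed = datetime.strptime(raw, "%B %d, %Y %I:%M %p")
print(parsed.date().isoformat())  # 2023-02-13
```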
@stevesong It already works:
$ htmldate -u "https://web.archive.org/web/20240111084001/https://www.capacitymedia.com/article/2b9xz33jdzmp2r77gr280/news/tizeti-expansion-to-boost-nigerian-connectivity"
2023-02-13
Wow, thanks! I must have been using an older version. Passing URLs through archive.org appears to have a normalising effect on some websites, in that htmldate works on the archive.org versions but not on the originals?
It's not supposed to normalize anything; I'm just using archived versions to be able to replicate issues at some point in the future.
Ok, understood, but there does appear to be something interesting happening there.
$ htmldate -u https://www.datacenterdynamics.com/en/news/africa-data-centres-breaks-ground-on-nairobi-expansion-in-kenya/
# ERROR no valid result for url: https://www.datacenterdynamics.com/en/news/africa-data-centres-breaks-ground-on-nairobi-expansion-in-kenya/
$ htmldate -u https://web.archive.org/web/https://www.datacenterdynamics.com/en/news/africa-data-centres-breaks-ground-on-nairobi-expansion-in-kenya/
2023-01-20
I guess it's because the download fails: some websites restrict access to the download utility, see https://trafilatura.readthedocs.io/en/latest/troubleshooting.html
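If the built-in download is what fails, one possible workaround is to fetch the page yourself and pass the HTML string to `find_date`, as was done earlier in this thread with `requests`. The sketch below uses only the standard library; the User-Agent string and the helper names are hypothetical, and a site may still refuse the request.

```python
import urllib.request

# Hypothetical User-Agent string; some sites may still refuse the request
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url: str) -> urllib.request.Request:
    """Prepare a GET request that identifies itself with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download a page and decode it leniently, as in the snippet earlier in this thread."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="ignore")


def date_from_url(url: str):
    """Fetch the page manually, then hand the HTML string to htmldate."""
    from htmldate import find_date  # requires htmldate to be installed
    return find_date(fetch_html(url), url=url)
```

Separating the download from the extraction also makes it easier to tell which of the two steps is failing.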