
Fast and robust date extraction from web pages, with Python or on the command-line

htmldate: find the publication date of web pages



Find original and updated publication dates of any web page. From the command-line or within Python, all the steps needed from web page download to HTML parsing, scraping, and text analysis are included.

In a nutshell


.. image:: docs/htmldate-demo.gif
    :alt: Demo as GIF image
    :align: center
    :width: 80%


With Python:

.. code-block:: python

    >>> from htmldate import find_date
    >>> find_date('')

On the command-line:

.. code-block:: bash

    $ htmldate -u


  • Multilingual, robust and efficient (used in production on millions of documents)
  • URLs, HTML files, or HTML trees are given as input (includes batch processing)
  • Output as string in any date format (defaults to ISO 8601 YMD)
  • Detection of both original and updated dates
  • Compatible with all recent versions of Python

htmldate can examine markup and text. It provides the following ways to date an HTML document:

  1. Markup in header: Common patterns are used to identify relevant elements (e.g. ``link`` and ``meta`` elements), including Open Graph protocol attributes
  2. HTML code: The whole document is searched for structural markers: ``abbr`` or ``time`` elements and a series of attributes (e.g. ``postmetadata``)
  3. Bare HTML content: Heuristics are run on text and markup:

     • in fast mode the HTML page is cleaned and precise patterns are targeted
     • in extensive mode all potential dates are collected and a disambiguation algorithm determines the best one
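The three strategies can be illustrated with a minimal standard-library sketch. It is not htmldate's implementation: the class ``DateCandidateParser``, the function ``guess_date``, the set ``DATE_META_KEYS`` and the single ISO-like pattern are assumptions chosen for illustration, and the frequency vote is only a stand-in for the actual disambiguation algorithm.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, non-exhaustive set of header keys that often carry a date.
DATE_META_KEYS = {"article:published_time", "og:published_time", "date", "dc.date"}
# ISO-like dates (YYYY-MM-DD), also matched inside longer timestamps.
ISO_DATE = re.compile(r"(?<!\d)\d{4}-\d{2}-\d{2}(?!\d)")


class DateCandidateParser(HTMLParser):
    """Collect date candidates from meta elements, time elements and text."""

    def __init__(self):
        super().__init__()
        self.header_dates = []  # strategy 1: markup in header
        self.markup_dates = []  # strategy 2: structural markers
        self.text_dates = []    # strategy 3: bare content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = (attrs.get("property") or attrs.get("name") or "").lower()
            if key in DATE_META_KEYS:
                self._collect(attrs.get("content"), self.header_dates)
        elif tag == "time":
            self._collect(attrs.get("datetime"), self.markup_dates)

    def handle_data(self, data):
        self._collect(data, self.text_dates)

    @staticmethod
    def _collect(value, target):
        # Append every ISO-like date found in the given string (if any).
        target.extend(ISO_DATE.findall(value or ""))


def guess_date(html):
    """Return the first header or markup date, else the most frequent text date."""
    parser = DateCandidateParser()
    parser.feed(html)
    for candidates in (parser.header_dates, parser.markup_dates):
        if candidates:
            return candidates[0]
    if parser.text_dates:
        # Naive disambiguation: the date mentioned most often wins.
        return Counter(parser.text_dates).most_common(1)[0][0]
    return None
```

The ordering mirrors the list above: explicit metadata is trusted first, structural markers second, and free text is only consulted as a last resort.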

Finally, the output is validated and converted to the chosen format.
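A possible shape for this last step, using only the standard library, is sketched below. The function name ``validate_and_format``, its parameters and the lower bound of 1995 are assumptions made for the example, not the package's documented behaviour.

```python
from datetime import datetime


def validate_and_format(candidate, outputformat="%Y-%m-%d",
                        min_date=datetime(1995, 1, 1), max_date=None):
    """Return the candidate in the chosen format, or None if it is implausible."""
    # Assumption: a date later than "now" cannot be a publication date.
    max_date = max_date or datetime.now()
    try:
        parsed = datetime.strptime(candidate, "%Y-%m-%d")
    except (ValueError, TypeError):
        # Not a real calendar date (e.g. month 13) or not a string at all.
        return None
    if not min_date <= parsed <= max_date:
        return None
    return parsed.strftime(outputformat)
```

For instance, ``validate_and_format("2017-09-01", outputformat="%d/%m/%Y")`` gives ``"01/09/2017"``, while an impossible date such as ``"2017-13-01"`` is rejected.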


=============================== ========= ========= ========= ========= =======
500 web pages containing identifiable dates (as of 2022-03-23 on Python 3.8)
-------------------------------------------------------------------------------
Python Package                  Precision Recall    Accuracy  F-Score   Time
=============================== ========= ========= ========= ========= =======
articleDateExtractor 0.20       0.769     0.691     0.572     0.728     4.4x
date_guesser 2.1.4              0.738     0.544     0.456     0.626     17x
goose3 3.1.11                   0.821     0.453     0.412     0.584     15x
htmldate[all] 1.2.1 (fast)      0.848     0.921     0.790     0.883     1x
htmldate[all] 1.2.1 (extensive) 0.839     0.990     0.832     0.908     2.3x
newspaper3k 0.2.8               0.729     0.630     0.510     0.675     12x
news-please 1.5.21              0.769     0.691     0.572     0.728     40x
=============================== ========= ========= ========= ========= =======

For complete results and explanations, see the evaluation page.
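As a reading aid for the table, the F-Score column can be recomputed from precision and recall. The sketch below assumes the standard F1 definition (harmonic mean of precision and recall); the helper ``f_score`` is illustrative and is not part of the evaluation code. Because the published values are rounded, other rows may differ in the last digit.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-Score) taken from the two htmldate rows
reported = {
    "htmldate[all] 1.2.1 (fast)": (0.848, 0.921, 0.883),
    "htmldate[all] 1.2.1 (extensive)": (0.839, 0.990, 0.908),
}

for name, (precision, recall, published) in reported.items():
    computed = f_score(precision, recall)
    print(f"{name}: computed {computed:.3f}, reported {published:.3f}")
# htmldate[all] 1.2.1 (fast): computed 0.883, reported 0.883
# htmldate[all] 1.2.1 (extensive): computed 0.908, reported 0.908
```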


This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.6 upwards. It is available on the package repository PyPI and can be installed with pip (``pip3`` where applicable): ``pip install htmldate``, and optionally ``pip install htmldate[speed]``.
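To confirm that an environment meets these requirements, a short check along the following lines can be run after installation. The helper ``check_environment`` is a hypothetical name used for this sketch; it relies only on the standard library.

```python
import importlib.util
import sys

# Minimum interpreter version stated above.
MIN_PYTHON = (3, 6)


def check_environment(package="htmldate"):
    """Return (python_ok, installed) for the running interpreter."""
    python_ok = sys.version_info[:2] >= MIN_PYTHON
    # find_spec returns None when no importable top-level package has this name.
    installed = importlib.util.find_spec(package) is not None
    return python_ok, installed


if __name__ == "__main__":
    python_ok, installed = check_environment()
    print(f"Python version supported: {python_ok}; htmldate importable: {installed}")
```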


For more details on installation and on Python and CLI usage, please refer to the documentation.


htmldate is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm's length, multi-licensing with compatible licenses, or contacting me.

See also "GPL and free software licensing: What's in it for business?"


This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. There are web pages for which neither the URL nor the server response provides a reliable way to find out when a document was published or modified. For more information:

• JOSS article reference DOI: 10.21105/joss.02439
• Zenodo archive DOI: 10.5281/zenodo.3459599

.. code-block:: shell

  title = {{htmldate: A Python package to extract publication dates from web pages}},
  author = "Barbaresi, Adrien",
  journal = "Journal of Open Source Software",
  volume = 5,
  number = 51,
  pages = 2439,
  url = {},
  publisher = {The Open Journal},
  year = 2020,

• Barbaresi, A. "htmldate: A Python package to extract publication dates from web pages", Journal of Open Source Software, 5(51), 2439, 2020. DOI: 10.21105/joss.02439
• Barbaresi, A. "Generic Web Content Extraction with Open-Source Software", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
• Barbaresi, A. "Efficient construction of metadata-enhanced web corpora", Proceedings of the 10th Web as Corpus Workshop (WAC-X), 2016.

You can contact me via my contact page or GitHub.


Contributions are welcome!

Feel free to file issues on the dedicated page. Thanks to the contributors who submitted features and bugfixes!

Kudos to the following software libraries:

  • lxml, dateparser
  • A few patterns are derived from the python-goose, metascraper, newspaper and articleDateExtractor libraries. This module extends their coverage and robustness significantly.