Adrien Barbaresi
I have mostly tested `htmldate` on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web...
By default, dates before 1995 are considered implausible; however, changing the minimum date does not fix the issue. CLI: `htmldate -u "https://web.archive.org/web/20201205182452/https://www.lesechos.fr/1991/01/saddam-hussein-menace-larabie-saoudite-939083" -vv -min "1990-01-01"` Python: Here is the debugging...
Configuration arguments are available for the Python functions; it would be nice to make them available as command-line arguments as well:

- outputformat
A short version of the documentation is available straight from GitHub ([README.rst](https://github.com/adbar/htmldate/blob/master/README.rst)), while a more exhaustive one is present in the `docs` folder and online at [htmldate.readthedocs.io](https://htmldate.readthedocs.io). Several problems could...
E.g. im → "in dem" / in+dem / ? Taggers and treebanks mostly treat these tokens as a single unit (as in `in+dem`) so as not to insert spaces.
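To make the two options concrete, here is a minimal sketch of a lookup-based expansion. The table `CONTRACTIONS`, the function `expand` and its `joined` flag are hypothetical names chosen for illustration; they are not part of any existing library, and the table only lists a few German contractions.

```python
from typing import List

# Hypothetical lookup table with a few German preposition + article contractions.
CONTRACTIONS = {
    "im": ("in", "dem"),
    "zum": ("zu", "dem"),
    "zur": ("zu", "der"),
    "vom": ("von", "dem"),
}


def expand(token: str, joined: bool = True) -> List[str]:
    """Return the components of a known contraction, or the token unchanged."""
    parts = CONTRACTIONS.get(token.lower())
    if parts is None:
        return [token]
    # joined=True keeps a single token ("in+dem"), as taggers mostly do;
    # joined=False returns separate tokens ("in", "dem").
    return ["+".join(parts)] if joined else list(parts)


print(expand("im"))                # ['in+dem']
print(expand("im", joined=False))  # ['in', 'dem']
print(expand("Haus"))              # ['Haus']
```

The joined form keeps the token count identical to the input, which is presumably why it aligns more easily with tagger and treebank output.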
- [x] http://unimorph.ethz.ch/languages
- [x] https://github.com/lenakmeth/Wikinflection-Corpus
- [x] https://github.com/TALP-UPC/FreeLing/tree/master/data/
- [x] https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data
- [x] https://github.com/tatuylonen/wiktextract
- [ ] http://pauillac.inria.fr/~sagot/index.html#udlexicons
Hi @felipehertzer, you have contributed significant parts of the code in `json_metadata.py`. As you know, they are error-prone because certain web pages don't adhere to standards. We can't do much...
See discussions in https://github.com/adbar/htmldate/issues/56 and https://github.com/adbar/htmldate/issues/57.
The `target_language` parameter can filter documents according to their language, but it is also possible to pass this information along, based on the HTML meta and text language detectors.
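As a rough illustration of the HTML-meta side only, the sketch below reads the `lang` attribute of the `<html>` element with the standard library and either filters on it or simply passes it along. The names `LangSniffer`, `declared_language` and `keep_document` are hypothetical, and this is not the library's actual implementation; a text-based language detector is left out entirely.

```python
from html.parser import HTMLParser
from typing import Optional


class LangSniffer(HTMLParser):
    """Collect the `lang` attribute of the first <html> element, if any."""

    def __init__(self) -> None:
        super().__init__()
        self.lang: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            value = dict(attrs).get("lang")
            if value:
                # Keep the primary subtag only: "de-AT" -> "de".
                self.lang = value.split("-")[0].lower()


def declared_language(html: str) -> Optional[str]:
    """Return the language declared in the markup, to be passed along."""
    sniffer = LangSniffer()
    sniffer.feed(html)
    return sniffer.lang


def keep_document(html: str, target_language: Optional[str] = None) -> bool:
    """Hypothetical filter: drop documents that declare another language."""
    if target_language is None:
        return True
    lang = declared_language(html)
    # Assumption of this sketch: documents without a declaration are kept.
    return lang is None or lang == target_language
```

Keeping undeclared documents is a design choice of this sketch; a stricter filter could reject them or fall back on a text-based detector.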