Adrien Barbaresi

Results 99 issues of Adrien Barbaresi

I have mostly tested `htmldate` on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web...

good first issue
up for grabs

By default, dates before 1995 are considered implausible; however, changing the minimum date does not fix the issue. CLI: `htmldate -u "https://web.archive.org/web/20201205182452/https://www.lesechos.fr/1991/01/saddam-hussein-menace-larabie-saoudite-939083" -vv -min "1990-01-01"` Python: Here is the debugging...

bug
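
For readers unfamiliar with the report above, the behavior it expects can be sketched in a few lines. This is a minimal illustration and not `htmldate`'s actual code: `is_plausible` and `DEFAULT_MIN_DATE` are hypothetical names, and only the 1995 default comes from the issue text.

```python
from datetime import date
from typing import Optional

# Default lower bound mentioned in the issue; the constant name is hypothetical.
DEFAULT_MIN_DATE = date(1995, 1, 1)


def is_plausible(
    candidate: date,
    min_date: Optional[date] = None,
    max_date: Optional[date] = None,
) -> bool:
    """Return True if the candidate lies within the accepted date range."""
    lower = min_date if min_date is not None else DEFAULT_MIN_DATE
    upper = max_date if max_date is not None else date.today()
    return lower <= candidate <= upper


article_date = date(1991, 1, 15)
# Rejected under the default bound, accepted once the bound is lowered.
rejected_by_default = is_plausible(article_date)
accepted_with_override = is_plausible(article_date, min_date=date(1990, 1, 1))
```

Under this reading, a 1991 article should be accepted once the minimum date is set to 1990, which is the behavior the report says is missing.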

Configuration arguments are available for Python functions; it would be nice to make them available as command-line arguments as well:
- outputformat

enhancement

A short version of the documentation is available straight from GitHub ([README.rst](https://github.com/adbar/htmldate/blob/master/README.rst)), while a more exhaustive one is present in the `docs` folder and online on [htmldate.readthedocs.io](https://htmldate.readthedocs.io). Several problems could...

good first issue
up for grabs

e.g. im → "in dem" / in+dem / ? Taggers and treebanks mostly treat these tokens as one (as in `in+dem`) so as not to insert spaces.

question

- [x] http://unimorph.ethz.ch/languages
- [x] https://github.com/lenakmeth/Wikinflection-Corpus
- [x] https://github.com/TALP-UPC/FreeLing/tree/master/data/
- [x] https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data
- [x] https://github.com/tatuylonen/wiktextract
- [ ] http://pauillac.inria.fr/~sagot/index.html#udlexicons

enhancement

Hi @felipehertzer, you have contributed significant parts of the code in `json_metadata.py`. As you know, they are error-prone because certain web pages don't adhere to standards. We can't do much...

feedback

See discussions in https://github.com/adbar/htmldate/issues/56 and https://github.com/adbar/htmldate/issues/57.

documentation

The `target_language` parameter can filter documents according to their language, but it would also be possible to pass the detected language along with the output, based on the HTML meta and text language detectors.

enhancement
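
The difference between filtering on a language and passing it along can be sketched as follows. This is only an illustration under stated assumptions: `apply_language_policy`, the dictionary layout, and the `language` key are hypothetical and do not reflect the library's actual code; only the `target_language` parameter name comes from the issue.

```python
from typing import Optional


def apply_language_policy(
    doc: dict,
    detected: Optional[str],
    target_language: Optional[str] = None,
) -> Optional[dict]:
    """Filter on the target language if one is given, else annotate only."""
    # Filtering: discard documents whose detected language does not match.
    if target_language is not None and detected != target_language:
        return None
    # Passing along: keep the document and record what was detected.
    annotated = dict(doc)  # copy so the input is not modified
    annotated["language"] = detected
    return annotated


page = {"title": "Beispiel", "text": "Ein kurzer Text."}
kept = apply_language_policy(page, detected="de")
dropped = apply_language_policy(page, detected="de", target_language="fr")
```

In this sketch the detected language is recorded even when no `target_language` is set, so downstream code can make its own filtering decision.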