Adrien Barbaresi

412 comments by Adrien Barbaresi

@sarahyurick see the following line in `settings.cfg` and the [documentation page on settings](https://trafilatura.readthedocs.io/en/latest/settings.html): `MAX_REPETITIONS = 2`
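For readers unfamiliar with this kind of settings file, here is a minimal sketch of reading such a value with Python's standard `configparser`. The `[DEFAULT]` section name, the `SETTINGS_TEXT` excerpt, and the `read_max_repetitions` helper are assumptions made for illustration; only the `MAX_REPETITIONS = 2` line is taken from the comment above, and this is not necessarily how the library loads its configuration.

```python
import configparser

# Hypothetical excerpt of a settings file; only the MAX_REPETITIONS line
# is quoted from the comment above, the section name is an assumption.
SETTINGS_TEXT = """
[DEFAULT]
MAX_REPETITIONS = 2
"""


def read_max_repetitions(text: str) -> int:
    """Parse INI-style settings text and return MAX_REPETITIONS as an int."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # configparser lowercases option names, so the lookup is case-insensitive.
    return parser["DEFAULT"].getint("MAX_REPETITIONS")


if __name__ == "__main__":
    print(read_max_repetitions(SETTINGS_TEXT))
```

Changing the number in a copy of the file and loading that copy is the usual way to adjust such a threshold; see the linked documentation page for the mechanism the library actually supports.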

That's correct, this is a bug indeed.

Hi @CNXDZS, I cannot reproduce the bug. Changing parameters like `no_fallback` affects the output a bit, but there is text.

Hi @dantetemplar, thanks for the link, the package is indeed interesting. They evaluate the packages on 200-300 webpages and forums in Chinese. This is a pretty specific use case and...

I had to tweak the evaluation script in the magic_html repository because it doesn't work as is. In the end, magic_html is not better than other alternatives, the PR above...

@mcflem06 This is an interesting feature, but could you please make sure the tests pass?

Hi @BramVanroy, thanks for the detailed report. The deduplication component works with a Least Recently Used (LRU) cache, so its behavior depends on document order. It would not be thread-safe...
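To illustrate why an LRU-backed deduplication step is order-dependent, here is a minimal sketch. It is not the library's implementation: the `LRUDeduplicator` class, the `filter_segments` helper, and the `maxsize` and `max_repetitions` parameters are hypothetical names chosen for this example.

```python
from collections import OrderedDict


class LRUDeduplicator:
    """Toy deduplicator: counts segments in a bounded LRU cache (illustrative only)."""

    def __init__(self, maxsize, max_repetitions=1):
        self.maxsize = maxsize
        self.max_repetitions = max_repetitions
        self._counts = OrderedDict()

    def is_duplicate(self, segment):
        # Remove and re-insert so the segment becomes the most recently used.
        count = self._counts.pop(segment, 0) + 1
        self._counts[segment] = count
        if len(self._counts) > self.maxsize:
            # Evict the least recently used entry; its count is forgotten.
            self._counts.popitem(last=False)
        return count > self.max_repetitions


def filter_segments(segments, maxsize, max_repetitions=1):
    """Keep only the segments that the toy deduplicator does not flag."""
    dedup = LRUDeduplicator(maxsize, max_repetitions)
    return [s for s in segments if not dedup.is_duplicate(s)]


if __name__ == "__main__":
    # Same multiset of segments, different order, different result.
    print(filter_segments(["a", "a", "b", "c"], maxsize=2))
    print(filter_segments(["a", "b", "c", "a"], maxsize=2))
```

With a cache of size 2, the first call drops the repeated `"a"` because it is still cached, while the second call keeps it because `"a"` was evicted before it reappeared. The same effect means that results can differ when documents are processed in a different order.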

These are just German examples, so the affix search works. I wouldn't add it for other languages, or maybe I don't understand the question? Depending on the language the UD...

I see! There are already rules for German. I guess affixes are not included because it would harm precision, but I'll check again.