Adrien Barbaresi
@sarahyurick see the following line in `settings.cfg` and the [documentation page on settings](https://trafilatura.readthedocs.io/en/latest/settings.html): `MAX_REPETITIONS = 2`
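As a rough illustration of how such a value can be read, here is a minimal sketch using Python's standard `configparser`. It assumes the settings file follows an INI layout with a `[DEFAULT]` section; the inline string stands in for the real `settings.cfg` and is not a copy of it.

```python
import configparser

# Stand-in for settings.cfg; the [DEFAULT] section name is an assumption.
SETTINGS_TEXT = """
[DEFAULT]
MAX_REPETITIONS = 2
"""

config = configparser.ConfigParser()
config.read_string(SETTINGS_TEXT)

# getint() converts the raw string value to an integer.
max_repetitions = config.getint("DEFAULT", "MAX_REPETITIONS")
print(max_repetitions)
```

Editing the value in a copy of the file and loading that copy is the usual way to change it; the documentation page linked above covers the details.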
That's correct, this is a bug indeed.
Hi @CNXDZS, I cannot reproduce the bug. Changing parameters like `no_fallback` affects the output a bit, but there is text.
Hi @dantetemplar, thanks for the link, the package is interesting indeed. They evaluate the packages on 200-300 webpages and forums in Chinese. This is a pretty specific use case and...
I had to tweak the evaluation script in the magic_html repository because it doesn't work as is. In the end, magic_html is not better than other alternatives, the PR above...
@mcflem06 This is an interesting feature, but could you please make sure the tests pass?
@mcflem06 Are you still working on it?
Hi @BramVanroy, thanks for the detailed report. The deduplication component works with a Least Recently Used cache (LRU), so its behavior depends on document order. It would not be thread-safe...
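To make the order dependence concrete, here is a minimal sketch of an LRU-based duplicate check. It is not the library's actual code: `LRUCounter`, `is_duplicate`, `count_flagged`, the cache size, and the threshold of 2 are all hypothetical names and values chosen for illustration.

```python
from collections import OrderedDict


class LRUCounter:
    """Tiny LRU cache mapping a text segment to how often it was seen."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.counts = OrderedDict()

    def seen(self, segment):
        """Record one occurrence and return the updated count."""
        count = self.counts.pop(segment, 0) + 1
        self.counts[segment] = count  # re-insert as most recently used
        if len(self.counts) > self.maxsize:
            self.counts.popitem(last=False)  # evict the least recently used
        return count


def is_duplicate(cache, segment, max_repetitions=2):
    """Flag a segment once it has been seen more than max_repetitions times."""
    return cache.seen(segment) > max_repetitions


def count_flagged(segments, maxsize):
    """Run the check over a sequence and count the flagged segments."""
    cache = LRUCounter(maxsize)
    return sum(is_duplicate(cache, s) for s in segments)


# Same segments, different order, different outcome with a small cache.
print(count_flagged(["a", "a", "a", "b", "c"], maxsize=2))
print(count_flagged(["a", "b", "c", "a", "a"], maxsize=2))
```

In the second ordering the count for `"a"` is evicted before it repeats, so its later occurrences start from zero again. The same effect explains why results can differ between runs that feed documents in a different order, and why sharing one such cache across threads without locking would be unsafe.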
These are just German examples so the affix search works. I wouldn't add it for other languages, or maybe I don't understand the question? Depending on the language the UD...
I see! There are already rules for German. I guess affixes are not included because it would harm precision, but I'll check again.