Adrien Barbaresi
Thanks, now the tests pass. I have suggested a series of minor changes; once they are implemented, the PR can be merged.
Additional notes: - The regular expressions used here are slightly different from the legacy ones at the top of the file, probably because they're newer? It would be nice to...
I can take care of the docs before the next release and you can improve on that later if you want. As you say, the readability_lxml module is out of...
The task is complex and the focused crawler integrated in Trafilatura does not solve all problems. I cannot answer this question in general. Do you have a precise example for...
If you set the logging level to debug, you'll see that the download fails (HTTP 403 error), so there are no links to extract.
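For reference, a minimal sketch of how debug logging can be switched on with the standard `logging` module before calling the library. The logger name `trafilatura` is an assumption here (it presumes the package names its loggers after its modules); the root logger configuration applies either way.

```python
import logging
import sys

# Send every record of level DEBUG and above to stderr.
# force=True replaces handlers that another tool may already have installed.
logging.basicConfig(stream=sys.stderr, level=logging.DEBUG, force=True)

# Assumed logger name: child loggers inherit the root level unless they
# set their own, so the library's debug messages become visible.
library_logger = logging.getLogger("trafilatura")
print(library_logger.getEffectiveLevel() == logging.DEBUG)
```

With this in place, a failed download should show up in the log output together with its status code instead of silently producing an empty result.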
You have to use a more complex download utility to make sure you get the full content, then you can use Trafilatura on the HTML.
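A rough sketch of that two-step approach: download the page yourself, then pass the HTML string to Trafilatura. It assumes the block is related to request headers, which may not be the case; pages that need JavaScript rendering would call for a headless browser instead. The helper names and the User-Agent value are hypothetical.

```python
import urllib.request

# Hypothetical header value; adjust to whatever the target site expects.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0"


def build_request(url: str) -> urllib.request.Request:
    """Prepare a GET request carrying a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download a page and decode it; urlopen raises HTTPError on 4xx/5xx."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_main_text(html: str):
    """Hand already-downloaded HTML to Trafilatura; None if nothing is found."""
    import trafilatura  # imported lazily so the download helpers work without it

    return trafilatura.extract(html)
```

Usage would then look like `extract_main_text(fetch_html(url))`, keeping the download step replaceable by any other tool that returns the full HTML.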
The PR moves `is_known()` out of the Lemmatizer class and removes the greedy argument, all good!
Great, thanks.
You can add XPath expressions to remove elements, but you cannot explicitly add elements; that could be a useful improvement. As for Reddit, the extractor is not made for social...