Adrien Barbaresi

412 comments by Adrien Barbaresi

Thanks, the tests pass now. I entered a series of minor changes to implement; the PR can be merged soon.

Additional notes:
- The regular expressions used here are slightly different from the legacy ones at the top of the file, probably because they're newer? It would be nice to...

I can take care of the docs before the next release and you can improve on that later if you want. As you say, the `readability_lxml` module is out of...

The task is complex and the focused crawler integrated in Trafilatura does not solve all problems. I cannot answer this question in general. Do you have a precise example for...

If you set the logging level to debug, you'll see that the download fails (403 error), so there are no links to extract.
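
For reference, a minimal sketch of how the logging level can be set to debug with Python's standard `logging` module. The helper name `enable_debug_logging` is only illustrative, and the sketch assumes the library in question emits its messages through `logging`:

```python
import logging
import sys


def enable_debug_logging(stream=sys.stderr):
    """Route DEBUG-level records from every logger to `stream`."""
    # force=True (Python 3.8+) replaces handlers that were configured earlier.
    logging.basicConfig(
        stream=stream,
        level=logging.DEBUG,
        format="%(levelname)s %(name)s: %(message)s",
        force=True,
    )


enable_debug_logging()
# Any download or extraction call made after this point will print its
# debug messages, provided the library logs through the `logging` module.
```

With this in place, a failed download should show up in the log output instead of silently producing an empty result.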

You have to use a more complex download utility to make sure you get the full content; then you can use Trafilatura on the HTML.

The PR moves `is_known()` out of the Lemmatizer class and removes the `greedy` argument, all good!

You can add XPath expressions to remove elements, but you cannot explicitly add elements; that could be a useful improvement. As for Reddit, the extractor is not made for social...