Adrien Barbaresi comments

Results: 412 comments of Adrien Barbaresi

Port of is_probably_readerable from mozilla

Thanks, the tests pass now. I entered a series of minor changes to implement; the PR can soon be merged.

Port of is_probably_readerable from mozilla

LGTM.

Port of is_probably_readerable from mozilla

Additional notes: - The regular expressions used here are slightly different from the legacy ones at the top of the file, probably because they're newer? It would be nice to...

Port of is_probably_readerable from mozilla

I can take care of the docs before the next release, and you can improve on that later if you want. As you say, the readability_lxml module is out of...

focused_crawl returns nothing

The task is complex and the focused crawler integrated in Trafilatura does not solve all problems. I cannot answer this question in general. Do you have a precise example for...

focused_crawl returns nothing

If you set the logging level to debug, you'll see that the download fails (403 error), so there are no links to extract.
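
A minimal sketch of that debugging step, assuming Python's standard `logging` module is used to surface the log messages. `focused_crawler` is imported from `trafilatura.spider`; the helper names `enable_debug_logging` and `crawl_with_debug` and the `max_seen_urls=1` setting are illustrative choices, not taken from the thread.

```python
import logging


def enable_debug_logging() -> None:
    """Show DEBUG-level messages, including those emitted during downloads."""
    logging.basicConfig(level=logging.DEBUG)
    # basicConfig() does nothing if the root logger already has handlers,
    # so the level is also set explicitly.
    logging.getLogger().setLevel(logging.DEBUG)


def crawl_with_debug(homepage: str):
    """Hypothetical helper: run one crawl step with debug logging enabled."""
    # Imported lazily so the logging setup works without the package installed.
    from trafilatura.spider import focused_crawler

    enable_debug_logging()
    # A failed download (for example a 403 response) should appear in the log
    # output and leaves no links to extract.
    to_visit, known_links = focused_crawler(homepage, max_seen_urls=1)
    return to_visit, known_links
```

Calling `crawl_with_debug()` with the homepage in question should make a failed download visible in the console output.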

focused_crawl returns nothing

You have to use a more complex download utility to make sure you get the full content; then you can use Trafilatura on the HTML.
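
One possible way to split downloading from extraction is sketched below. It assumes that a request with a browser-like `User-Agent` header is enough to get the page, which may not hold for sites that need JavaScript rendering or stricter checks. The header value and the helper names `build_request`, `download_html` and `extract_text` are hypothetical; `trafilatura.extract` is the only library call used.

```python
import urllib.request

# Assumption: a browser-like User-Agent; real sites may require more than this.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def build_request(url: str) -> urllib.request.Request:
    """Prepare a GET request carrying the custom headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def download_html(url: str, timeout: float = 30.0) -> str:
    """Fetch the page with a separate download step and return it as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_text(html: str):
    """Run Trafilatura on HTML that was obtained by any other means."""
    # Imported lazily so the download helpers work without the package installed.
    from trafilatura import extract

    return extract(html)
```

The same `extract_text()` step applies unchanged if the HTML comes from a headless browser or another tool instead of `download_html()`.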

Feat/simplify is known function

The PR moves `is_known()` out of the Lemmatizer class and removes the greedy argument, all good!

Feat/simplify is known function

Great, thanks.

Add option to provide XPaths for content extraction

You can add XPath expressions to remove elements, but you cannot explicitly add elements; that could be a useful improvement. As for Reddit, the extractor is not made for social...
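
The removal side mentioned here can be sketched with the `prune_xpath` argument of `trafilatura.extract`, which takes one XPath expression or a list of them. The expressions in `PRUNE_XPATHS` and the helper name `extract_without` are made up for illustration and would need to match the actual page structure.

```python
# Hypothetical expressions: elements to drop from the tree before extraction.
PRUNE_XPATHS = ["//aside", "//div[@class='comments']"]


def extract_without(html: str, xpaths=PRUNE_XPATHS):
    """Extract the main text after pruning the elements matched by `xpaths`."""
    # Imported lazily so the module can be loaded without the package installed.
    from trafilatura import extract

    return extract(html, prune_xpath=list(xpaths))
```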