Adrien Barbaresi

412 comments by Adrien Barbaresi

That's interesting, feel free to share results if you find anything else. I patched together a series of heuristics, but I lack the time to evaluate them properly. Besides, the...

It changes the extraction parameters so the thresholds below have to be revised.

Hi @majcl, the problem is that bulk removal of certain sections can impact recall. Maybe I missed something though; if the tests pass, feel free to draft a pull request,...

Hi @kondounagi, could you please be more specific? Please bear in mind however that the library is not geared towards browser automation.

I understand, thanks for the details. The issue seems related to #322 and could be addressed by providing a more configurable function to fetch HTML documents. Let's keep the issue...

@kondounagi Can you try the new `Response` object described in issue #322 and see if it works for you?

This is not only about the section you mention; there are a number of problems with deeply nested tags on this page.

Hi @clach04, I believe people would find it useful indeed. You may want to look at this repository which implements a basic API: https://github.com/Soontao/trafilatura-srv (see `app.py`)

@clach04 There is now an official implementation, see the [API page in the docs](https://trafilatura.readthedocs.io/en/latest/usage-api.html).

Hi @niksite, that's correct, I'm going to look into it.