Adrien Barbaresi

Results: 412 comments by Adrien Barbaresi

Hi @zirkelc, it makes perfect sense to give back a confidence score to the user. I used a binary criterion at some point and then removed it. The feature would...

Trafilatura actually uses a combination of several extractors, so the different scores wouldn't be commensurable. The best we could do would be to mimic a score and/or try to give...

Interesting idea, your "words among html2text" metric probably works well if you have several webpages from the same source. Then the variance will indeed be among the main text. This...
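To make the idea concrete, here is a minimal sketch of one possible reading of such a metric: the share of words in the full text dump (for example html2text output) that also occur in the extracted main text. The comment above is truncated, so the exact definition is unknown; the function name `word_coverage` and the tokenization are assumptions made for illustration, not part of Trafilatura.

```python
import re


def word_coverage(extracted_text: str, full_text: str) -> float:
    """Hypothetical score: fraction of words in full_text found in extracted_text."""
    def tokenize(text: str) -> list[str]:
        # Lowercased word tokens; punctuation and whitespace are dropped.
        return re.findall(r"\w+", text.lower())

    full_words = tokenize(full_text)
    if not full_words:
        # Nothing to compare against, so report no coverage.
        return 0.0
    extracted_vocab = set(tokenize(extracted_text))
    kept = sum(1 for word in full_words if word in extracted_vocab)
    return kept / len(full_words)
```

A low value could mean that most of the page is boilerplate, or that the extractor missed content; as the comment suggests, comparing values across several pages from the same source is probably more informative than reading a single score in isolation.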

By all means, please go ahead. If everything works, you'd have another extraction method and the confidence score would be a useful by-product. I'm curious to see how that goes.

@zirkelc I guess I can close this thread now that you've added the `is_probably_readerable` function?

Usually the bottom section contains unwanted links; here, however, there is actual content to be found. Especially with `include_links` on, relevant parts are missing.

You could try `favor_recall=True` as a parameter to the extraction function. The culprit would be here; obviously the approach is limited, as the fixed thresholds cannot work all the time:...
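As a brief illustration of the suggestion, the sketch below passes `favor_recall=True` to Trafilatura's `extract` function. The wrapper name `extract_favoring_recall` is hypothetical, and the code assumes the `trafilatura` package is installed; it does not reproduce the threshold logic the comment points to.

```python
from typing import Optional


def extract_favoring_recall(html: str) -> Optional[str]:
    """Hypothetical wrapper: extract main text while preferring recall over precision."""
    # Imported lazily so the sketch can be read without the package installed.
    from trafilatura import extract

    # favor_recall=True asks the extractor to keep more text when in doubt.
    # The result is the extracted text, or None if nothing usable was found.
    return extract(html, favor_recall=True)
```

Comparing the output with and without the option on a problematic page is a quick way to check whether the missing section is recovered.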

@mikhainin Thank you for reporting the bug and the solution. Could you please draft a PR with your solution? If the tests pass, I would integrate it.

Note: the issue is now fixed if the recall option is on.

Thanks, I'll have a look at it later. I have a few fixes in the works; once a few PRs are merged, I'll make a new release.