Adrien Barbaresi

99 issues by Adrien Barbaresi

So far this parameter is pending deprecation. It could be re-used to do what most expect: decide manually (not only based on output format) whether to run metadata extraction. Focusing...

enhancement

So far Trafilatura is entwined with a version of readability-lxml; it also uses jusText as a fallback before triggering the baseline extraction as a last resort. This combination is robust and performs...

enhancement
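The fallback order described in this issue (main extractor, then a fallback, then a baseline as last resort) can be sketched generically. This is only an illustration of the pattern, not Trafilatura's actual code: `extract_with_fallbacks`, the `min_length` threshold, and the extractor callables are all hypothetical names introduced here.

```python
from typing import Callable, Iterable, Optional

# Hypothetical type: an extractor takes an HTML string and returns text or None.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallbacks(
    html: str,
    extractors: Iterable[Extractor],
    min_length: int = 25,  # assumed threshold, not taken from the library
) -> Optional[str]:
    """Try each extractor in order and return the first usable result."""
    for extractor in extractors:
        try:
            text = extractor(html)
        except Exception:
            # A failing extractor should not stop the chain.
            continue
        if text and len(text.strip()) >= min_length:
            return text.strip()
    # Every extractor failed or returned too little text.
    return None
```

Keeping the extractors as an ordered list of plain callables would make it possible to swap or reorder the components without touching the control flow.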

- If the corresponding Python package is installed, add [Accept-Encoding: zstd](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding) to HTTP headers and request processing
- Review the Accept-Encoding headers in general (e.g. quality value syntax)

enhancement
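A minimal sketch of the first point, assuming the optional codecs are provided by packages importable as `zstandard` and `brotli` (the issue only says "the corresponding Python package", so both module names are assumptions). The helper names `accept_encoding` and `_module_available` are hypothetical.

```python
import importlib.util

# Assumption: header token -> importable module providing the decoder.
OPTIONAL_CODECS = {"br": "brotli", "zstd": "zstandard"}


def _module_available(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def accept_encoding(is_available=_module_available) -> str:
    """Build an Accept-Encoding value listing only encodings that can be decoded."""
    # gzip and deflate can be decoded with the standard library.
    encodings = ["gzip", "deflate"]
    for token, module in OPTIONAL_CODECS.items():
        if is_available(module):
            encodings.append(token)
    return ", ".join(encodings)


headers = {"Accept-Encoding": accept_encoding()}
```

The availability check is passed in as a parameter so the behavior can be exercised without installing or removing packages. Quality values (for example `gzip;q=0.9`) are left out here and would need a separate review, as the second point suggests.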

Articles on the Fox News website contain links to other articles in the middle of the text. The links all follow this pattern: p > a > strong > u...

bug

### Discussed in https://github.com/adbar/trafilatura/discussions/516

Originally posted by **mertdeveci5** February 29, 2024

I read that this might be a feature request, hence sharing here if someone figured it out. On using...

bug

There are Python libraries to convert the output to PDF. Would the result be robust and stable, and would it be a useful addition? Note: this way around is definitely...

feedback

Hi, the bug listed here is related to `readability-lxml`: https://github.com/adbar/trafilatura/issues/43 In the following the first p-element (_Mit dem KfW-Unternehmerkredit..._) is missing from the output: ``` html_fragment = '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\n\t\n\nMit dem KfW-Unternehmerkredit...

Hi, a user ran into this bug: https://github.com/adbar/trafilatura/issues/21 There are links which end up as orphans between paragraphs, which breaks text rendering and conversion. The problem comes from the...

As of now only strings containing HTML seem to be acceptable input. Is there a way to pass an object parsed by LXML or `lxml.html` (types: `etree._ElementTree` and `html.HtmlElement`) straight...

- make code compatible with Python 3
- cleaning and linting
- try/catch fix around type error in model.py