Adrien Barbaresi issues

Results: 99 issues of Adrien Barbaresi

Use `with_metadata` parameter to decide whether to run metadata extraction

So far this parameter is pending deprecation. It could be re-used to do what most expect: decide manually (not only based on output format) whether to run metadata extraction. Focusing...

enhancement

Make cascade of different content extractors explicit and configurable

So far Trafilatura is entwined with a version of readability-lxml; it also uses jusText as a fallback before triggering the baseline extraction as a last resort. This combination is robust and performs...

enhancement
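A minimal sketch of what an explicit, configurable cascade could look like, assuming each extractor is a plain callable that takes HTML and returns text or `None`. The names `run_cascade`, `main_extractor`, `fallback_extractor`, `baseline_extractor` and the `min_length` threshold are hypothetical illustrations, not Trafilatura's actual API:

```python
import re
from typing import Callable, Optional, Sequence

# An extractor takes raw HTML and returns extracted text, or None on failure.
Extractor = Callable[[str], Optional[str]]


def run_cascade(
    html: str,
    extractors: Sequence[Extractor],
    min_length: int = 25,  # hypothetical acceptance threshold
) -> Optional[str]:
    """Try each extractor in order and return the first acceptable result."""
    for extract in extractors:
        try:
            text = extract(html)
        except Exception:
            # A failing extractor should not stop the cascade.
            continue
        if text and len(text.strip()) >= min_length:
            return text.strip()
    return None


# Hypothetical stand-ins for the real extractors.
def main_extractor(html: str) -> Optional[str]:
    return None  # pretend the main algorithm found nothing


def fallback_extractor(html: str) -> Optional[str]:
    return "short"  # pretend the fallback returned too little text


def baseline_extractor(html: str) -> Optional[str]:
    # Last resort: strip tags and normalize whitespace.
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


CASCADE = [main_extractor, fallback_extractor, baseline_extractor]
```

With this shape, reordering or dropping a step is just a matter of passing a different list, for example `run_cascade(html, [main_extractor, baseline_extractor])`.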

Downloads: Add ZStandard as optional Accept-Encoding header

- If the corresponding Python package is installed, add [Accept-Encoding: zstd](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding) to HTTP headers and request processing
- Review the Accept-Encoding headers in general (e.g. quality value syntax)

enhancement
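One way the optional header could be built, as a rough sketch: it assumes the relevant package is importable as `zstandard` and that `gzip` and `deflate` are the default encodings. The function names are hypothetical and do not reflect Trafilatura's download code:

```python
from importlib.util import find_spec
from typing import Callable, Dict


def module_available(name: str) -> bool:
    """Return True if a top-level module can be imported in this environment."""
    return find_spec(name) is not None


def accept_encoding_header(
    is_available: Callable[[str], bool] = module_available,
) -> Dict[str, str]:
    """Build an Accept-Encoding header, adding zstd only when it can be decoded."""
    encodings = ["gzip", "deflate"]  # assumed defaults
    if is_available("zstandard"):  # assumed import name of the zstd package
        encodings.append("zstd")
    return {"Accept-Encoding": ", ".join(encodings)}
```

Passing the availability check as a parameter keeps the function easy to test without installing or uninstalling anything. Quality values (such as `gzip;q=0.9`) could be appended to the list entries if the review of the header syntax calls for them.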

Link proportion heuristic fails for link paragraph

Articles on the Fox News website contain links to other articles in the middle of the texts, the links all follow this pattern: p > a > strong > u...

bug

Link section missed at bottom of page

### Discussed in https://github.com/adbar/trafilatura/discussions/516

Originally posted by **mertdeveci5** February 29, 2024

I read that this might be a feature request hence sharing here if someone figured it out. On using...

bug

PDF as output format?

There are Python libraries to convert the output to PDF, would the result be robust and stable and would it be a useful addition? Note: this way around is definitely...

feedback

Missing <p>-text

Hi, the bug listed here is related to `readability-lxml`: https://github.com/adbar/trafilatura/issues/43 In the following the first p-element (_Mit dem KfW-Unternehmerkredit..._) is missing from the output: ``` html_fragment = '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\n\t\n\nMit dem KfW-Unternehmerkredit...

Orphan links in doc.summary()

Pass LXML object straight to readability?

Python3-compatibility and code cleaning