Adrien Barbaresi issues

Results: 99 issues of Adrien Barbaresi

Teaser with link in article flow

Articles often feature text snippets that suggest further reading, and not all of them are handled properly by the extractors. Is there an algorithmic way to discard such inserts? Example: "Mehr zum...

enhancement
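One possible heuristic for the question raised in this issue is link density: a short block whose text sits mostly inside `<a>` elements is more likely a teaser than running prose. The sketch below is only an illustration using the Python standard library. The function names, the 0.5 threshold, the 200-character limit, and the sample fragments are all assumptions and are not taken from any extractor mentioned here.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Count text characters overall and inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # > 0 while inside at least one <a>
        self.total_chars = 0  # all text characters seen
        self.link_chars = 0   # text characters seen inside links

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.link_depth > 0:
            self.link_chars += n


def count_text(fragment):
    """Return (total_chars, link_chars) for an HTML fragment."""
    counter = LinkTextCounter()
    counter.feed(fragment)
    counter.close()
    return counter.total_chars, counter.link_chars


def link_density(fragment):
    """Share of the text that is link text; 0.0 for empty fragments."""
    total, linked = count_text(fragment)
    return linked / total if total else 0.0


def looks_like_teaser(fragment, threshold=0.5, max_chars=200):
    """Hypothetical rule: short block that consists mostly of link text."""
    total, _ = count_text(fragment)
    return 0 < total <= max_chars and link_density(fragment) > threshold


# Made-up sample fragments, for illustration only.
teaser = '<p>Further reading: <a href="/x">Hypothetical related article title here</a></p>'
body = ('<p>The committee met on Tuesday and published a long report. '
        'See <a href="/r">the report</a> for details.</p>')

print(looks_like_teaser(teaser), looks_like_teaser(body))
```

A rule like this would need tuning against real pages, since legitimate short paragraphs can also consist mainly of a link.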

Refactor code to provide a "keep-tags" option

Feature requests like in #38 and #48 deal with inclusion of particular HTML elements in the output. To allow for easier inclusion and less hacky code it would be best...

enhancement

Keeping all valid table information and formatting

**Problem** - Missing or scrambled information - Formatting: spaces between spans **Example** Retain birth dates and places on Wikipedia without adding boilerplate elements https://en.wikipedia.org/wiki/Rosanna_Carteri ``` Rosanna CarteriRosanna Carteri in 1964Born(1930-12-14)14...

bug

Investigate potential speed-up with customized readability-lxml

The goal is to modify the internal subclass LXMLDocument so as to avoid converting the output from a string back to an LXML tree: https://github.com/adbar/trafilatura/blob/3b4cb19d615c2df17cdbb72a9309d724abcef91a/trafilatura/external.py#L56 The function `readability.Document.get_clean_html`...

enhancement

Extraction does not terminate

On content-rich webpages the algorithm does not seem to terminate, leading to a deadlock which has to be interrupted. See adbar/trafilatura#189. Here is an archived version of the page where...

bug

Duplicate text output

Justext outputs the title of this webpage twice: https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html (archived as https://web.archive.org/web/20211020174043/https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html) The rest of the extraction is not completely clean either (e.g. "REKLAMA" elements).

wont-fix

investigate

help appreciated

work in progress

added a scraping tool for Python

Chained len statements

added a scraping tool for Python

Performance issue with internal trie structure?