Adrien Barbaresi
Articles often feature text snippets that suggest further reading, and not all of them are handled properly by the extractors. Is there an algorithmic way to discard such inserts? Example: "Mehr zum...
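One possible direction is a simple prefix-and-length heuristic, sketched below. This is only an illustration, not what the extractors do: the function names, the cue phrases other than "Mehr zum", and the word threshold are all assumptions made for the example.

```python
import re

# Hypothetical cue phrases that often introduce "further reading" teasers.
# "mehr zum" comes from the example above; the others are assumptions.
TEASER_PREFIXES = ("mehr zum", "read more", "see also", "lesen sie auch")
MAX_TEASER_WORDS = 12  # assumed threshold: teasers tend to be short


def is_teaser(paragraph: str) -> bool:
    """Return True if a paragraph looks like a short 'further reading' insert."""
    # Normalize whitespace and case before matching the cue phrases.
    text = re.sub(r"\s+", " ", paragraph).strip().lower()
    return text.startswith(TEASER_PREFIXES) and len(text.split()) <= MAX_TEASER_WORDS


def drop_teasers(paragraphs):
    """Keep only the paragraphs that do not match the teaser heuristic."""
    return [p for p in paragraphs if not is_teaser(p)]
```

A list of cue phrases is language-dependent and will miss many cases, so a heuristic like this would probably need to be combined with structural signals such as link density.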
Feature requests such as #38 and #48 deal with the inclusion of particular HTML elements in the output. To allow for easier inclusion and less hacky code, it would be best...
**Problem**

- Missing or scrambled information
- Formatting: spaces between spans

**Example**

Retain birth dates and places on Wikipedia without adding boilerplate elements: https://en.wikipedia.org/wiki/Rosanna_Carteri ``` Rosanna CarteriRosanna Carteri in 1964Born(1930-12-14)14...
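The missing spaces between spans can be reproduced and avoided with the standard library alone. The sketch below is an assumption-laden illustration, not the project's extraction code: the class and function names are made up, and the input fragment is invented rather than taken from the Wikipedia page.

```python
from html.parser import HTMLParser


class SpacedTextExtractor(HTMLParser):
    """Collect text nodes so they can be joined with an explicit separator."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip whitespace-only nodes, keep everything else.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def text_with_spaces(html: str) -> str:
    """Return the text of an HTML fragment with a space between text nodes."""
    parser = SpacedTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented fragment mimicking adjacent spans without whitespace between them.
fragment = "<div><span>Rosanna Carteri</span><span>in 1964</span><span>Born</span></div>"
print(text_with_spaces(fragment))  # Rosanna Carteri in 1964 Born
```

Joining every text node with a space is deliberately crude: it would also split a word whose letters are wrapped in separate inline elements, so a real fix would need to distinguish those cases.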
The goal is to modify the internal subclass LXMLDocument so as to avoid converting the output from a string back to an LXML tree: https://github.com/adbar/trafilatura/blob/3b4cb19d615c2df17cdbb72a9309d724abcef91a/trafilatura/external.py#L56 The function `readability.Document.get_clean_html`...
On content-rich webpages, the algorithm does not seem to terminate, leading to a hang which has to be interrupted manually. See adbar/trafilatura#189. Here is an archived version of the page where...
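Until the cause is found, a caller could guard the extraction step with a time limit. The following is a minimal sketch under stated assumptions: `time_limit` and `run_with_limit` are hypothetical helpers, not part of any library mentioned here, and the approach relies on `SIGALRM`, so it only works on Unix and in the main thread.

```python
import signal
from contextlib import contextmanager


@contextmanager
def time_limit(seconds: float):
    """Raise TimeoutError if the enclosed block runs longer than `seconds`."""

    def _raise_timeout(signum, frame):
        raise TimeoutError(f"timed out after {seconds} s")

    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Cancel the timer and restore whatever handler was installed before.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous if previous is not None else signal.SIG_DFL)


def run_with_limit(func, *args, seconds=5.0, default=None):
    """Call func(*args); return `default` if it exceeds the time limit."""
    try:
        with time_limit(seconds):
            return func(*args)
    except TimeoutError:
        return default
```

The signal handler only runs when control returns to the Python interpreter, so a loop stuck inside C code may not be interrupted. This is a workaround for callers and does not address the non-termination itself.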
Justext outputs the title of this webpage twice: https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html (archived as https://web.archive.org/web/20211020174043/https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html) The rest of the extraction is not completely clean either (e.g. "REKLAMA" elements).
### Issue description or question

I recently let the GitHub bot write a PR. There are some things the bot missed when it comes to the `len()` statements; I'd like...
Hi, thanks for the package, which I'm using a lot in different projects. I was profiling my code with `pprofile` and noticed a potential performance issue in the function `get_tld_names()`....