jusText

Heuristic-based boilerplate removal tool

Results 11 jusText issues

On content-rich webpages the algorithm does not seem to terminate, leading to a hang that has to be interrupted. See adbar/trafilatura#189. Here is an archived version of the page where...

bug

Justext outputs the title of this webpage twice: https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html (archived as https://web.archive.org/web/20211020174043/https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html) The rest of the extraction is not completely clean either (e.g. "REKLAMA" elements).

wont-fix

While jusText correctly extracts the page encoding of an HTML page from the meta attribute, it does not do so for XHTML, which uses an XML header: ``` ```
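To illustrate the XHTML case described in this issue, here is a minimal sketch that reads the encoding name from an XML declaration at the start of a byte string. The function name `sniff_xml_encoding`, the `default` parameter, and the sample document are hypothetical and are not part of jusText; the snippet elided in the issue is not reproduced here.

```python
import re

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="ISO-8859-2"?> and captures the encoding name.
_XML_DECL = re.compile(
    rb"""^\s*<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in a leading XML declaration, else `default`."""
    # Only the first bytes are inspected; the declaration must come first.
    match = _XML_DECL.match(data[:1024])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


# Hypothetical XHTML input, used for illustration only.
sample = (
    b'<?xml version="1.0" encoding="ISO-8859-2"?>\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml"></html>'
)
print(sniff_xml_encoding(sample))
```

A caller could decode the raw bytes with the returned name before handing the text to an extractor, falling back to the default when no declaration is present.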

Hey, @miso-belica. It seems like jusText cannot extract content from HTML lists (ul, ol tags). For example, only "Some text A. Some text C." will be extracted from: ```Some...

bug

I was trying to use JusText 2.2.0 with Python 3.5. However, after installing and running one of the example extractions without piping it to a file and just dumping to...

https://github.com/miso-belica/jusText/blob/dev/justext/stoplists/German.txt Most of those words are not stop words. For example "Saison", "Jahrhunderts", "Titel" and many more.

bug

I've installed JusText on a Windows 2012 Server machine and it seems to be running fine overall. However, about 30-40% of the HTML files crash because of encoding issues. The...

bug

I'd like to bring to your attention that we are [discussing](https://bugs.launchpad.net/lxml/+bug/1958539) the possibility of removing lxml's clean_html functionality from lxml library. Over the past years, there have been several concerning...

enhancement
dependencies

Thanks for sharing and maintaining the repository. I find the code very readable and great for extracting text from HTML. It seems that the current version (3.0.0) does not process header...

bug

The newest version of lxml breaks the current code due to changes in `lxml.html.clean` (see also #46).
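One common way to tolerate a module that moves between releases is a guarded import that tries several candidate names in order. The sketch below is a generic illustration, not the fix adopted by jusText: the helper `first_importable` is hypothetical, and the candidate name `lxml_html_clean` is an assumption about where the cleaner lives in newer lxml installations.

```python
import importlib
from types import ModuleType
from typing import Iterable, Optional


def first_importable(names: Iterable[str]) -> Optional[ModuleType]:
    """Import and return the first available module in `names`, else None."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers both a missing package and a module that refuses to load.
            continue
    return None


# Assumption: older lxml releases expose the cleaner as `lxml.html.clean`,
# while newer ones may need the separate `lxml_html_clean` package.
clean_module = first_importable(["lxml.html.clean", "lxml_html_clean"])

# `Cleaner` is None when neither module can be imported.
Cleaner = getattr(clean_module, "Cleaner", None)
```

Code that depends on `Cleaner` can then check for `None` and report a clear installation hint instead of failing at import time.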