jusText issues

Extraction does not terminate

4

On content-rich webpages the algorithm does not seem to terminate: it hangs and has to be interrupted manually. See adbar/trafilatura#189. Here is an archived version of the page where...

adbar

bug

Duplicate text output

jusText outputs the title of this webpage twice: https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html (archived as https://web.archive.org/web/20211020174043/https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html). The rest of the extraction is not completely clean either (e.g. "REKLAMA" elements).

adbar

wont-fix

XML encoding is not taken into account

3

While jusText correctly extracts the page encoding of an HTML page from the meta attribute, it does not do so for XHTML, which uses an XML header: ``` ```
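To illustrate the kind of check this issue asks for, here is a minimal sketch that reads the encoding from a leading XML declaration. It is not jusText's code: the helper name `sniff_xml_encoding`, the regular expression, and the UTF-8 default are all assumptions made for this example, and it ignores byte-order marks and UTF-16 input.

```python
import re

# Hypothetical helper, not part of jusText.
# Matches a declaration such as: <?xml version="1.0" encoding="ISO-8859-2"?>
XML_DECLARATION = re.compile(
    rb"""^\s*<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in a leading XML declaration, else `default`."""
    # The declaration must come first, so only the start of the document is checked.
    match = XML_DECLARATION.match(raw[:1024])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()
```

A caller could decode the raw bytes with the returned name before handing the text to the extractor, falling back to the meta attribute when no declaration is present.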

DavidNemeskey

jusText skips content from HTML lists (ul, ol)

1

Hey, @miso-belica. It seems that jusText cannot extract content from HTML lists (ul, ol tags). For example, only "Some text A. Some text C." will be extracted from: ```Some...

polosatyi

bug

String vs. Unicode error with Python 3.5

1

I was trying to use JusText 2.2.0 with Python 3.5. However, after installing and running one of the example extractions without piping it to a file and just dumping to...

fnl

Broken stopword list (German)

6

https://github.com/miso-belica/jusText/blob/dev/justext/stoplists/German.txt Most of those words are not stop words, for example "Saison", "Jahrhunderts", "Titel" and many more.

schreon

bug

UnicodeDecodeError when crawling pages

I've installed JusText on a Windows 2012 Server machine and it seems to be running fine overall. However, about 30-40% of the HTML files crash because of encoding issues. The...
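The excerpt is truncated, so the actual cause of these crashes is not known here. As a general illustration of avoiding a `UnicodeDecodeError` when reading crawled files, the sketch below tries several encodings in order before falling back to lossy decoding. The function name `decode_html` and the candidate list are assumptions chosen for this example, not something jusText provides.

```python
def decode_html(raw: bytes, candidates=("utf-8", "cp1250", "cp1252")) -> str:
    """Decode raw bytes by trying each candidate encoding in order."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    # Last resort: never raise; undecodable bytes become U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

Reading files in binary mode and decoding explicitly like this keeps the result independent of the platform's default codec, which differs between Windows and other systems.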

miso-belica

bug

Consider switching from lxml's clean_html for enhanced security (and possibly performance)

2

I'd like to bring to your attention that we are [discussing](https://bugs.launchpad.net/lxml/+bug/1958539) the possibility of removing lxml's clean_html functionality from lxml library. Over the past years, there have been several concerning...

frenzymadness

enhancement

dependencies

Preprocessing of header blocks

3

Thanks for sharing and maintaining the repository. I find the code very readable and great for extracting text from HTML. It seems that the current version (3.0.0) does not process header...

jojennin

bug

Broken import with lxml >= 5.2.0

The newest version of lxml breaks the current code due to changes in `lxml.html.clean` (see also #46).
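For context, lxml 5.2.0 moved the `lxml.html.clean` code into a separate distribution, `lxml_html_clean`. One possible way to tolerate both layouts is a guarded import, sketched below. The helper name `load_cleaner` is an assumption for this example, and this is not necessarily how jusText resolved the issue.

```python
def load_cleaner():
    """Return the Cleaner class from whichever module provides it, or None."""
    try:
        # Standalone distribution that hosts the cleaner since lxml 5.2.0.
        from lxml_html_clean import Cleaner
        return Cleaner
    except ImportError:
        pass
    try:
        # Older lxml releases ship the cleaner inside lxml itself.
        from lxml.html.clean import Cleaner
        return Cleaner
    except ImportError:
        # Neither module is importable in this environment.
        return None
```

Returning `None` instead of raising lets the caller decide whether to fail with a clear installation hint or to continue without the cleaning step.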

adbar
