jusText
jusText copied to clipboard
Preprocessing of header blocks
Thanks sharing and maintaining the repository. I find the code very readable and great for extracting text from HTML.
It seems that the current version (3.0.0) does not process header blocks as is described in the documentation:
The preprocessing looks for short header blocks which precede good blocks and at the same time there is no more than MAX_HEADING_DISTANCE characters between the header block and the good block.
The code for this appears to be implemented on line 314 of core.py and works by searching for good blocks that come after the short header block by looking at the class_type
attribute. However, at this point in the code, the class_type
of each paragraph has only been initialized and not copied from cf_class
, so when searching for a good block within max_heading_distance
characters, a good block will never be found, as class_type
has not been copied for subsequent paragraphs.
Here is a simple dummy HTML file that I have created to provide an example of how a short block that comes between a short header and a good block will be classified as "bad". However, had the header block been labeled as "neargood" from the preprocessing step, the short block would have been labeled as "good".
Is this preprocessing of the header blocks as described above intended? If so, then I think the documentation should be updated to reflect this change. Otherwise, it seems a simple loop over the paragraphs copying cf_class
to class_type
(as was done in version 2.2.0) would suffice to fix the issue.
Hello @jojennin, thanks for your detailed report and kind words. If this worked in v2.2 and not now I had to break something :( I will quickly go through git bisect if I manage to find the commit. Thank you for the HTML to test this.
I just created a test for version in main branch and it seems it is working. Maybe I don't understand the issue here? Can you please create a test where you can demonstrate that the current version fails?
from unittest.mock import ANY
def test_heading_should_be_processed():
html = """<html>
<head>
<title>Eddie Brunner - Wikipedia</title>
</head>
<body>
<h1>Eddie Brunner</h1>
<p>
<b>Eduard "Eddie" Brunner</b> (July 19, 1912 - July 18, 1960, Zurich).
</p>
<p>
Eddie Brunner was a Swiss jazz reedist and bandleader.
Brunner learned to play clarinet, piano, and tenor and alto saxophone,
and began playing professionally in the early 1930s with Rene Dumont,
Jack and Louis de Vries and Marek Weber. In 1936 he moved to Paris and
recorded under his own name as well as with the Goldene Sieben and Louis Bacon;
he moved back to Switzerland once the
<a href="/wiki/World_War_II" title="World War II">war</a>
had begun. He joined Teddy Stauffer's band, and in 1941 took over leadership of
the group until 1947, when it dissolved. He led a new six-piece ensemble in 1948,
and recorded for radio and television broadcasts in the 1950s.
</p>
</body>
</html>"""
paragraphs = justext.justext(html, justext.get_stoplist("English"))
assert [p.text for p in paragraphs] == [
"Eddie Brunner",
'Eduard "Eddie" Brunner (July 19, 1912 - July 18, 1960, Zurich).',
ANY,
]
Thank you @miso-belica for your response.
The issue is not in the retrieval of paragraphs, rather in the classification (specifically in the context-aware classification). Because the cf_class
attribute is not copied to the class_type
attribute of all paragraphs before the start of the context-aware classification, the classification of short headings is not consistent with what is mentioned in the documentation (more details on where this occurs in the code are in my original post). I provide an example below:
import justext
import justext.core as jt
html = """<html>
<head>
<title>Eddie Brunner - Wikipedia</title>
</head>
<body>
<h1>Eddie Brunner</h1>
<p>
<b>Eduard "Eddie" Brunner</b> (July 19, 1912 - July 18, 1960, Zurich).
</p>
<p>
Eddie Brunner was a Swiss jazz reedist and bandleader.
Brunner learned to play clarinet, piano, and tenor and alto saxophone,
and began playing professionally in the early 1930s with Rene Dumont,
Jack and Louis de Vries and Marek Weber. In 1936 he moved to Paris and
recorded under his own name as well as with the Goldene Sieben and Louis Bacon;
he moved back to Switzerland once the
<a href="/wiki/World_War_II" title="World War II">war</a>
had begun. He joined Teddy Stauffer's band, and in 1941 took over leadership of
the group until 1947, when it dissolved. He led a new six-piece ensemble in 1948,
and recorded for radio and television broadcasts in the 1950s.
</p>
</body>
</html>
"""
def justext_v3(html, stoplist):
paragraphs = justext.justext(html, stoplist)
return paragraphs
def justext_v220_copyclass(html, stoplist):
dom = jt.html_to_dom(html)
dom = jt.preprocessor(dom)
paragraphs = jt.ParagraphMaker.make_paragraphs(dom)
jt.classify_paragraphs(paragraphs, stoplist)
# Copy the class_type attribute
# as is done in v2.2.0 (Line 296 of core.py in v2.2.0)
# Commenting out this loop will result in the same classification
# as the current version
for paragraph in paragraphs:
paragraph.class_type = paragraph.cf_class
jt.revise_paragraph_classification(paragraphs)
return paragraphs
def main():
stoplist = justext.get_stoplist("English")
paragraphs = justext_v3(html, stoplist)
paragraphs_v220_copyclass = justext_v220_copyclass(html, stoplist)
for p3, p2 in zip(paragraphs, paragraphs_v220_copyclass):
if p3.class_type != p2.class_type:
print(p3.text, p3.class_type)
print(p2.text, p2.class_type)
if __name__ == "__main__":
main()
Running this script results in the following:
Eduard "Eddie" Brunner (July 19, 1912 - July 18, 1960, Zurich). bad
Eduard "Eddie" Brunner (July 19, 1912 - July 18, 1960, Zurich). good
If you comment out the loop that copies the cf_class
to class_type
attribute, the classification is the same as the current version of jusText.
The difference is due to the classification of short header blocks. If the cf_class
is copied to class_type
before the context-aware classification, then the first loop in the function revise_paragraph_classification
(the header preprocessing loop) changes the attribute class_type
of the short header block (<h1>Eddie Brunner</h1>
) from short to neargood. Consequently, this changes the classfication of the paragraph <b>Eduard "Eddie" Brunner (July 19, 1912 - July 18, 1960, Zurich).</b>
from bad to good which occurs in the loop that processes short paragraphs.