jusText icon indicating copy to clipboard operation
jusText copied to clipboard

Preprocessing of header blocks

Open jojennin opened this issue 2 years ago • 3 comments

Thanks sharing and maintaining the repository. I find the code very readable and great for extracting text from HTML.

It seems that the current version (3.0.0) does not process header blocks as is described in the documentation:

The preprocessing looks for short header blocks which precede good blocks and at the same time there is no more than MAX_HEADING_DISTANCE characters between the header block and the good block.

The code for this appears to be implemented on line 314 of core.py and works by searching for good blocks that come after the short header block by looking at the class_type attribute. However, at this point in the code, the class_type of each paragraph has only been initialized and not copied from cf_class, so when searching for a good block within max_heading_distance characters, a good block will never be found, as class_type has not been copied for subsequent paragraphs.

Here is a simple dummy HTML file that I have created to provide an example of how a short block that comes between a short header and a good block will be classified as "bad". However, had the header block been labeled as "neargood" from the preprocessing step, the short block would have been labeled as "good".

Is this preprocessing of the header blocks as described above intended? If so, then I think the documentation should be updated to reflect this change. Otherwise, it seems a simple loop over the paragraphs copying cf_class to class_type (as was done in version 2.2.0) would suffice to fix the issue.

jojennin avatar Nov 09 '22 03:11 jojennin

Hello @jojennin, thanks for your detailed report and kind words. If this worked in v2.2 and not now I had to break something :( I will quickly go through git bisect if I manage to find the commit. Thank you for the HTML to test this.

miso-belica avatar Nov 26 '22 14:11 miso-belica

I just created a test for version in main branch and it seems it is working. Maybe I don't understand the issue here? Can you please create a test where you can demonstrate that the current version fails?

from unittest.mock import ANY


def test_heading_should_be_processed():
    html = """<html>
      <head>
        <title>Eddie Brunner - Wikipedia</title>
      </head>
      <body>
        <h1>Eddie Brunner</h1>
        <p>
        <b>Eduard "Eddie" Brunner</b> (July 19, 1912 - July 18, 1960, Zurich).
        </p>
        <p>
        Eddie Brunner was a Swiss jazz reedist and bandleader.
        Brunner learned to play clarinet, piano, and tenor and alto saxophone,
        and began playing professionally in the early 1930s with Rene Dumont,
        Jack and Louis de Vries and Marek Weber. In 1936 he moved to Paris and
        recorded under his own name as well as with the Goldene Sieben and Louis Bacon;
        he moved back to Switzerland once the
        <a href="/wiki/World_War_II" title="World War II">war</a>
        had begun. He joined Teddy Stauffer's band, and in 1941 took over leadership of 
        the group until 1947, when it dissolved. He led a new six-piece ensemble in 1948,
        and recorded for radio and television broadcasts in the 1950s.
        </p>
      </body>
    </html>"""
    paragraphs = justext.justext(html, justext.get_stoplist("English"))

    assert [p.text for p in paragraphs] == [
        "Eddie Brunner",
        'Eduard "Eddie" Brunner (July 19, 1912 - July 18, 1960, Zurich).',
        ANY,
    ]

miso-belica avatar Nov 26 '22 14:11 miso-belica

Thank you @miso-belica for your response.

The issue is not in the retrieval of paragraphs, rather in the classification (specifically in the context-aware classification). Because the cf_class attribute is not copied to the class_type attribute of all paragraphs before the start of the context-aware classification, the classification of short headings is not consistent with what is mentioned in the documentation (more details on where this occurs in the code are in my original post). I provide an example below:

import justext
import justext.core as jt

html = """<html>
      <head>
        <title>Eddie Brunner - Wikipedia</title>
      </head>
      <body>
        <h1>Eddie Brunner</h1>
        <p>
        <b>Eduard "Eddie" Brunner</b> (July 19, 1912 - July 18, 1960, Zurich).
        </p>
        <p>
        Eddie Brunner was a Swiss jazz reedist and bandleader.
        Brunner learned to play clarinet, piano, and tenor and alto saxophone,
        and began playing professionally in the early 1930s with Rene Dumont,
        Jack and Louis de Vries and Marek Weber. In 1936 he moved to Paris and
        recorded under his own name as well as with the Goldene Sieben and Louis Bacon;
        he moved back to Switzerland once the
        <a href="/wiki/World_War_II" title="World War II">war</a>
        had begun. He joined Teddy Stauffer's band, and in 1941 took over leadership of
        the group until 1947, when it dissolved. He led a new six-piece ensemble in 1948,
        and recorded for radio and television broadcasts in the 1950s.
        </p>
      </body>
    </html>
"""


def justext_v3(html, stoplist):
  paragraphs = justext.justext(html, stoplist)

  return paragraphs


def justext_v220_copyclass(html, stoplist):
  dom = jt.html_to_dom(html)
  dom = jt.preprocessor(dom)

  paragraphs = jt.ParagraphMaker.make_paragraphs(dom)

  jt.classify_paragraphs(paragraphs, stoplist)

  # Copy the class_type attribute
  # as is done in v2.2.0 (Line 296 of core.py in v2.2.0)
  # Commenting out this loop will result in the same classification
  # as the current version
  for paragraph in paragraphs:
    paragraph.class_type = paragraph.cf_class

  jt.revise_paragraph_classification(paragraphs)

  return paragraphs


def main():

  stoplist = justext.get_stoplist("English")
  paragraphs = justext_v3(html, stoplist)

  paragraphs_v220_copyclass = justext_v220_copyclass(html, stoplist)

  for p3, p2 in zip(paragraphs, paragraphs_v220_copyclass):
    if p3.class_type != p2.class_type:
      print(p3.text, p3.class_type)
      print(p2.text, p2.class_type)


if __name__ == "__main__":
  main()

Running this script results in the following:

Eduard "Eddie" Brunner (July 19, 1912 - July 18, 1960, Zurich). bad
Eduard "Eddie" Brunner (July 19, 1912 - July 18, 1960, Zurich). good

If you comment out the loop that copies the cf_class to class_type attribute, the classification is the same as the current version of jusText.

The difference is due to the classification of short header blocks. If the cf_class is copied to class_type before the context-aware classification, then the first loop in the function revise_paragraph_classification (the header preprocessing loop) changes the attribute class_type of the short header block (<h1>Eddie Brunner</h1>) from short to neargood. Consequently, this changes the classfication of the paragraph <b>Eduard "Eddie" Brunner (July 19, 1912 - July 18, 1960, Zurich).</b> from bad to good which occurs in the loop that processes short paragraphs.

jojennin avatar Dec 04 '22 05:12 jojennin