
LTTextLineHorizontal nested immediately under LTPage

Open lifepillar opened this issue 2 years ago • 6 comments

With the demo PDF from this page (direct link to PDF), pdfminer.six places a few LTTextLineHorizontal objects directly under the LTPage object. I don't think this is expected. For instance, it breaks the script in your documentation:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
for page_layout in extract_pages("demo1.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        print(character.fontname)
                        print(character.size)

It fails with:

TypeError: 'LTChar' object is not iterable
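
Until the hierarchy is consistent again, a traversal that does not assume a fixed nesting depth sidesteps the error. The following is only a sketch: `iter_leaves` and `print_fonts` are hypothetical helper names, not part of pdfminer.six, and the approach assumes that layout containers (LTPage, LTTextBox, LTTextLine, LTFigure) are iterable while LTChar and LTAnno are not.

```python
from collections.abc import Iterable, Iterator
from typing import Any


def iter_leaves(element: Any) -> Iterator[Any]:
    """Yield every non-container item below `element`, at any depth.

    Hypothetical helper: it relies only on containers being iterable,
    so it works whether a text line sits inside a text box or directly
    under the page.
    """
    if isinstance(element, Iterable) and not isinstance(element, (str, bytes)):
        for child in element:
            yield from iter_leaves(child)
    else:
        yield element


def print_fonts(pdf_path: str) -> None:
    """Print font name and size for every LTChar in the document."""
    # Imported lazily so that iter_leaves stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar

    for page_layout in extract_pages(pdf_path):
        for item in iter_leaves(page_layout):
            if isinstance(item, LTChar):
                print(item.fontname)
                print(item.size)
```

Calling `print_fonts("demo1.pdf")` would then replace the nested loops above. Note that, unlike the documentation script, this sketch also descends into non-text containers such as figures, so it may report characters the original loop would skip.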

Edit: completed the report with error message.

lifepillar, Jun 03 '22 15:06

I have found a related issue: https://github.com/pdfminer/pdfminer.six/issues/526.

Btw, I am using Pdfminer.six v20220524.

lifepillar, Jun 03 '22 18:06

This started happening with version 20220319, possibly related to https://github.com/pdfminer/pdfminer.six/pull/659

Reproducible example:

import urllib.request
from io import BytesIO

import pdfminer.high_level

pdf_url = 'https://www.orimi.com/pdf-test.pdf'
pdfminer_page = list(
    pdfminer.high_level.extract_pages(BytesIO(urllib.request.urlopen(pdf_url).read()))
)[0]

text_boxes = [i for i in pdfminer_page if hasattr(i, "get_text")]
print(text_boxes)

Before version 20220319 (here, 20211012) it shows:

[<LTTextBoxHorizontal(0) 197.400,660.468,200.736,672.468 ' \n'>,
 <LTTextBoxHorizontal(1) 72.000,455.448,532.853,661.368 ' \nPDF Test File \n \nCongratulations, your computer is equipped with a PDF (Portable Document Format) \nreader!  You should be able to view any of the PDF documents and forms available on \nour site.  PDF forms are indicated by these icons: \n \nYukon Department of Education \nBox 2703 \nWhitehorse,Yukon \nCanada \nY1A 2C6 \n \nPlease visit our website at:  http://www.education.gov.yk.ca/\n   \n'>,
 <LTTextBoxHorizontal(2) 348.900,579.588,372.962,591.588 '  or  \n'>,
 <LTTextBoxHorizontal(3) 384.900,579.588,398.284,591.588 '.   \n'>]

20191107 (same grouping; only the top bounding-box coordinates differ):

[<LTTextBoxHorizontal(0) 197.400,660.468,200.736,676.440 ' \n'>,
 <LTTextBoxHorizontal(1) 72.000,455.448,532.853,665.340 ' \nPDF Test File \n \nCongratulations, your computer is equipped with a PDF (Portable Document Format) \nreader!  You should be able to view any of the PDF documents and forms available on \nour site.  PDF forms are indicated by these icons: \n \nYukon Department of Education \nBox 2703 \nWhitehorse,Yukon \nCanada \nY1A 2C6 \n \nPlease visit our website at:  http://www.education.gov.yk.ca/\n   \n'>,
 <LTTextBoxHorizontal(2) 348.900,579.588,372.962,595.560 '  or  \n'>,
 <LTTextBoxHorizontal(3) 384.900,579.588,398.284,595.560 '.   \n'>]

After (20220319):

[<LTTextBoxHorizontal(0) 72.000,635.568,148.627,647.568 'PDF Test File \n'>,
 <LTTextBoxHorizontal(1) 72.000,579.588,532.853,619.968 'Congratulations, your computer is equipped with a PDF (Portable Document Format) \nreader!  You should be able to view any of the PDF documents and forms available on \nour site.  PDF forms are indicated by these icons: \n'>,
 <LTTextBoxHorizontal(2) 348.900,579.588,372.962,591.588 '  or  \n'>,
 <LTTextBoxHorizontal(3) 384.900,579.588,398.284,591.588 '.   \n'>,
 <LTTextBoxHorizontal(4) 72.000,496.848,245.380,564.048 'Yukon Department of Education \nBox 2703 \nWhitehorse,Yukon \nCanada \nY1A 2C6 \n'>,
 <LTTextBoxHorizontal(5) 72.000,469.248,389.460,481.248 'Please visit our website at:  http://www.education.gov.yk.ca/\n'>,
 <LTTextLineHorizontal 197.400,660.468,200.736,672.468 ' \n'>,
 <LTTextLineHorizontal 72.000,649.368,75.336,661.368 ' \n'>,
 <LTTextLineHorizontal 72.000,621.768,75.336,633.768 ' \n'>,
 <LTTextLineHorizontal 72.000,565.848,75.336,577.848 ' \n'>,
 <LTTextLineHorizontal 72.000,483.048,75.336,495.048 ' \n'>,
 <LTTextLineHorizontal 72.000,455.448,82.061,467.448 '   \n'>]

20220524:

[<LTTextBoxHorizontal(0) 72.000,635.568,148.627,647.568 'PDF Test File \n'>,
 <LTTextBoxHorizontal(1) 72.000,579.588,532.853,619.968 'Congratulations, your computer is equipped with a PDF (Portable Document Format) \nreader!  You should be able to view any of the PDF documents and forms available on \nour site.  PDF forms are indicated by these icons: \n'>,
 <LTTextBoxHorizontal(2) 348.900,579.588,372.962,591.588 '  or  \n'>,
 <LTTextBoxHorizontal(3) 384.900,579.588,398.284,591.588 '.   \n'>,
 <LTTextBoxHorizontal(4) 72.000,496.848,245.380,564.048 'Yukon Department of Education \nBox 2703 \nWhitehorse,Yukon \nCanada \nY1A 2C6 \n'>,
 <LTTextBoxHorizontal(5) 72.000,469.248,389.460,481.248 'Please visit our website at:  http://www.education.gov.yk.ca/\n'>,
 <LTTextLineHorizontal 197.400,660.468,200.736,672.468 ' \n'>,
 <LTTextLineHorizontal 72.000,649.368,75.336,661.368 ' \n'>,
 <LTTextLineHorizontal 72.000,621.768,75.336,633.768 ' \n'>,
 <LTTextLineHorizontal 72.000,565.848,75.336,577.848 ' \n'>,
 <LTTextLineHorizontal 72.000,483.048,75.336,495.048 ' \n'>,
 <LTTextLineHorizontal 72.000,455.448,82.061,467.448 '   \n'>]
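
To compare releases without reading the full repr, it can help to tally the element classes that sit directly under the page. A minimal sketch, where `top_level_types` is a hypothetical helper (not part of pdfminer.six) and the argument is assumed to be a page object such as `pdfminer_page` from the reproducible example:

```python
from collections import Counter
from typing import Iterable


def top_level_types(page: Iterable[object]) -> Counter:
    """Count the direct children of a page by class name.

    Hypothetical helper: any LTTextLine* entry in the result means a
    text line was attached to the page without an enclosing text box.
    """
    return Counter(type(element).__name__ for element in page)
```

Running `top_level_types(pdfminer_page)` under each version should make the change visible at a glance: for the text elements listed above, 20220319 and 20220524 show six LTTextBoxHorizontal and six LTTextLineHorizontal objects, whereas the older versions show boxes only.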

hellpanderrr, Jun 11 '22 02:06

This was introduced by: 43c8fc8557528463c99598049b7005ae96ab8084

pietermarsman, Jun 25 '22 19:06

This happens because these text lines contain only white space. Previously, all text lines with zero width or height were added directly under the page object. After the change, text lines that contain only white space are also added directly under the page.

I guess it is preferable for the hierarchy to always be the same: LTPage -> LTTextBox -> LTTextLine -> LTChar.

The empty text lines at this line of code (https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/layout.py#L949) need to be wrapped in an LTTextBox.

pietermarsman, Jun 25 '22 20:06

I'm picking up this issue.

KunalGehlot, Aug 23 '22 09:08

This commit fixes the issue, but I'm unsure if it's the ideal way.

I tested the change with @lifepillar's script and manually checked the hierarchy of the LT objects.

However, running nox gives tests/test_layout.py:130: AssertionError and tests/test_layout.py:148: AssertionError, because the tests hardcode assert len(textboxes) == 3 and now fail with AssertionError: assert 7 == 3.

Update: I've removed the branch to avoid confusion. All I did was add these two lines after textboxes = list(self.group_textlines(laparams, textlines)):

empties = list(self.group_textlines(laparams, empties))
textboxes.extend(empties)

KunalGehlot, Aug 24 '22 10:08

@KunalGehlot Can you create a PR with that specific commit so that I can review and merge it?

pietermarsman, Oct 15 '22 07:10