pdfminer.six
LTTextLineHorizontal nested immediately under LTPage
With the demo PDF from this page (direct link to PDF), Pdfminer.six parses a few LTTextLineHorizontal objects immediately under the LTPage object. I don't think this is expected: for instance, it breaks the script in your documentation:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
for page_layout in extract_pages("demo1.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        print(character.fontname)
                        print(character.size)
It fails with:
TypeError: 'LTChar' object is not iterable
Edit: completed the report with error message.
I have found a related issue: https://github.com/pdfminer/pdfminer.six/issues/526.
Btw, I am using Pdfminer.six v20220524.
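The TypeError above arises because LTTextLineHorizontal is itself an LTTextContainer, so the isinstance check passes for a line sitting directly under the page, and the inner loop then tries to iterate an LTChar. One possible workaround is to stop assuming a fixed nesting depth and walk the tree recursively. The following is only a sketch, not part of pdfminer.six: iter_chars is a hypothetical helper, and it assumes that a character is any object exposing fontname and size attributes (as LTChar does) and that containers are plain iterables.

```python
from typing import Any, Iterator


def iter_chars(layout_obj: Any) -> Iterator[Any]:
    """Yield every character-like object below layout_obj, at any depth."""
    # Assumption: anything with `fontname` and `size` is a character
    # (this matches the attributes the original script reads from LTChar).
    if hasattr(layout_obj, "fontname") and hasattr(layout_obj, "size"):
        yield layout_obj
        return
    # Strings are iterable but are never layout containers; skip them.
    if isinstance(layout_obj, (str, bytes)):
        return
    try:
        children = iter(layout_obj)
    except TypeError:
        # Non-iterable leaf without font information; nothing to yield.
        return
    for child in children:
        yield from iter_chars(child)
```

With pdfminer.six installed, the body of the documentation script could then shrink to a loop over iter_chars(page_layout) for each page returned by extract_pages, printing fontname and size for each yielded object. Because the walk does not depend on depth, it should behave the same whether a line is wrapped in a text box or attached to the page directly.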
This started happening with version 20220319, possibly related to https://github.com/pdfminer/pdfminer.six/pull/659
Reproducible example:
import urllib.request
from io import BytesIO
import pdfminer.high_level
pdf_url = 'https://www.orimi.com/pdf-test.pdf'
pdfminer_page = list(
    pdfminer.high_level.extract_pages(BytesIO(urllib.request.urlopen(pdf_url).read()))
)[0]
text_boxes = [i for i in pdfminer_page if hasattr(i, "get_text")]
print(text_boxes)
Before version 20220319 it shows:
20211012:
[<LTTextBoxHorizontal(0) 197.400,660.468,200.736,672.468 ' \n'>,
<LTTextBoxHorizontal(1) 72.000,455.448,532.853,661.368 ' \nPDF Test File \n \nCongratulations, your computer is equipped with a PDF (Portable Document Format) \nreader! You should be able to view any of the PDF documents and forms available on \nour site. PDF forms are indicated by these icons: \n \nYukon Department of Education \nBox 2703 \nWhitehorse,Yukon \nCanada \nY1A 2C6 \n \nPlease visit our website at: http://www.education.gov.yk.ca/\n \n'>,
<LTTextBoxHorizontal(2) 348.900,579.588,372.962,591.588 ' or \n'>,
<LTTextBoxHorizontal(3) 384.900,579.588,398.284,591.588 '. \n'>]
20191107 (same result):
[<LTTextBoxHorizontal(0) 197.400,660.468,200.736,676.440 ' \n'>,
<LTTextBoxHorizontal(1) 72.000,455.448,532.853,665.340 ' \nPDF Test File \n \nCongratulations, your computer is equipped with a PDF (Portable Document Format) \nreader! You should be able to view any of the PDF documents and forms available on \nour site. PDF forms are indicated by these icons: \n \nYukon Department of Education \nBox 2703 \nWhitehorse,Yukon \nCanada \nY1A 2C6 \n \nPlease visit our website at: http://www.education.gov.yk.ca/\n \n'>,
<LTTextBoxHorizontal(2) 348.900,579.588,372.962,595.560 ' or \n'>,
<LTTextBoxHorizontal(3) 384.900,579.588,398.284,595.560 '. \n'>]
After (20220319):
[<LTTextBoxHorizontal(0) 72.000,635.568,148.627,647.568 'PDF Test File \n'>,
<LTTextBoxHorizontal(1) 72.000,579.588,532.853,619.968 'Congratulations, your computer is equipped with a PDF (Portable Document Format) \nreader! You should be able to view any of the PDF documents and forms available on \nour site. PDF forms are indicated by these icons: \n'>,
<LTTextBoxHorizontal(2) 348.900,579.588,372.962,591.588 ' or \n'>,
<LTTextBoxHorizontal(3) 384.900,579.588,398.284,591.588 '. \n'>,
<LTTextBoxHorizontal(4) 72.000,496.848,245.380,564.048 'Yukon Department of Education \nBox 2703 \nWhitehorse,Yukon \nCanada \nY1A 2C6 \n'>,
<LTTextBoxHorizontal(5) 72.000,469.248,389.460,481.248 'Please visit our website at: http://www.education.gov.yk.ca/\n'>,
<LTTextLineHorizontal 197.400,660.468,200.736,672.468 ' \n'>,
<LTTextLineHorizontal 72.000,649.368,75.336,661.368 ' \n'>,
<LTTextLineHorizontal 72.000,621.768,75.336,633.768 ' \n'>,
<LTTextLineHorizontal 72.000,565.848,75.336,577.848 ' \n'>,
<LTTextLineHorizontal 72.000,483.048,75.336,495.048 ' \n'>,
<LTTextLineHorizontal 72.000,455.448,82.061,467.448 ' \n'>]
20220524:
[<LTTextBoxHorizontal(0) 72.000,635.568,148.627,647.568 'PDF Test File \n'>,
<LTTextBoxHorizontal(1) 72.000,579.588,532.853,619.968 'Congratulations, your computer is equipped with a PDF (Portable Document Format) \nreader! You should be able to view any of the PDF documents and forms available on \nour site. PDF forms are indicated by these icons: \n'>,
<LTTextBoxHorizontal(2) 348.900,579.588,372.962,591.588 ' or \n'>,
<LTTextBoxHorizontal(3) 384.900,579.588,398.284,591.588 '. \n'>,
<LTTextBoxHorizontal(4) 72.000,496.848,245.380,564.048 'Yukon Department of Education \nBox 2703 \nWhitehorse,Yukon \nCanada \nY1A 2C6 \n'>,
<LTTextBoxHorizontal(5) 72.000,469.248,389.460,481.248 'Please visit our website at: http://www.education.gov.yk.ca/\n'>,
<LTTextLineHorizontal 197.400,660.468,200.736,672.468 ' \n'>,
<LTTextLineHorizontal 72.000,649.368,75.336,661.368 ' \n'>,
<LTTextLineHorizontal 72.000,621.768,75.336,633.768 ' \n'>,
<LTTextLineHorizontal 72.000,565.848,75.336,577.848 ' \n'>,
<LTTextLineHorizontal 72.000,483.048,75.336,495.048 ' \n'>,
<LTTextLineHorizontal 72.000,455.448,82.061,467.448 ' \n'>]
This was introduced by: 43c8fc8557528463c99598049b7005ae96ab8084
This happens because these text lines only contain white space. Previously, all text lines with a zero width or height were added directly under the page object. After the change, text lines with just white space are also added directly.
I guess it is preferable if the hierarchy is always the same: always LTPage -> LTTextBox -> LTTextLine -> LTChar.
The empty text lines on this line (https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/layout.py#L949) need to be wrapped in an LTTextBox.
I'm picking up this issue.
This commit fixes the issue, but I'm unsure if it's the ideal way.
I tested the code with @lifepillar's code and manually checked the hierarchy of the LT objects.
But I'm getting tests/test_layout.py:130: AssertionError and tests/test_layout.py:148: AssertionError while running nox, because the tests have a hardcoded assert len(textboxes) == 3 and are throwing AssertionError: assert 7 == 3.
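The manual hierarchy check mentioned above could probably be automated. The sketch below is not taken from the pdfminer.six test suite: leaf_paths and char_path_shapes are hypothetical helpers that identify objects purely by class name and treat anything non-iterable as a leaf. They collect the distinct chains of class names leading from the page down to each character, so a consistent hierarchy shows up as a single chain.

```python
from typing import Any, Iterator, Set, Tuple


def leaf_paths(obj: Any, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield the chain of class names from the root down to each leaf."""
    path = prefix + (type(obj).__name__,)
    # Strings are iterable but are not layout containers; report as leaves.
    if isinstance(obj, (str, bytes)):
        yield path
        return
    try:
        children = list(obj)
    except TypeError:
        yield path  # non-iterable object, e.g. a character
        return
    if not children:
        yield path  # an empty container is reported as a leaf too
        return
    for child in children:
        yield from leaf_paths(child, path)


def char_path_shapes(page: Any, char_name: str = "LTChar") -> Set[Tuple[str, ...]]:
    """Collect the distinct root-to-character paths found under page."""
    return {p for p in leaf_paths(page) if p[-1] == char_name}
```

If every character is reached through LTPage -> LTTextBoxHorizontal -> LTTextLineHorizontal -> LTChar, the returned set holds exactly one tuple. A second tuple such as ("LTPage", "LTTextLineHorizontal", "LTChar") would point at a line attached directly to the page. Characters inside other containers, for example figures, would add further chains, so a real check may need to allow for those.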
Update: I've removed the branch to avoid confusion. All I did was add these two lines to the code after textboxes = list(self.group_textlines(laparams, textlines)):
empties = list(self.group_textlines(laparams, empties))
textboxes.extend(empties)
@KunalGehlot Can you create a PR with that specific commit so that I can review and merge it?