pdfminer.six
AttributeError: 'PDFPageInterpreter' object has no attribute 'layout'
Hi,
Thank you for providing a beautiful library.
I am trying to extract the portion of text that falls under a given heading. For example, in the sample PDF file, if we select the heading ABSTRACT, the output should be the text from "The game" to "penalty area."
I am trying the code below, but I get this error: AttributeError: 'PDFPageInterpreter' object has no attribute 'layout'
Can someone please guide me on how to fix this error?
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from io import StringIO
def extract_text_under_heading(pdf_file, heading):
    output_string = StringIO()
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    text = ""
    next_heading = None
    found_heading = False
    with open(pdf_file, 'rb') as file, StringIO() as output:
        for page in PDFPage.get_pages(file, check_extractable=True):
            interpreter.process_page(page)
            for element in interpreter.layout:
                if isinstance(element, LTTextBox):
                    if heading in element.get_text() and not found_heading:
                        found_heading = True
                        text += element.get_text()
                    elif found_heading:
                        text += element.get_text()
                        for h in headings:
                            if h in element.get_text():
                                next_heading = h
                                break
                if next_heading:
                    break
    return text
Error:
AttributeError                            Traceback (most recent call last)
Cell In [11], line 1
----> 1 extract_text_under_heading(file, 'ABSTRACT')

Cell In [10], line 20, in extract_text_under_heading(pdf_file, heading)
     18         for page in PDFPage.get_pages(file, check_extractable=True):
     19             interpreter.process_page(page)
---> 20             for element in interpreter.layout:
     21                 if isinstance(element, LTTextBox):
     22                     if heading in element.get_text() and not found_heading:

AttributeError: 'PDFPageInterpreter' object has no attribute 'layout'
Thank you in advance. Sample_PDF_file.pdf
Hi @Laxmi530,
Right off the bat, I noticed some issues with your code. "interpreter.layout" is probably older syntax that is no longer in use; as the error says, PDFPageInterpreter has no such attribute. You can get the layout from extract_pages instead (from pdfminer.high_level import extract_pages).
I have modified the function above and ran it successfully on Sample_PDF_file.pdf. See the modified code below:
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.high_level import extract_pages
from io import StringIO
def extract_text_under_heading(pdf_file, heading):
    output_string = StringIO()
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    text = ""
    headings = ["At Glance", "ABSTRACT", "At glance 1"]
    next_heading = None
    found_heading = False
    for page_layout in extract_pages(pdf_file):
        for element in page_layout:
            if isinstance(element, LTTextBox):
                if heading in element.get_text() and not found_heading:
                    found_heading = True
                    text += element.get_text()
                elif found_heading:
                    text += element.get_text()
                    for h in headings:
                        if h in element.get_text():
                            next_heading = h
                            break
            if next_heading:
                break
    return text

if __name__ == "__main__":
    pdf_file = r"C:\projects\git-repos\Sample_PDF_file.pdf"
    text = extract_text_under_heading(pdf_file, "ABSTRACT")
    print(text)
Output :
ABSTRACT
The game of association football is played in accordance with the Laws of the Game, a set of rules that
has been in effect since 1863 and maintained by the International Football Association Board (IFAB)
since 1886.
The game is played with a football that is 68–70 cm (27–28 in) in circumference. The two teams
compete to get the ball into the other team's goal (between the posts and under the bar), thereby
scoring a goal. When the ball is in play, the players mainly use their feet, but may use any other part of
their body, except for their hands or arms, to control, strike, or pass the ball. Only the goalkeepers may
use their hands and arms, and only then within the penalty area.
At glance 1
Please refer to the latest documentation: https://pdfminersix.readthedocs.io/en/latest/tutorial/extract_pages.html
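If you would rather keep the low-level PDFPageInterpreter loop from the original question, the layout can be read from the device instead of the interpreter. The sketch below assumes a PDFPageAggregator device (in place of TextConverter) and its get_result() method; the helper name iter_text_boxes is made up for illustration.

```python
def iter_text_boxes(pdf_path):
    """Yield the text of every LTTextBox in the PDF, page by page."""
    # Imported here so the module still loads where pdfminer.six
    # is not installed; move these to the top of the file if you prefer.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    # PDFPageAggregator keeps the analysed layout tree for each page,
    # whereas TextConverter only writes text to an output stream.
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    with open(pdf_path, "rb") as fh:
        for page in PDFPage.get_pages(fh, check_extractable=True):
            interpreter.process_page(page)
            # The layout lives on the device, not on the interpreter.
            layout = device.get_result()
            for element in layout:
                if isinstance(element, LTTextBox):
                    yield element.get_text()
```

The heading-matching logic from the function above can then loop over iter_text_boxes(pdf_file) instead of interpreter.layout. For most cases extract_pages is simpler, since it sets up this same kind of pipeline for you.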
@vilabho Thanks for the response. Let me explain the whole concept.
Suppose you have a PDF file and you know only one heading name, and you want the text under that heading. In the sample PDF file, ABSTRACT is a heading name, so we need the text from "The game" to "penalty area." However, you don't know which heading comes after ABSTRACT. In your code, you explicitly listed the heading names.
So can you please help me extract the heading names from the PDF file?
Thank you.
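A PDF does not mark headings as such, so they usually have to be guessed from the layout. One possible heuristic, sketched below, treats a text box as a heading when it is a single short line that is bold or set larger than the body text. Everything here is an assumption to tune for your files: the helper names, the 0.5 pt size margin, the 8-word limit, the "bold" substring test on the font name, and the idea that each heading sits in its own text box.

```python
from collections import Counter


def body_font_size(blocks):
    """Return the font size that covers the most characters."""
    counts = Counter()
    for text, size, _bold in blocks:
        counts[round(size, 1)] += len(text)
    return counts.most_common(1)[0][0]


def looks_like_heading(text, size, bold, body_size, max_words=8):
    """Heuristic: one short line, bold or larger than the body text."""
    stripped = text.strip()
    one_line = "\n" not in stripped
    short = 0 < len(stripped.split()) <= max_words
    emphasised = bold or size > body_size + 0.5
    return one_line and short and emphasised


def text_under_heading(blocks, heading):
    """Collect text after `heading` until the next block that looks like a heading."""
    if not blocks:
        return ""
    body_size = body_font_size(blocks)
    collected = []
    found = False
    for text, size, bold in blocks:
        is_heading = looks_like_heading(text, size, bold, body_size)
        if not found:
            found = is_heading and heading in text
            continue
        if is_heading:
            break  # next heading reached; its name was never needed
        collected.append(text.strip())
    return "\n".join(collected)


def load_blocks(pdf_path):
    """Build (text, font_size, is_bold) tuples from a PDF with pdfminer.six."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextBox, LTTextLine

    blocks = []
    for page_layout in extract_pages(pdf_path):
        for box in page_layout:
            if not isinstance(box, LTTextBox):
                continue
            chars = [
                ch
                for line in box
                if isinstance(line, LTTextLine)
                for ch in line
                if isinstance(ch, LTChar)
            ]
            if not chars:
                continue
            size = max(ch.size for ch in chars)
            bold = all("bold" in ch.fontname.lower() for ch in chars)
            blocks.append((box.get_text(), size, bold))
    return blocks
```

Usage would be text_under_heading(load_blocks(pdf_file), "ABSTRACT"). If a heading in your PDF shares a text box with the paragraph below it, or is neither bold nor larger than the body text, this will miss it, and the same test would need to be applied per line instead of per box.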