pdfminer.six

AttributeError: 'PDFPageInterpreter' object has no attribute 'layout'

Open Laxmi530 opened this issue 2 years ago • 2 comments

Hi, thank you for providing a beautiful library. I am trying to extract the portion of text that sits under a given heading. For example, in the sample PDF file, if we select the heading ABSTRACT, the output should be the text from "The game" to "penalty area.". I am trying the code below, but I get the error AttributeError: 'PDFPageInterpreter' object has no attribute 'layout'. Can someone please guide me on how to fix this error?

from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from io import StringIO

def extract_text_under_heading(pdf_file, heading):
    output_string = StringIO()
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    text = ""
    next_heading = None
    found_heading = False

    with open(pdf_file, 'rb') as file, StringIO() as output:
        for page in PDFPage.get_pages(file, check_extractable=True):
            interpreter.process_page(page)
            for element in interpreter.layout:
                if isinstance(element, LTTextBox):
                    if heading in element.get_text() and not found_heading:
                        found_heading = True
                        text += element.get_text()
                    elif found_heading:
                        text += element.get_text()
                        for h in headings:
                            if h in element.get_text():
                                next_heading = h
                                break
                        if next_heading:
                            break
    return text

Error

AttributeError                            Traceback (most recent call last)
Cell In [11], line 1
----> 1 extract_text_under_heading(file, 'ABSTRACT')

Cell In [10], line 20, in extract_text_under_heading(pdf_file, heading)
     18 for page in PDFPage.get_pages(file, check_extractable=True):
     19     interpreter.process_page(page)
---> 20     for element in interpreter.layout:
     21         if isinstance(element, LTTextBox):
     22             if heading in element.get_text() and not found_heading:

AttributeError: 'PDFPageInterpreter' object has no attribute 'layout'

Thank you in advance. Sample_PDF_file.pdf

Laxmi530, Jan 29 '23 14:01

Hi @Laxmi530, right off the bat I noticed some issues with your code. "interpreter.layout" is probably older syntax that is no longer in use; as the traceback shows, PDFPageInterpreter exposes no layout attribute. You can get the layout objects with extract_pages, imported via from pdfminer.high_level import extract_pages.

I modified the function above and ran it successfully on Sample_PDF_file.pdf. See the modified code below:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox


def extract_text_under_heading(pdf_file, heading):
    # Headings that may end the section (hard-coded for the sample PDF).
    headings = ["At Glance", "ABSTRACT", "At glance 1"]
    text = ""
    found_heading = False

    # extract_pages yields one layout object per page.
    for page_layout in extract_pages(pdf_file):
        for element in page_layout:
            if not isinstance(element, LTTextBox):
                continue
            box_text = element.get_text()
            if not found_heading:
                # Start collecting at the first box that contains the heading.
                if heading in box_text:
                    found_heading = True
                    text += box_text
                continue
            text += box_text
            # Stop at the first later box that contains a known heading.
            # That box is kept in the output. Returning here (instead of
            # breaking out of the inner loop only) also stops later pages
            # from being appended.
            if any(h in box_text for h in headings):
                return text
    return text

if __name__ == "__main__":
    pdf_file = r"C:\projects\git-repos\Sample_PDF_file.pdf"
    text = extract_text_under_heading(pdf_file, "ABSTRACT")
    print(text)

Output:

ABSTRACT  
The game of association football is played in accordance with the Laws of the Game, a set of rules that 
has been in effect since 1863 and maintained by the International Football Association Board (IFAB) 
since 1886.  
The game is played with a football that is 68–70 cm (27–28 in) in circumference. The two teams 
compete to get the ball into the other team's goal (between the posts and under the bar), thereby 
scoring a goal. When the ball is in play, the players mainly use their feet, but may use any other part of 
their body, except for their hands or arms, to control, strike, or pass the ball. Only the goalkeepers may 
use their hands and arms, and only then within the penalty area. 
At glance 1

Please refer to the latest documentation: https://pdfminersix.readthedocs.io/en/latest/tutorial/extract_pages.html
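
For completeness, the layout can also be reached through the low-level API that the original code uses. This is a minimal sketch, assuming the goal is to keep PDFPageInterpreter: swap TextConverter for PDFPageAggregator and read each page's layout from the device with get_result() rather than from the interpreter. The function name iter_page_layouts is made up for illustration.

```python
def iter_page_layouts(pdf_path):
    """Yield one layout object per page using the low-level pdfminer.six API."""
    # Imported inside the function so that defining it does not require
    # pdfminer.six to be installed; move these to module level in real code.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    # The aggregator device collects the layout of the page being processed.
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    with open(pdf_path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            # The layout lives on the device, not on the interpreter.
            yield device.get_result()
```

Each yielded object can be iterated like page_layout in the function above, for example with for element in layout: followed by an isinstance(element, LTTextBox) check.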

vilabho, Feb 13 '23 21:02

@vilabho Thanks for the response. Let me explain the whole idea. Suppose you have a PDF file and know only one heading name, and you want the text under that heading. In the sample PDF file, ABSTRACT is a heading name, so we need the text from "The game" to "penalty area.", but we do not know which heading comes after ABSTRACT. In your code you explicitly listed the heading names. Can you please help me extract the heading names from the PDF file itself?

Thank you.

Laxmi530, Mar 17 '23 13:03
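
One possible approach to the follow-up question, offered as a sketch rather than a confirmed answer: pdfminer.six does not label headings, but each LTChar carries a font size and font name, so headings can be guessed from typography. The code below assumes that headings are set in a larger or bold font than the body text, which may not hold for every PDF and has not been checked against Sample_PDF_file.pdf. The names line_styles and section_under are hypothetical helpers, not part of the library.

```python
from collections import Counter


def line_styles(pdf_path):
    """Yield (text, font_size, is_bold) for every text line in the PDF."""
    # Imported lazily so section_under() below works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextBox, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for box in page_layout:
            if not isinstance(box, LTTextBox):
                continue
            for line in box:
                if not isinstance(line, LTTextLine):
                    continue
                chars = [c for c in line if isinstance(c, LTChar)]
                if not chars:
                    continue
                size = round(max(c.size for c in chars), 1)
                # Assumption: bold fonts have "bold" in their font name.
                bold = all("bold" in c.fontname.lower() for c in chars)
                yield line.get_text().strip(), size, bold


def section_under(lines, heading):
    """Return the text between `heading` and the next heading-like line."""
    lines = [entry for entry in lines if entry[0]]
    if not lines:
        return ""
    # Treat the most common font size as the body size (heuristic).
    body_size = Counter(size for _, size, _ in lines).most_common(1)[0][0]

    collected = []
    inside = False
    for text, size, bold in lines:
        if not inside:
            # Skip everything up to and including the requested heading.
            inside = heading in text
            continue
        if size > body_size or bold:
            # A larger or bold line is assumed to start the next section.
            break
        collected.append(text)
    return "\n".join(collected)
```

A call such as section_under(list(line_styles("Sample_PDF_file.pdf")), "ABSTRACT") would then return the lines under ABSTRACT without needing a hard-coded list of headings, as long as the typographic assumption holds for that file.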