Issue with extract_pages Function in pdfminer.six

Open · lihuaming07 opened this issue 10 months ago · 1 comment

Body:

Description

I am encountering an issue with the extract_pages function from the pdfminer.six library. While I am able to successfully retrieve text using the extract_text function, I am unable to get any LTTextBox instances using extract_pages. Interestingly, when iterating over the page elements, I can still obtain LTChar objects. This leads me to believe that the individual characters are being recognized, but for some reason, they are not being grouped into LTTextBox elements.
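
The observation above can be checked with a small diagnostic. This is a minimal sketch, not part of the reproduction code: the helper names `count_layout_types` and `summarize_pdf` are hypothetical, and it assumes only that layout containers are iterable while leaf objects such as LTChar are not. It tallies which layout classes appear on each page and how deeply they are nested, which should show whether the LTChar objects sit inside LTFigure containers.

```python
from collections import Counter


def count_layout_types(element, counts=None):
    """Recursively tally the class names of a layout element and its children."""
    if counts is None:
        counts = Counter()
    counts[type(element).__name__] += 1
    try:
        children = iter(element)
    except TypeError:
        # Leaf objects (for example LTChar) are not iterable, so stop here.
        return counts
    for child in children:
        count_layout_types(child, counts)
    return counts


def summarize_pdf(pdf_path):
    """Print one tally per page; requires pdfminer.six to be installed."""
    # Imported lazily so count_layout_types stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        print(page_number, dict(count_layout_types(page_layout)))
```

Calling `summarize_pdf('高中英语词汇3500.pdf')` would print one dictionary per page. A page that reports LTFigure and LTChar entries but no LTTextBoxHorizontal entries would match the behaviour described here.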

I have already tried setting all parameters of LAParams to 0, but the issue persists. Below is my code snippet that demonstrates the problem.

Code to Reproduce

from pdfminer.high_level import extract_pages, extract_text
from pdfminer.layout import LAParams, LTChar, LTFigure, LTTextBox, LTTextContainer


def extract_text_by_layout(pdf_path):
    laparams = LAParams(
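        line_overlap=0.5,  # overlap threshold for treating two characters as being on the same line
        char_margin=2.0,  # spacing threshold for grouping two characters into the same line
        word_margin=0.1,  # spacing threshold beyond which a space is inserted between words
        line_margin=0.5,  # spacing threshold for grouping two lines into the same paragraph
        boxes_flow=0.5,  # float from -1.0 to 1.0 weighting horizontal versus vertical position when ordering text boxes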
    )

    # extract_pages accepts LAParams as an argument
    for page_layout in extract_pages(pdf_path, laparams=laparams):
        for element in page_layout:
            # print(element.x0, element.y0, element.x1, element.y1)
            print(f'way1: {element}')  # if everything is an LTFigure, the laparams settings may be unsuitable
            if isinstance(element, LTTextContainer):
                text = element.get_text()
                print(text)
                # Also report the font name and size of each character
                for text_line in element:
                    if not isinstance(text_line, LTTextContainer):
                        continue  # skip anything that is not a text line
                    for character in text_line:
                        if isinstance(character, LTChar):
                            print(character.fontname)
                            print(character.size)


def extract_text_by_layout_2(pdf_path):
    laparams = LAParams()
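    # You can set some LAParams values, such as line spacing and character spacing,
    # to improve the layout analysis used for text extraction
    # laparams.line_overlap = 0.5
    # laparams.char_margin = 2.0
    # laparams.word_margin = 0.1
    # laparams.boxes_flow = 0.5

    # extract_pages accepts LAParams as an argument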
    for page_layout in extract_pages(pdf_path, laparams=laparams):
        for element in page_layout:
            # print(element.x0, element.y0, element.x1, element.y1)
            print(element)
            if isinstance(element, LTFigure):
                print("Found a figure.")
                extract_text_from_element(element)
            if isinstance(element, LTTextContainer):
                text = element.get_text()
                print(text)


def extract_text_from_element(element):
    if isinstance(element, LTTextBox):
        print(element.get_text())
    elif isinstance(element, LTFigure):
        print("Found a Child.")
        for child in element:
            extract_text_from_element(child)
    elif isinstance(element, LTChar):
        print(element.get_text())
    else:
        # print(f"Found a non-text container. {element}")
        pass


def extract_text_from_pdf(pdf_path):
    text = extract_text(pdf_path)
    return text


if __name__ == "__main__":
    # pdf_path = '无边框无水印_高中英语词汇3500.pdf'  # path to your PDF file
    pdf_path = '高中英语词汇3500.pdf'  # path to your PDF file
    extract_text_by_layout(pdf_path)
    print('\n' * 5)
    extract_text_by_layout_2(pdf_path)
    # Gave up: opened the file in Adobe Acrobat Pro DC, saved it as a docx file, and read that instead
    text = extract_text_from_pdf(pdf_path)
    print(text[:10])

PDF File

The PDF I am working with can be found here: 高中英语词汇3500.pdf

Attempts to Resolve

I have tried setting all LAParams to 0, as suggested in some forums, but this did not work. I made sure that the PDF is not encrypted and does not require a password.
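
One setting the attempts above do not cover is `all_texts`. As far as the pdfminer.six documentation describes it, layout analysis is skipped for text inside figures unless `LAParams(all_texts=True)` is set, so characters nested in LTFigure objects would stay ungrouped by default. The sketch below is untested against this PDF and assumes that the LTChar objects are nested inside LTFigure containers; the helper names `find_text_boxes` and `text_boxes_with_figures` are hypothetical.

```python
def find_text_boxes(element, box_types):
    """Yield every element in the layout tree that is an instance of box_types."""
    if isinstance(element, box_types):
        yield element
        return
    try:
        children = iter(element)
    except TypeError:
        # Leaf objects such as LTChar cannot be iterated.
        return
    for child in children:
        yield from find_text_boxes(child, box_types)


def text_boxes_with_figures(pdf_path):
    """Yield LTTextBox objects, including any grouped inside figures."""
    # Imported lazily so find_text_boxes stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextBox

    # Assumption: all_texts=True also runs layout analysis inside LTFigure.
    laparams = LAParams(all_texts=True)
    for page_layout in extract_pages(pdf_path, laparams=laparams):
        yield from find_text_boxes(page_layout, LTTextBox)
```

If `text_boxes_with_figures(pdf_path)` still yields nothing, the grouping problem would lie somewhere other than the figure handling.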

Environment

pdfminer.six version: 20231228
Python version: 3.11
Operating System: Windows 11

Additional Context

See the similar issue #867. Thank you in advance for your time and assistance with this issue. I am looking forward to your prompt response and am hopeful for a resolution.

lihuaming07, Mar 31 '24 03:03