pdfminer.six
Issue with extract_pages Function in pdfminer.six
Description
I am encountering an issue with the extract_pages function from the pdfminer.six library. While I am able to successfully retrieve text using the extract_text function, I am unable to get any LTTextBox instances using extract_pages. Interestingly, when iterating over the page elements, I can still obtain LTChar objects. This leads me to believe that the individual characters are being recognized, but for some reason, they are not being grouped into LTTextBox elements.
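To see where the LTChar objects actually sit in the layout tree, a small diagnostic along these lines may help. This is only a sketch: `count_layout_types` and `summarize_pdf` are hypothetical helper names, not part of pdfminer.six, and the walk assumes that layout containers are iterable while leaf objects such as LTChar are not.

```python
from collections import Counter


def count_layout_types(element, counts=None):
    """Recursively tally the class names of a layout element and its children."""
    if counts is None:
        counts = Counter()
    counts[type(element).__name__] += 1
    # Assumption: containers (LTPage, LTFigure, LTTextBox, ...) are iterable,
    # leaves (LTChar, LTAnno, ...) are not.
    try:
        children = iter(element)
    except TypeError:
        return counts
    for child in children:
        count_layout_types(child, counts)
    return counts


def summarize_pdf(pdf_path):
    """Return one Counter of layout class names per page (hypothetical helper)."""
    # Imported lazily so count_layout_types can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages

    return [count_layout_types(page) for page in extract_pages(pdf_path)]
```

If the tally for a page shows LTFigure and LTChar entries but no LTTextBox entries, that would suggest the characters live inside figures rather than directly on the page.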
I have already tried setting all parameters of LAParams to 0, but the issue persists. Below is my code snippet that demonstrates the problem.
Code to Reproduce
from pdfminer.high_level import extract_pages, extract_text
from pdfminer.layout import LAParams, LTChar, LTFigure, LTTextBox, LTTextContainer

def extract_text_by_layout(pdf_path):
    laparams = LAParams(
        line_overlap=0.5,  # overlap threshold for treating two characters as being on the same line
        char_margin=2.0,   # characters closer together than this are grouped into the same line
        word_margin=0.1,   # characters on a line farther apart than this are split into separate words
        line_margin=0.5,   # lines closer together than this are grouped into the same paragraph
        boxes_flow=0.5,    # float from -1.0 to 1.0 controlling how text boxes are ordered (horizontal vs. vertical position)
    )
    # extract_pages accepts LAParams as an argument
    for page_layout in extract_pages(pdf_path, laparams=laparams):
        for element in page_layout:
            # print(element.x0, element.y0, element.x1, element.y1)
            print(f'way1: {element}')  # if every element is an LTFigure, the laparams settings are presumably unsuitable
            if isinstance(element, LTTextContainer):
                text = element.get_text()
                print(text)
                # Walk the lines and characters to report font details.
                # (This loop used to sit in an unreachable duplicate elif branch.)
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            print(character.fontname)
                            print(character.size)

def extract_text_by_layout_2(pdf_path):
    laparams = LAParams()
    # Optionally tune LAParams (line spacing, character spacing, etc.) to improve layout analysis
    # laparams.line_overlap = 0.5
    # laparams.char_margin = 2.0
    # laparams.word_margin = 0.1
    # laparams.boxes_flow = 0.5
    # extract_pages accepts LAParams as an argument
    for page_layout in extract_pages(pdf_path, laparams=laparams):
        for element in page_layout:
            # print(element.x0, element.y0, element.x1, element.y1)
            print(element)
            if isinstance(element, LTFigure):
                print("Found a figure.")
                extract_text_from_element(element)
            if isinstance(element, LTTextContainer):
                text = element.get_text()
                print(text)

def extract_text_from_element(element):
    if isinstance(element, LTTextBox):
        print(element.get_text())
    elif isinstance(element, LTFigure):
        print("Found a Child.")
        for child in element:
            extract_text_from_element(child)
    elif isinstance(element, LTChar):
        print(element.get_text())
    else:
        # print(f"Found a non-text container. {element}")
        pass

def extract_text_from_pdf(pdf_path):
    text = extract_text(pdf_path)
    return text

if __name__ == "__main__":
    # pdf_path = '无边框无水印_高中英语词汇3500.pdf'  # path to your PDF file
    pdf_path = '高中英语词汇3500.pdf'  # path to your PDF file
    extract_text_by_layout(pdf_path)
    print('\n' * 5)
    extract_text_by_layout_2(pdf_path)
    # Gave up: opened the file in Adobe Acrobat Pro DC, saved it as a docx file, and read that instead
    text = extract_text_from_pdf(pdf_path)
    print(text[:10])
PDF File
The PDF I am working with can be found here: 高中英语词汇3500.pdf
Attempts to Resolve
I tried setting every LAParams parameter to 0, as suggested in some forums, but the output did not change. I also confirmed that the PDF is not encrypted and does not require a password.
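One further setting that may be worth ruling out is `all_texts`. LAParams accepts an `all_texts` flag (default False) that controls whether layout analysis is also performed on text inside figures; with the default, characters inside an LTFigure are left ungrouped. Whether this applies to this particular PDF is an assumption, not something confirmed here. The sketch below enables the flag and collects text boxes at any depth; `iter_instances` and `text_boxes_including_figures` are hypothetical helper names, not pdfminer.six functions.

```python
def iter_instances(element, wanted):
    """Yield every object of type `wanted` at or below `element`."""
    if isinstance(element, wanted):
        yield element
        return  # do not descend into a match (e.g. the lines inside a text box)
    # Assumption: containers such as LTPage and LTFigure are iterable, leaves are not.
    try:
        children = iter(element)
    except TypeError:
        return
    for child in children:
        yield from iter_instances(child, wanted)


def text_boxes_including_figures(pdf_path):
    """Yield LTTextBox objects, including those nested in figures (hypothetical helper)."""
    # Imported lazily so iter_instances can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextBox

    # all_texts=True asks pdfminer.six to run layout analysis inside LTFigure too.
    laparams = LAParams(all_texts=True)
    for page_layout in extract_pages(pdf_path, laparams=laparams):
        yield from iter_instances(page_layout, LTTextBox)
```

A call such as `for box in text_boxes_including_figures(pdf_path): print(box.get_text())` would then print any grouped text, if the figures in the file do contain analyzable characters.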
Environment
pdfminer.six version: 20231228
Python version: 3.11
Operating System: Windows 11
Additional Context
See the similar issue #867. Thank you in advance for your time and assistance with this issue. I am looking forward to your prompt response and am hopeful for a resolution.