pdfplumber
I have a new problem
When I read a PDF file, I found that the text returned by page.extract_text() is badly misordered. For example, the PDF page looks like this:
but the text I get is the following:
第一章 组织激励 本章与 年教材相比 无实质性变化 2021 , 。 年份 单项选择题 多项选择题 案例分析题 合计 题 分 题 分 题 分 题 分 2019 4 4 1 2 4 8 9 14 题 分 题 分 题 分 2020 3 3 1 2 ——— 4 5 题 分 题 分 题 分 2021 4 4 ——— 4 8 8 12 说明 上表数据是我们向参加当年考试的考生了解的 较为完整的数据统计 : 、 。 第一节 需要、 动机与激励
Question 1: The text in the red boxes is missing. Question 2: Some numbers and punctuation marks are misplaced and moved to the next line.
I started with pdfplumber version 0.6.0, then suspected a version problem and upgraded to 0.7.1, but the extraction results were the same. What causes this problem? Is it the PDF file or the package?
Thanks
Hi @Godlikemandyy, I appreciate your interest in the library.
- Can you please confirm whether the text in the red boxes that you say is missing is copyable? It may be an image rather than text.
- Have you tried using `x_tolerance` and `y_tolerance` to resolve the text extraction issues?
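Both checks above can be sketched in a few lines. This is only a rough illustration: `text_and_image_counts` and `inspect_pdf` are hypothetical helper names, and the tolerance value `3` is the library default rather than a recommendation. A page with few character objects but one or more image objects suggests that the "missing" text is part of an image.

```python
def text_and_image_counts(page):
    """Return (number of character objects, number of image objects).

    `page` is expected to be a pdfplumber Page, which exposes
    `.chars` and `.images` as lists.
    """
    return len(page.chars), len(page.images)


def inspect_pdf(path, page_number=0):
    """Print the counts and the extracted text for one page (hypothetical helper)."""
    # Imported lazily so the helper above can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        n_chars, n_images = text_and_image_counts(page)
        print(f"chars: {n_chars}, images: {n_images}")
        # The tolerances are passed directly to extract_text().
        print(page.extract_text(x_tolerance=3, y_tolerance=3))
```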
@samkit-jain Thanks for your reply. 1. You are right: the text in the red boxes is not copyable, so that part doesn't matter. 2. I didn't use them. How can I determine the x_tolerance and y_tolerance values that ensure the text is extracted in the right order?
- Thanks for checking. That's the reason pdfplumber missed reading that text.
- I usually do it by finding the bounding boxes of the characters that weren't properly aligned and then calculating the tolerance from their X and Y values. If the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance`, they are put together; otherwise a space is added. A newline is added where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.
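The rule described above can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not pdfplumber's actual implementation: `adjacent_gaps` and `join_chars` are hypothetical helpers, they assume the characters are already in reading order, and the sample coordinates are made up. The dictionaries use the same `text`, `x0`, `x1` and `doctop` keys as the entries of `page.chars`.

```python
def adjacent_gaps(chars):
    """Return (horizontal_gap, vertical_gap) for each pair of neighbouring chars."""
    gaps = []
    for prev, cur in zip(chars, chars[1:]):
        dx = cur["x0"] - prev["x1"]               # space between the two boxes
        dy = abs(cur["doctop"] - prev["doctop"])  # vertical offset between them
        gaps.append((dx, dy))
    return gaps


def join_chars(chars, x_tolerance=3, y_tolerance=3):
    """Simplified version of the spacing rule; assumes chars are in reading order."""
    if not chars:
        return ""
    out = [chars[0]["text"]]
    for (dx, dy), cur in zip(adjacent_gaps(chars), chars[1:]):
        if dy > y_tolerance:
            out.append("\n")   # vertical offset too large: start a new line
        elif dx > x_tolerance:
            out.append(" ")    # horizontal gap too large: insert a space
        out.append(cur["text"])
    return "".join(out)


# Made-up sample: the digit sits 4 units lower than its neighbours.
sample = [
    {"text": "A", "x0": 10.0, "x1": 15.0, "doctop": 100},
    {"text": "1", "x0": 15.5, "x1": 20.0, "doctop": 104},
    {"text": "B", "x0": 20.5, "x1": 25.0, "doctop": 100},
]

print(adjacent_gaps(sample))                    # [(0.5, 4), (0.5, 4)]
print(repr(join_chars(sample, y_tolerance=3)))  # 'A\n1\nB'
print(repr(join_chars(sample, y_tolerance=5)))  # 'A1B'
```

On a real page, running `adjacent_gaps(page.chars)` over the misplaced characters shows how large their offsets are; a `y_tolerance` slightly above the largest vertical offset within a line is a reasonable starting value to pass to `page.extract_text()`.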
Closing, as this issue seems to have been resolved.