I have a new problem

Open Godlikemandyy opened this issue 2 years ago • 4 comments

When I read a PDF file, I found that the text returned by page.extract_text() is badly misplaced. For example, the PDF page looks like this: (image)

But the text I get is the following: 第一章 组织激励 本章与 年教材相比 无实质性变化 2021 , 。 年份 单项选择题 多项选择题 案例分析题 合计 题 分 题 分 题 分 题 分 2019 4 4 1 2 4 8 9 14 题 分 题 分 题 分 2020 3 3 1 2 ——— 4 5 题 分 题 分 题 分 2021 4 4 ——— 4 8 8 12 说明 上表数据是我们向参加当年考试的考生了解的 较为完整的数据统计 : 、 。 第一节 需要、 动机与激励

Question 1: The text in the red boxes is missing. Question 2: Some numbers and punctuation marks were misplaced and moved to the next line.

I started with pdfplumber version 0.6.0, then thought the version might be the problem and upgraded to 0.7.1, but the extraction results were the same. I want to know what causes this problem. Is it the PDF file or the package?

Thanks

Godlikemandyy avatar Jul 04 '22 10:07 Godlikemandyy

Let me resubmit the extracted text: 第一章 组织激励 本章与 年教材相比 无实质性变化 2021 , 。 年份 单项选择题 多项选择题 案例分析题 合计 题 分 题 分 题 分 题 分 2019 4 4 1 2 4 8 9 14 题 分 题 分 题 分 2020 3 3 1 2 ——— 4 5 题 分 题 分 题 分 2021 4 4 ——— 4 8 8 12 说明 上表数据是我们向参加当年考试的考生了解的 较为完整的数据统计 : 、 。 第一节 需要、 动机与激励

Godlikemandyy avatar Jul 04 '22 10:07 Godlikemandyy

Hi @Godlikemandyy, I appreciate your interest in the library.

  1. Can you please confirm whether the text in the red boxes that you say is missing is copyable? It may be an image rather than text.
  2. Have you tried using x_tolerance and y_tolerance to resolve the text extraction issues?

samkit-jain avatar Jul 04 '22 12:07 samkit-jain

@samkit-jain Thanks for your reply.

  1. You are right: the text in the red boxes is not copyable. That part doesn't matter.
  2. I haven't used them. How do I choose x_tolerance and y_tolerance so that the text is extracted in the right order?

Godlikemandyy avatar Jul 05 '22 02:07 Godlikemandyy

  1. Thanks for checking. That's the reason pdfplumber did not read that text.
  2. I usually do it by finding the bounding boxes of the characters that weren't properly aligned and then calculating the tolerances from their X and Y values. If the difference between the x1 of one character and the x0 of the next is less than or equal to x_tolerance, they are put together; otherwise a space is added. Characters whose doctop values differ by no more than y_tolerance are kept on the same line; a newline is added where the difference is greater than y_tolerance.
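
The rule described above can be sketched in plain Python. This is only an illustration, not pdfplumber's own implementation: `char_gaps`, `join_chars` and the sample coordinates are made up for this example, and it walks the characters in the order given rather than sorting and grouping them into lines first. The dict keys ("text", "x0", "x1", "doctop") match those found on the entries of page.chars.

```python
def char_gaps(chars):
    """Yield (prev_text, next_text, x_gap, y_gap) for consecutive characters.

    `chars` is a list of dicts with the keys "text", "x0", "x1" and "doctop".
    """
    for prev, nxt in zip(chars, chars[1:]):
        x_gap = nxt["x0"] - prev["x1"]               # horizontal gap
        y_gap = abs(nxt["doctop"] - prev["doctop"])  # vertical offset
        yield prev["text"], nxt["text"], x_gap, y_gap


def join_chars(chars, x_tolerance=3, y_tolerance=3):
    """Simplified illustration of the tolerance rule (hypothetical helper)."""
    if not chars:
        return ""
    out = [chars[0]["text"]]
    for _, text, x_gap, y_gap in char_gaps(chars):
        if y_gap > y_tolerance:
            out.append("\n")  # too far apart vertically: start a new line
        elif x_gap > x_tolerance:
            out.append(" ")   # same line, but a wide horizontal gap
        out.append(text)
    return "".join(out)


# Made-up characters: the digits sit 4 units lower than the text around them.
sample = [
    {"text": "与", "x0": 10, "x1": 20, "doctop": 100},
    {"text": "2", "x0": 21, "x1": 26, "doctop": 104},
    {"text": "1", "x0": 26, "x1": 31, "doctop": 104},
    {"text": "年", "x0": 32, "x1": 42, "doctop": 100},
]

# The largest vertical offset tells us how big y_tolerance needs to be.
max_y_gap = max(y_gap for _, _, _, y_gap in char_gaps(sample))

split_text = join_chars(sample, y_tolerance=3)   # digits are split off
joined_text = join_chars(sample, y_tolerance=5)  # digits stay in the line
```

On a real page, the same measurement can be made on page.chars, and the chosen values passed as page.extract_text(x_tolerance=..., y_tolerance=...). Raising y_tolerance too far risks merging lines that really are separate, so it is worth picking the smallest value that covers the measured offsets.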

samkit-jain avatar Jul 11 '22 09:07 samkit-jain

Closing, as this issue seems to have been resolved.

jsvine avatar Feb 13 '23 23:02 jsvine