pdfplumber I have a new problem

Open Godlikemandyy opened this issue 2 years ago • 4 comments

When I read a PDF file, I found that the text returned by page.extract_text() is badly misplaced. For example, the PDF page looks like this:

but the extracted text is as follows: 第一章　组织激励本章与年教材相比无实质性变化 2021 , 。年份单项选择题多项选择题案例分析题合计题分题分题分题分 2019 4 4 1 2 4 8 9 14 题分题分题分 2020 3 3 1 2 ——— 4 5 题分题分题分 2021 4 4 ——— 4 8 8 12 说明上表数据是我们向参加当年考试的考生了解的较为完整的数据统计 : 、。第一节需要、动机与激励

Question 1: The text in the red boxes is missing. Question 2: Some numbers and punctuation marks were misplaced and moved to the next line.

I started with pdfplumber version 0.6.0. Thinking the version might be the problem, I upgraded to 0.7.1, but the extraction results were the same. What causes this problem? Is it the PDF file or the package?

Thanks

Jul 04 '22 10:07 Godlikemandyy

Let me resubmit the extracted text：第一章　组织激励本章与年教材相比无实质性变化 2021 , 。年份单项选择题多项选择题案例分析题合计题分题分题分题分 2019 4 4 1 2 4 8 9 14 题分题分题分 2020 3 3 1 2 ——— 4 5 题分题分题分 2021 4 4 ——— 4 8 8 12 说明上表数据是我们向参加当年考试的考生了解的较为完整的数据统计 : 、。第一节需要、动机与激励

Jul 04 '22 10:07 Godlikemandyy

Hi @Godlikemandyy, I appreciate your interest in the library.

Can you please confirm whether the text in the red boxes that you say is missing is copyable? It may be an image rather than text.
Have you tried using x_tolerance and y_tolerance to resolve the text extraction issues?
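
The copyable check can also be approximated from code by counting the characters and images that fall inside the region in question. This is a minimal sketch, assuming objects shaped like the dicts in pdfplumber's `page.chars` and `page.images` (with `x0`, `x1`, `top` and `bottom` keys). The helper name `count_in_region`, the sample data and the box coordinates are made up for illustration.

```python
def count_in_region(objects, bbox):
    """Count objects whose bounding box overlaps bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return sum(
        1
        for obj in objects
        if obj["x0"] < x1 and obj["x1"] > x0
        and obj["top"] < bottom and obj["bottom"] > top
    )


# Made-up sample data standing in for page.chars and page.images.
sample_chars = [{"text": "A", "x0": 10, "x1": 18, "top": 200, "bottom": 212}]
sample_images = [{"x0": 50, "x1": 300, "top": 90, "bottom": 150}]
red_box = (60, 100, 280, 140)  # hypothetical region of interest

print("chars in box:", count_in_region(sample_chars, red_box))
print("images in box:", count_in_region(sample_images, red_box))
```

With a real file, the two lists would come from a page opened with `pdfplumber.open(...)`. No characters but at least one image inside the box suggests the text is part of an image, which extract_text() cannot read.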

Jul 04 '22 12:07 samkit-jain

@samkit-jain Thanks for your reply. 1. You are right: the text in the red boxes is not copyable, so that part doesn't matter. 2. I didn't use them. How can I determine x_tolerance and y_tolerance values that ensure the text is extracted in the correct order?

Jul 05 '22 02:07 Godlikemandyy

Thanks for checking. That is why pdfplumber could not read that text.
I usually find the bounding boxes of the characters that are out of alignment and calculate the tolerances from their X and Y values. If the difference between the x1 of one character and the x0 of the next is less than or equal to x_tolerance, the characters are joined; otherwise a space is added. A newline is added where the difference between the doctop of one character and the doctop of the next is greater than y_tolerance.
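
To make the tolerance calculation concrete, here is a rough sketch in plain Python. It assumes a space is inserted when the gap between one character's x1 and the next character's x0 exceeds x_tolerance, and a newline when the doctop difference exceeds y_tolerance. It is a simplified illustration, not pdfplumber's actual implementation. The function names `gaps` and `join_chars` and the sample characters are made up; the dicts only mimic the `text`, `x0`, `x1` and `doctop` keys of `page.chars`.

```python
def gaps(chars):
    """Return (horizontal_gaps, vertical_gaps) between consecutive characters."""
    pairs = list(zip(chars, chars[1:]))
    horizontal = [b["x0"] - a["x1"] for a, b in pairs]
    vertical = [abs(b["doctop"] - a["doctop"]) for a, b in pairs]
    return horizontal, vertical


def join_chars(chars, x_tolerance=3, y_tolerance=3):
    """Join characters using the simplified tolerance rule assumed above."""
    if not chars:
        return ""
    out = [chars[0]["text"]]
    for prev, cur in zip(chars, chars[1:]):
        if abs(cur["doctop"] - prev["doctop"]) > y_tolerance:
            out.append("\n")   # vertical jump larger than the tolerance
        elif cur["x0"] - prev["x1"] > x_tolerance:
            out.append(" ")    # horizontal gap larger than the tolerance
        out.append(cur["text"])
    return "".join(out)


# Made-up characters: two digits followed by a CJK character whose top
# edge sits a few points higher than the digits.
chars = [
    {"text": "2", "x0": 10.0, "x1": 15.0, "doctop": 100.0},
    {"text": "0", "x0": 15.5, "x1": 20.5, "doctop": 101.5},
    {"text": "年", "x0": 21.0, "x1": 31.0, "doctop": 98.0},
]

horizontal, vertical = gaps(chars)
print(horizontal, vertical)                    # inspect the gaps to pick tolerances
print(repr(join_chars(chars, y_tolerance=3)))  # the last character is split off
print(repr(join_chars(chars, y_tolerance=4)))  # all three stay on one line
```

Once the largest gap among characters that belong together is known, a value slightly above it can be passed to the real call, for example `page.extract_text(x_tolerance=..., y_tolerance=...)`.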

Jul 11 '22 09:07 samkit-jain

Closing, as this issue seems to have been resolved.

Feb 13 '23 23:02 jsvine
