"extract_text" doesn't output the same transformation matrix in version 3.17 as in 3.16.

Open ghbm-itk opened this issue 6 months ago • 9 comments

I'm trying to extract text from a PDF together with the position of the text. With pypdf 3.16 I get the expected result, but with 3.17 I don't.

Environment

Windows-10-10.0.19045-SP0
pypdf==3.16.0, crypt_provider=('cryptography', '41.0.3'), PIL=9.5.0
AND
pypdf==3.17.3, crypt_provider=('cryptography', '41.0.7'), PIL=9.5.0

Code + PDF

This is a minimal, complete example that shows the issue:

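import pypdf

file_path = "list.pdf"
reader = pypdf.PdfReader(file_path)

# Collected (cm, tm, text) tuples for the label of interest.
text_parts = []

def visitor(text, cm, tm, fd, fs):
    # cm: current transformation matrix, tm: text matrix,
    # fd: font dictionary, fs: font size.
    if text.strip() == "Flyttesagsnr.:":
        text_parts.append((cm, tm, text))

reader.pages[0].extract_text(visitor_text=visitor)

print(text_parts)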

Unfortunately, I can't share the PDF since it's confidential. I haven't been able to declassify the document while keeping the bug reproducible. I know this might make the bug hard to replicate.

Results

In version 3.17 I get:

[([0.75, 0.0, 0.0, -0.75, 0.0, 841.68], [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], ' Flyttesagsnr.:')]

In version 3.16 I get:

[([0.75, 0.0, 0.0, -0.75, 0.0, 841.68], [1.0, 0.0, 0.0, -1.0, 448.313, 352.05], ' Flyttesagsnr.:')]

As you can see, tm[4] and tm[5] are both 0 in version 3.17, which is definitely wrong.

ghbm-itk avatar Dec 20 '23 10:12 ghbm-itk

If you have a look at the changelog, you will see that there have been some changes/improvements to the text extraction in the meantime. This is probably related to those changes and is most likely either intended or the fix for a previous bug.

stefan6419846 avatar Dec 20 '23 10:12 stefan6419846

But 3.17 outputs a wrong answer, whereas 3.16 outputs the correct one. This seems like a new bug.

ghbm-itk avatar Dec 20 '23 10:12 ghbm-itk

Are you able to pinpoint this to one of the versions in-between to further see which change actually introduced this?

stefan6419846 avatar Dec 20 '23 15:12 stefan6419846

To be more consistent, you should use the CM matrix to get the absolute position regardless of which transformation is applied, and not TM, which should be considered an intermediate matrix.

pubpub-zz avatar Dec 20 '23 19:12 pubpub-zz

Are you able to pinpoint this to one of the versions in-between to further see which change actually introduced this?

I will try this when I have some time.

To be more consistent, you should use the CM matrix to get the absolute position regardless of which transformation is applied, and not TM, which should be considered an intermediate matrix.

I don't think this is true. The actual transformation matrix is a combination of cm and tm as far as I understand. At least for the PDF I was reading here the cm was the same for all text on the page, but the tm wasn't.
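As a rough illustration of how the two matrices combine, the sketch below multiplies tm by cm using the PDF convention of six-element matrices [a, b, c, d, e, f] applied to row vectors, so the text-space origin ends up at the translation part of tm × cm. The helper names `mult` and `text_origin` are made up for this example and are not part of pypdf's public API; the input values are the ones reported for version 3.16 above.

```python
def mult(m, n):
    """Return m x n for PDF matrices given as [a, b, c, d, e, f]."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]


def text_origin(cm, tm):
    """Position of the text-space origin after applying tm, then cm."""
    combined = mult(tm, cm)
    return combined[4], combined[5]


# Values reported for pypdf 3.16 in this issue.
cm = [0.75, 0.0, 0.0, -0.75, 0.0, 841.68]
tm = [1.0, 0.0, 0.0, -1.0, 448.313, 352.05]

x, y = text_origin(cm, tm)
print(x, y)
```

With the 3.16 values this gives roughly (336.23, 577.64). With the identity tm reported by 3.17, the same computation gives (0.0, 841.68), and that would be identical for every piece of text sharing this cm, which is why cm alone cannot locate the text in this PDF.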

ghbm-itk avatar Dec 21 '23 06:12 ghbm-itk

@stefan6419846 I tested the code snippet in different versions with the following results:

3.16.0: Correct
3.16.1: Correct
3.16.2: Correct
3.16.3: Wrong
3.17.3: Wrong

I suspect the change happened with https://github.com/py-pdf/pypdf/pull/2206

ghbm-itk avatar Dec 21 '23 08:12 ghbm-itk

I don't think this is true. The actual transformation matrix is a combination of cm and tm as far as I understand. At least for the PDF I was reading here the cm was the same for all text on the page, but the tm wasn't.

Oops, you are right. I had to keep the existing definitions, even though they are more complex to use.

I suspect the change happened with #2206

The change was made because the TM was not captured at the beginning of the line. Would you be willing to share the file privately by emailing it to @MartinThoma?

pubpub-zz avatar Dec 21 '23 08:12 pubpub-zz

I'm sorry, but it would be illegal for me to share the document with anyone outside my org. Is there a good way to remove all other text from the PDF without affecting the "Flyttesagsnr.:" text?

Whenever I try to edit the PDF, the matrices change completely.

ghbm-itk avatar Dec 21 '23 08:12 ghbm-itk

In general, there is no easy/general purpose approach to do this as far as I know. A possible way would be to manually mess with the internal page source, but this requires some deeper understanding of the PDF format.

stefan6419846 avatar Dec 24 '23 20:12 stefan6419846