pypdf: `PageObject.extract_text`'s `visitor_text` callback reports a wrong matrix for some text nodes

Open LukeSerne opened this issue 3 months ago • 2 comments

While trying to extract lemmas from this page, I found that some text "nodes" (I'm not sure of the technical term, so I'll call them nodes in this issue) are passed to `visitor_text` with seemingly wrong matrix values.

Environment

$ python -m platform
Linux-6.5.0-21-generic-x86_64-with-glibc2.35
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.1.0, crypt_provider=('cryptography', '3.4.8'), PIL=9.0.1

Code + PDF

This is a minimal, complete example that shows the issue. Observe (using a PDF reader) that the nodes ZURRA˓A, KHIRBE and T EL appear next to each other. Then save the script below (for example as example.py) and run it, passing the path to the attached PDF as the first argument.

import pypdf
import sys

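def main():
    reader = pypdf.PdfReader(sys.argv[1], strict=True)
    page = reader.pages[0]

    # pypdf calls the visitor with (text, current transformation matrix,
    # text matrix, font dictionary, font size); `matrix` is the text matrix.
    def text_visitor(text, transform, matrix, font_dict, font_size):
        if "T EL" in text or "ZURRA˓A, KHIRBE" in text:
            print(f"{text!r} has matrix {matrix}")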

    page.extract_text(visitor_text=text_visitor)

if __name__ == "__main__":
    main()

Observe that the output is:

$ python example.py ./zurra_page.pdf 
'ZURRA˓A, KHIRBE' has matrix [1.0, 0.0, 0.0, 1.0, 50.4, 687.12]
' T EL' has matrix [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
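
To see how many nodes on a page are affected, a visitor could record every node and flag those reported at the origin. This is only a sketch: `make_collector` and `suspicious` are hypothetical helper names, not part of pypdf, and the two calls at the bottom simply replay the values printed above rather than reading a PDF.

```python
def make_collector():
    """Return (visitor, nodes); the visitor appends (text, x, y) per call."""
    nodes = []

    def visitor(text, transform, matrix, font_dict, font_size):
        # The last two entries of a 6-element PDF matrix are the translation.
        nodes.append((text, matrix[4], matrix[5]))

    return visitor, nodes


def suspicious(nodes):
    """Return nodes with non-blank text that are reported at exactly (0, 0)."""
    return [n for n in nodes if n[0].strip() and n[1] == 0 and n[2] == 0]


# Replay the two visitor calls from the output above (no PDF needed).
visitor, nodes = make_collector()
visitor("ZURRA˓A, KHIRBE", None, [1.0, 0.0, 0.0, 1.0, 50.4, 687.12], None, None)
visitor(" T EL", None, [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], None, None)
print(suspicious(nodes))
```

With a real file, `visitor` would be passed as `visitor_text=visitor` to `page.extract_text`, and `suspicious(nodes)` inspected afterwards.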

I expected the last two elements of the matrix for the T EL node to be the node's x and y position (which pdfbox shows to be 177.92 and 687.12 respectively). I also noticed that pdfbox seems to indicate that the text in the node is 'T EL', but pypdf reports ' T EL' (note the leading space). Is pypdf mistakenly adding a leading space?
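
For reference, this is roughly how I would turn the two matrices into a page position. It is a sketch under assumptions: that the callback's second and third arguments are the current transformation matrix and the text matrix, and that both use the usual PDF `[a, b, c, d, e, f]` layout. `text_origin` is a hypothetical helper, not a pypdf function.

```python
def text_origin(cm, tm):
    """Map the text-space origin through tm, then through cm.

    Both matrices use the PDF layout [a, b, c, d, e, f], where a point
    (x, y) maps to (a*x + c*y + e, b*x + d*y + f).
    """
    # The text-space origin (0, 0) lands at the text matrix translation.
    x, y = tm[4], tm[5]
    # Apply the current transformation matrix to reach user space.
    return (cm[0] * x + cm[2] * y + cm[4],
            cm[1] * x + cm[3] * y + cm[5])


identity = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

# The matrix I expected for the T EL node, using the pdfbox coordinates.
expected_tm = [1.0, 0.0, 0.0, 1.0, 177.92, 687.12]
print(text_origin(identity, expected_tm))

# The matrix pypdf actually reported for that node.
print(text_origin(identity, identity))
```

With an identity transformation matrix the result is just the last two entries of the text matrix, which is why the all-zero translation reported for T EL looks wrong.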

Files

The sample PDF used with this is a page from a PDF version of the Anchor Bible Dictionary: zurra_page.pdf

pdfbox's debugger, opened on this page, clearly shows the coordinates of the T EL node.

Traceback

There is no exception raised, so there also is no traceback.

Mar 10 '24 17:03 LukeSerne
