pypdf /uXXXXX instead of a single character in extracted text for some pdfs

Open equaeghe opened this issue 8 months ago • 3 comments

I am trying to get a reasonably reliable estimate of the number of visual (non-whitespace, non-metadata) characters in PDF files. For this, I use pypdf's extract_text method.

I stumbled across a situation where visually identical text gives rise to different character counts. Namely, I have an original LaTeX-produced PDF and a derived version of it that was processed by some Adobe software. After investigating, it turns out that in the derived version, some characters from the original are replaced by /uXXXXX strings. This occurs mainly for math symbols. For example, where the original has 𝛼, the derived version has the string /u1D6FC (and indeed U+1D6FC is the mathematical italic small alpha in Unicode).
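
To make the discrepancy concrete, here is a minimal sketch of the count, assuming "visual character" simply means any non-whitespace character in the extracted text. The helper name `count_visible_chars` is made up for illustration and is not part of pypdf.

```python
def count_visible_chars(text: str) -> int:
    """Count the characters in `text` that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


# The same visible fragment, as extracted from each of the two files.
from_original = "with\U0001D6FC=1,"   # contains the single character U+1D6FC
from_derived = "with/u1D6FC=1,"       # contains the 7-character string /u1D6FC

print(count_visible_chars(from_original))  # 8
print(count_visible_chars(from_derived))   # 14
```

Each affected symbol therefore inflates the count by six characters in the derived file.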

I assume the difference is due to how the Unicode character is encoded in the derived file. Since I would like to use pypdf to get a reasonably reliable estimate of the number of visual characters, I think the correct thing for pypdf to do in this case would be to interpret /u1D6FC, at the appropriate point in its text extraction pipeline, as 𝛼, and similarly for all other such Unicode characters.
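
Until something like that exists in pypdf, a post-processing workaround on the extracted string might look like the sketch below. It is not how pypdf works internally, and it rests on an assumption taken only from the examples in this report: every escape is "/u" followed by exactly five uppercase hex digits in the Mathematical Alphanumeric Symbols block (U+1D400 to U+1D7FF). The name `decode_math_escapes` is hypothetical.

```python
import re

# Assumption: only escapes of the form /u1D4xx .. /u1D7xx occur. The fixed
# width matters, because "/u1D4341" must be read as U+1D434 followed by "1".
_ESCAPE = re.compile(r"/u(1D[4-7][0-9A-F]{2})")


def decode_math_escapes(text: str) -> str:
    """Replace each /u1Dxxx escape with the code point it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(decode_math_escapes("with/u1D6FC=1,"))
print(decode_math_escapes("/u1D4341/u1D4342Y"))
```

A literal "/u1D6FC" that really is part of the page text would be converted too, so this is only safe when such strings are not expected in the document.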

Environment

Which environment were you using when you encountered the problem?

```
$ python -m platform
Linux-6.1.57-gentoo-a-x86_64-AMD_Ryzen_7_PRO_4750U_with_Radeon_Graphics-with-glibc2.37

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.5, crypt_provider=('cryptography', '41.0.4'), PIL=10.0.1
```

Code + PDF

This is a minimal, complete example that shows the issue:

```python
import pypdf
import difflib

original = pypdf.PdfReader("original.pdf").pages[0].extract_text()
derived = pypdf.PdfReader("derived.pdf").pages[0].extract_text()

print(
    "\n".join(
        list(
            difflib.unified_diff(
                original.split(), derived.split(),
                fromfile="original", tofile="derived", n=0
            )
        )
    ).replace("\n\n", "\n")
)
```

Output:

```
--- original
+++ derived
@@ -52 +52 @@
-𝐴1𝐴2Y
+/u1D4341/u1D4342Y
@@ -92,3 +92,3 @@
-for𝐴2,
-with𝛼=1,
-𝛽=0.5,𝑞=25
+for/u1D4342,
+with/u1D6FC=1,
+/u1D6FD=0.5,/u1D45E=25
```

Test pdfs:

Oct 27 '23 09:10 equaeghe

This seems to be slightly related to #2038 as well.

Oct 27 '23 11:10 stefan6419846

@equaeghe If you open "derived.pdf", copy the sentence with the alpha and beta characters, and paste it, the characters look wrong. This is not the case with "original.pdf". The issue lies in the program that did the conversion. Sorry.

Oct 27 '23 17:10 pubpub-zz

> @equaeghe If you open "derived.pdf", copy the sentence with the alpha and beta characters, and paste it, the characters look wrong. This is not the case with "original.pdf". The issue lies in the program that did the conversion. Sorry.

Sorry, but I do not understand how copy-pasting in one specific application can be an argument. It just means that the application you are using (which one?) deals with this the same way pypdf does. (Both may be doing things correctly, or both may have a bug.) I'm assuming it displays the PDF correctly? (I still think it is a decoding issue.)

If I use okular to view the pdfs and copy-paste a fragment including the alpha and beta, I get:

Original:
```
with α = 1,
β = 0.5, q = 25
```
Derived:
```
with α = 1,
β = 0.5, q = 25
```

So okular does what one would expect based on the visual representation of the pdf.

If I use Firefox:

Original:
```
with 𝛼 = 1,
𝛽 = 0.5, 𝑞 = 25
```
Derived:
```
with 𝛼 = 1,
𝛽 = 0.5, 𝑞 = 25
```

So Firefox does what one would expect based on the visual representation of the PDF, even keeping the italics.
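
The two viewers differ only in whether the math-italic code points are kept (Firefox) or folded to plain letters (okular). For comparing text across tools, Unicode NFKC normalization should fold one into the other, as in the sketch below. This is only an illustration using Python's standard `unicodedata` module; the name `fold_math_styles` is made up, and NFKC also folds other compatibility characters (ligatures, superscripts), which may or may not be wanted.

```python
import unicodedata


def fold_math_styles(text: str) -> str:
    """Map styled math letters (e.g. U+1D6FC) to their plain counterparts."""
    return unicodedata.normalize("NFKC", text)


# Written with escapes so the exact code points are unambiguous.
firefox = "with \U0001D6FC = 1, \U0001D6FD = 0.5, \U0001D45E = 25"
okular = "with \u03B1 = 1, \u03B2 = 0.5, q = 25"

print(fold_math_styles(firefox) == okular)  # True
```

Either form has the same number of non-whitespace characters, so the count itself does not depend on which viewer's convention is used.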

If okular and Firefox can get the right characters out, so should pypdf.

Oct 27 '23 18:10 equaeghe
