/uXXXXX instead of a single character in extracted text for some pdfs

Open equaeghe opened this issue 8 months ago • 3 comments

I am trying to get a somewhat reliable estimate of the number of visual (non-whitespace, non-metadata) characters in pdf files. For this, I use the extract_text function.

I stumbled across a situation where visually identical text gives rise to different character counts. Namely, I have an original LaTeX-produced pdf and a derived version of it that was processed by some Adobe software. After investigating, it turns out that in the derived version, some characters from the original are replaced by /uXXXXX strings. This occurs mainly for math symbols. For example, where the original has 𝛼, the derived version has the string /u1D6FC (and indeed U+1D6FC is the italic math alpha in Unicode).
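
To make the discrepancy concrete, here is a minimal sketch of the kind of count I have in mind, assuming "visual character" simply means any non-whitespace code point. The helper name count_visible is made up for this illustration and is not part of pypdf.

```python
def count_visible(text: str) -> int:
    """Count non-whitespace code points (hypothetical helper, not a pypdf API)."""
    return sum(1 for ch in text if not ch.isspace())


# The same fragment as it comes out of the two pdfs.
original_fragment = "with\U0001D6FC=1,"  # italic math alpha as a single code point
derived_fragment = "with/u1D6FC=1,"      # the same glyph spelled out as /u1D6FC

print(count_visible(original_fragment))  # 8
print(count_visible(derived_fragment))   # 14
```

Each affected symbol therefore adds six spurious characters to the count for the derived pdf.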

I assume the above difference is due to some underlying difference in encoding of the unicode character. I would like to use pypdf to get a somewhat reliable estimate of the number of visual characters and think in this case, the correct thing for pypdf to do would be to interpret /u1D6FC at the appropriate point in its text extraction processing pipeline as 𝛼 and similarly for all other such unicode characters.
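
As a stopgap on the caller's side, the /uXXXXX strings could be mapped back after extraction. The sketch below assumes the strings always have the form "/u" followed by 4 to 6 uppercase hex digits, as in the examples in this report. decode_u_names is a hypothetical helper, not a pypdf function. Because the hex digits can run straight into the following text (as in /u1D4341), it takes the longest prefix that is a valid Unicode code point. That is only a heuristic and can guess wrong: /u03B11, for instance, could mean U+03B1 followed by "1" or the single code point U+3B11.

```python
import re

# Assumption: the stray names look like "/u" + 4 to 6 uppercase hex digits.
_U_NAME = re.compile(r"/u([0-9A-F]{4,6})")


def _is_scalar_value(cp: int) -> bool:
    """True if cp is a valid Unicode code point outside the surrogate range."""
    return cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def decode_u_names(text: str) -> str:
    """Replace /uXXXXX strings by the character they name (hypothetical helper)."""

    def replace(match: re.Match) -> str:
        digits = match.group(1)
        # Try the longest hex prefix first; leftover digits are ordinary text.
        for length in range(len(digits), 3, -1):
            code_point = int(digits[:length], 16)
            if _is_scalar_value(code_point):
                return chr(code_point) + digits[length:]
        return match.group(0)  # nothing valid: leave the text untouched

    return _U_NAME.sub(replace, text)


print(decode_u_names("with/u1D6FC=1,"))
print(decode_u_names("/u1D4341/u1D4342Y"))
```

This only post-processes the output of extract_text; it does not address whatever causes the strings to appear in the first place.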

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.1.57-gentoo-a-x86_64-AMD_Ryzen_7_PRO_4750U_with_Radeon_Graphics-with-glibc2.37

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.5, crypt_provider=('cryptography', '41.0.4'), PIL=10.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

import pypdf
import difflib

original = pypdf.PdfReader("original.pdf").pages[0].extract_text()
derived = pypdf.PdfReader("derived.pdf").pages[0].extract_text()

print(
    "\n".join(
        list(
            difflib.unified_diff(
                original.split(), derived.split(),
                fromfile="original", tofile="derived", n=0
            )
        )
    ).replace("\n\n", "\n")
)

Output:

--- original
+++ derived
@@ -52 +52 @@
-𝐴1𝐴2Y
+/u1D4341/u1D4342Y
@@ -92,3 +92,3 @@
-for𝐴2,
-with𝛼=1,
-𝛽=0.5,𝑞=25
+for/u1D4342,
+with/u1D6FC=1,
+/u1D6FD=0.5,/u1D45E=25

Test pdfs:

equaeghe, Oct 27 '23 09:10

This seems to be slightly related to #2038 as well.

stefan6419846, Oct 27 '23 11:10

@equaeghe If you open "derived.pdf", copy the sentence with the alpha and beta characters, and paste it, the characters look wrong. This is not the case with "original.pdf", so the issue is in the program that is doing the conversion. Sorry.

pubpub-zz, Oct 27 '23 17:10

> @equaeghe If you open "derived.pdf" and try to copy the sentence with the alpha,beta characters and paste the characters, they look wrong. this is not true with "original.pdf" the issue is within the program which is doing the conversion. sorry

Sorry, but I do not understand how copy-pasting in some specific application can be an argument. It only shows that the application you are using (which one?) handles this the same way pypdf does. (Both may be doing things correctly, or both may have a bug.) I'm assuming it displays the pdf correctly? (I still think it is some decoding issue.)

If I use Okular to view the pdfs and copy-paste a fragment including the alpha and beta, I get:

  • Original:
    with α = 1,
    β = 0.5, q = 25
    
  • Derived:
    with α = 1,
    β = 0.5, q = 25
    

So Okular does what one would expect based on the visual representation of the pdf.

If I use Firefox:

  • Original:
    with 𝛼 = 1,
    𝛽 = 0.5, 𝑞 = 25
    
  • Derived:
    with 𝛼 = 1,
    𝛽 = 0.5, 𝑞 = 25
    

So Firefox does what one would expect based on the visual representation of the pdf, even keeping the italics.

If Okular and Firefox can get the right characters out, so should pypdf.

equaeghe, Oct 27 '23 18:10