pypdf

Ligature issue when converting PDF to text

Open gargarvin opened this issue 1 year ago • 4 comments

I am having a ligature issue with this PDF. The 'fi', 'fl' and 'ff' ligatures are extracted as NULL characters.

#598 is similar to this issue.

MVCE: Code + PDF

from PyPDF2 import PdfReader

reader = PdfReader("Inspection_redacted.pdf")
for page in reader.pages:
    print(page.extract_text())

PDF

gargarvin, Sep 16 '22 17:09

I did a quick analysis of the first page. Using some debug traces, I analysed the line starting with "PLUMBING SYSTEM - FAUCETS, VALVES AND CONNECTED FIXTURES:", looking at the sequence "ut off ha". The font I identified is F1. Its transcoding table is the following:

8 beginbfchar
<03> <0020>
<05> <0022>
<18> <0035>
<1B> <0038>
<1D> <003A>
<62> <00A0>
<E9> <0000>
<EA> <0000>
endbfchar
6 beginbfrange
<09> <16> <0026>
<24> <2C> <0041>
<2E> <3D> <004B>
<44> <4C> <0061>
<4E> <53> <006B>
<55> <5C> <0072>
endbfrange

The following codes are transcoded and added for that sequence:

b'\x00X' -> u
b'\x00W' -> t
b'\x00\x03' -> (space)
b'\x00R' -> o
b'\x00\xe9' -> (\x00)
b'\x00\x03' -> (space)
b'\x00K' -> h
b'\x00D' -> a
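The lookup traced above can be reproduced with a short Python sketch. This is only an illustration, not pypdf's actual implementation: the names `BFCHAR`, `BFRANGE`, `decode_code` and `decode_bytes` are hypothetical, and it assumes the two-byte codes in the content stream can be compared directly with the values listed in the table.

```python
# Entries copied from the bfchar block quoted in this thread: code -> Unicode.
BFCHAR = {
    0x03: 0x0020, 0x05: 0x0022, 0x18: 0x0035, 0x1B: 0x0038,
    0x1D: 0x003A, 0x62: 0x00A0, 0xE9: 0x0000, 0xEA: 0x0000,
}

# Entries copied from the bfrange block: (first code, last code, first Unicode).
BFRANGE = [
    (0x09, 0x16, 0x0026),
    (0x24, 0x2C, 0x0041),
    (0x2E, 0x3D, 0x004B),
    (0x44, 0x4C, 0x0061),
    (0x4E, 0x53, 0x006B),
    (0x55, 0x5C, 0x0072),
]


def decode_code(code: int) -> str:
    """Map one character code to text using the table above."""
    if code in BFCHAR:
        return chr(BFCHAR[code])
    for first, last, start in BFRANGE:
        if first <= code <= last:
            # Codes inside a range map to consecutive code points.
            return chr(start + (code - first))
    return "\ufffd"  # no mapping found: Unicode replacement character


def decode_bytes(data: bytes) -> str:
    """Split the byte string into two-byte big-endian codes and decode each."""
    codes = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join(decode_code(code) for code in codes)


# The byte sequence from the trace in this comment.
raw = b"\x00X\x00W\x00\x03\x00R\x00\xe9\x00\x03\x00K\x00D"
decoded = decode_bytes(raw)
print(repr(decoded))
```

The code 0xE9 is where the ligature glyph sits, and the table maps it to <0000>, so the NUL character comes from the font's own mapping rather than from the lookup logic.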

When using SumatraPDF and pdfminer.six, I get the same result with '\x00'. The only tool that seems to extract the text properly (using copy-paste) is Acrobat Reader, but I don't know where it gets the result from.

Help analysing this case would be welcome (@MartinThoma, can you set the labels accordingly?)

pubpub-zz, Sep 17 '22 08:09

Also of note - this tool seems to be able to convert the PDF successfully without using any sort of OCR.

gargarvin, Sep 19 '22 16:09

I worked around it like this. The 'ff' case does not behave like the others, which is why I map chr(0) to 'ff'.

page.extract_text().translate(str.maketrans({chr(0): 'ff', 0xFB01: 'fi', 0xFB02: 'fl', 0xFB03: 'ffi', 0xFB04: 'ffl'}))
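For anyone trying this, here is a self-contained sketch of the same translation applied to a plain string. The sample text and the names `LIGATURES` and `expand_ligatures` are made up for illustration and are not taken from the PDF. It assumes, as reported at the top of this issue, that ligatures without a usable mapping arrive as chr(0).

```python
# Translation table: NUL plus the Unicode ligature code points U+FB01..U+FB04.
LIGATURES = str.maketrans({
    "\x00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})


def expand_ligatures(text: str) -> str:
    """Replace NUL and ligature code points with plain ASCII letters."""
    return text.translate(LIGATURES)


# Made-up sample: 'ff' and 'fi' both extracted as NUL, one real U+FB01.
sample = "shut o\x00 valve, \x00xture, \ufb01tting"
result = expand_ligatures(sample)
print(result)
```

Every NUL is expanded to 'ff' whatever ligature it originally stood for, so a word whose 'fi' was extracted as NUL comes out with 'ff' instead.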

PavelHightTower, Nov 28 '23 18:11

The above method seems to replace every ligature with 'ff'. I also noticed that my original PDF does not load, so here it is again: Inspection_redacted.pdf

gargarvin, Nov 28 '23 22:11