Garbled Text Extraction

Open alexneblett opened this issue 2 years ago • 2 comments

Hi,

Thank you for this amazing component. I have run into an issue extracting text from the attached pdf. The text from page.text is garbled, but if I open the pdf in Adobe Acrobat Reader, select all, then copy paste into notepad, the pasted text is what you see in the pdf (usually both are garbled with font mapping issues, etc.). This gives me hope (hopefully not false hope) that perhaps there is a way to extract the text. To be fair, I tried a few other components and they extracted the same garbled text.

Cheers,

Alex

fc30326e-64a0-4a6e-895a-c3d4aeae2974.pdf

alexneblett avatar May 02 '22 22:05 alexneblett

Hi @alexneblett, I just had a quick look at your PDF doc.

This will need to be confirmed but it seems the character data is missing, meaning the pdf doesn't 'know' which letter is which.

When I copied the text from Adobe Acrobat Reader into Notepad++, I got nonsense text (see below; I'm not sure whether this is what you meant in your post or whether you managed to get the actual text).

[image]

If I'm correct and some data is missing, you will be limited with what you can do with PdfPig alone...

One possible solution to get the text is to use optical character recognition (OCR). The main C# option is the .NET wrapper for Tesseract, available here: https://github.com/charlesw/tesseract

Would be nice if @EliotJones or someone else could check inside the pdf if it is not properly built, or if PdfPig can be improved to get the data. I guess one possible improvement would be to have the correct bounding boxes, for the moment they have height 0 and I guess each character path

BobLd avatar May 02 '22 22:05 BobLd

Hi @alexneblett, as @BobLd found, when I open the file in Edge/Firefox/Adobe Acrobat Reader I only get the 'nonsense' content by copying. Is it possible you're using a version of Acrobat that does some OCR or something similar?

Inspecting the content of the file in iText RUPS, it looks like every font in the file is a Type3 font (which represents letters as path-painting operations with no semantic meaning) and none of them has a proper Encoding dictionary. Unless a special version of Adobe has some way to interpret the path-painting operations and work out which characters they correspond to, I can't see a way any code could extract text content from this file.

For example here is a Type3 font defined in the file:

9 0 obj
<</CharProcs<</.notdef 10 0 R /0 11 0 R  ... etc>>/Encoding 124 0 R /FirstChar 0/FontBBox[ 0 0 1 -1]/FontMatrix[ 1 0 0 1 0 0]/LastChar 114/Subtype/Type3/Type/Font/Widths[ 1 1 ...etc]>>
endobj

And the corresponding Encoding object:

124 0 obj
<</Differences[ 0/0/1/2/3/4/.notdef/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/21/22/.notdef/24/25/26/27/28/29/30/31/32/33/34/35/36/37/38/39/40/41/42/43/44/45/46/47/48/49/50/51/52/53/54/55/56/57/58/59/60/61/62/63/64/65/66/67/68/69/70/71/72/73/74/75/76/77/78/79/80/81/82/83/84/85/86/87/88/89/90/91/92/93/94/95/96/97/98/99/100/101/102/103/104/105/106/107/108/109/110/111/112/113/114]/Type/Encoding>>
endobj

The Differences array is expected to map character codes to recognized Adobe glyph names (for example /A or /space). Here the names are just the numbers themselves, so unfortunately there doesn't appear to be any way to map this back to text content.
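To make the problem concrete, here is a minimal Python sketch (not PdfPig code) of how a Differences array assigns glyph names to character codes, and why the names in the Encoding object above give an extractor nothing to work with. The `KNOWN_GLYPH_NAMES` table is a tiny hypothetical stand-in for the Adobe Glyph List, the function names are made up for illustration, and the tokenizer ignores PDF details such as `#xx` escapes in names.

```python
import re

# Hypothetical, deliberately tiny subset of the Adobe Glyph List (illustration only).
KNOWN_GLYPH_NAMES = {
    "space": " ", "A": "A", "B": "B", "a": "a", "b": "b", "one": "1", "two": "2",
}

# A token is either a /Name or a bare integer (simplified; real PDF lexing is richer).
TOKEN = re.compile(r"/([^\s/\[\]<>()]+)|(\d+)")


def parse_differences(array_text):
    """Return {character code: glyph name} for the inside of a /Differences array."""
    mapping = {}
    code = None
    for match in TOKEN.finditer(array_text):
        name, number = match.groups()
        if number is not None:
            code = int(number)      # a bare integer resets the current code
        elif code is not None:
            mapping[code] = name    # a name takes the current code...
            code += 1               # ...and the code advances by one
    return mapping


def unmapped_codes(mapping):
    """Codes whose glyph name is not a recognized name, so no character can be inferred."""
    return sorted(c for c, name in mapping.items() if name not in KNOWN_GLYPH_NAMES)


# Shortened version of the array in object 124: the glyph names are just numbers.
numeric = parse_differences("0/0/1/2/3/4/.notdef/6/7")

# Made-up contrasting example using conventional glyph names.
conventional = parse_differences("32/space 65/A/B 97/a/b")

print(unmapped_codes(numeric))
print(unmapped_codes(conventional))
```

With conventional names every code resolves to a character, while with the numeric names from this file every code is left unmapped, which matches the garbled output everyone is seeing.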

EliotJones avatar May 03 '22 01:05 EliotJones