Strange issue parsing a PDF with an OpenType font.

Open readingdancer opened this issue 1 year ago • 9 comments

My client has a bunch of PDF documents that are failing to parse and I have narrowed it down to their headings using the Semplicita Pro OpenType font from Adobe.

However, I created a test PDF using the same font and it works fine. I also removed all other content from my client's PDF, changed the heading text, and re-saved it, and it still fails. So I am able to provide two example documents: one that works and one that fails.

I don't know if anyone will be able to debug this, but it would be greatly appreciated if you can :)

This works: Test Doc.pdf

This fails to parse: Why does this not work.pdf

Thank you in advance!

readingdancer avatar Aug 02 '23 17:08 readingdancer

The issue happens in the TrueTypeFontParser when parsing the TrueTypeDataBytes. Since it's an OpenType font, the data does not contain the necessary tables (i.e. 'head', 'hhea').

https://github.com/UglyToad/PdfPig/blob/8a82500427ace6d4dcb1b2ea7cb3fdb5e32c765d/src/UglyToad.PdfPig/PdfFonts/Parser/Parts/CidFontFactory.cs#L142C1-L147C26
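
For anyone following along, here is a rough Python sketch (not PdfPig code) of what "does not contain the necessary tables" means in practice. It reads the table directory of an sfnt (TrueType/OpenType) container and reports which of the two tables named above are absent. The function names and the `TABLES_NAMED_IN_THREAD` set are made up for illustration, and a real parser requires more tables than these two.

```python
import struct

# Signatures that can open an sfnt (TrueType/OpenType) container.
SFNT_SIGNATURES = {b"\x00\x01\x00\x00", b"OTTO", b"true", b"typ1"}

# Assumption: only the two tables mentioned in this thread are checked.
TABLES_NAMED_IN_THREAD = {"head", "hhea"}


def sfnt_table_tags(data: bytes):
    """Return the set of table tags in the sfnt table directory,
    or None if the bytes do not start with a known sfnt signature."""
    if len(data) < 12 or data[:4] not in SFNT_SIGNATURES:
        return None
    # Offset table: 4-byte version, then a big-endian uint16 table count.
    (num_tables,) = struct.unpack(">H", data[4:6])
    tags = set()
    for i in range(num_tables):
        # Each 16-byte record is: tag, checksum, offset, length.
        record = data[12 + 16 * i : 28 + 16 * i]
        if len(record) < 16:
            break  # truncated directory
        tags.add(record[:4].decode("latin-1"))
    return tags


def missing_tables(data: bytes):
    """Return the named tables that are absent, or None if not an sfnt."""
    tags = sfnt_table_tags(data)
    if tags is None:
        return None
    return TABLES_NAMED_IN_THREAD - tags
```

A `None` result means the bytes are not an sfnt container at all, which is a different failure from an sfnt container that lacks some tables.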

It seems the font bytes are actually parsable with CompactFontFormatParser, although with some errors.

I've attached the problematic bytes below, if someone wants to have a look too. test_unzip.txt

BobLd avatar Aug 05 '23 10:08 BobLd

@readingdancer May I ask which tool you used to create / edit the pdf files? Do you know which tool your client used?

Asking because the following might be related: https://community.adobe.com/t5/adobe-fonts-discussions/opentype-ps-fonts-are-identified-as-type-1-fonts-when-saving-indesign-files-to-pdf/td-p/11852099

BobLd avatar Aug 05 '23 11:08 BobLd

I used the latest version of Adobe Acrobat and just created a standard blank PDF. Added the text and saved the PDF with the default settings.

I will ask my client and get back to you next week. My guess is they probably used MS Word, but I will confirm when I find out.

Thank you for looking into this, I really appreciate it.

readingdancer avatar Aug 05 '23 13:08 readingdancer

Ok so apparently they use an application called Foxit.

https://www.foxit.com/pdf-editor/

They told me they just save with the default settings using that app.

readingdancer avatar Aug 05 '23 14:08 readingdancer

@readingdancer thanks a lot for the information. I've created a PR that should help you process these documents: https://github.com/UglyToad/PdfPig/pull/674

Once it's merged, let me know if that fixes your issue.

BobLd avatar Aug 05 '23 14:08 BobLd

Hi @BobLd I have just quickly tested (about to board a flight!) and it has worked for the original document and the test document that was failing, thank you for your help. I will push this up to the client server over the weekend and run it against the 200-ish documents that were failing, so hopefully all will be good. I really appreciate your quick fix on this issue.

readingdancer avatar Aug 05 '23 16:08 readingdancer

@readingdancer thanks a lot for the feedback!

@EliotJones I'm pretty sure the issue is related to https://github.com/UglyToad/PdfPig/issues/554 and the analysis done by @fnatzke (checking for fontFile[0] == 0x01 && fontFile[1] == 00) seems to be relevant (i.e. the font seems to be better parsable with CompactFontFormatParser, even with errors)
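
To make the check mentioned above concrete, here is a minimal Python sketch (an illustration, not the code from the PR or from PdfPig). A bare CFF font program starts with a four-byte header: major version, minor version, header size, offset size. The first two comparisons mirror the `fontFile[0] == 0x01 && fontFile[1] == 00` test. The extra checks on header size and offset size are my own addition based on the CFF header layout, and the function names are hypothetical.

```python
def looks_like_bare_cff(font_file: bytes) -> bool:
    """Heuristic: do these bytes start like a CFF header (version 1.0)?"""
    if len(font_file) < 4:
        return False
    major, minor, hdr_size, off_size = font_file[:4]
    # Version 1.0, as in the check quoted from the related issue.
    if major != 1 or minor != 0:
        return False
    # Assumption beyond the thread: the header covers at least its own
    # four fields, and offsets are 1 to 4 bytes wide.
    return hdr_size >= 4 and 1 <= off_size <= 4


def pick_parser_name(font_file: bytes) -> str:
    """Return the name of the parser the thread suggests trying first."""
    if looks_like_bare_cff(font_file):
        return "CompactFontFormatParser"
    return "TrueTypeFontParser"
```

Note that a TrueType font starts with `00 01 00 00`, so the two leading bytes are enough to tell it apart from a CFF header that starts with `01 00`.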

BobLd avatar Aug 05 '23 19:08 BobLd

Hi @BobLd & @EliotJones - I just thought I would update you to let you know I have built a local version with your changes and have now deployed this to my client's server. (This is running Umbraco and using https://github.com/umbraco/UmbracoExamine.PDF which in turn uses PdfPig.)

I have rebuilt my client's index of their PDF files and we now get zero reading errors, and the text content of all the files is successfully in the index.

Thank you for your help fixing this, I will be keeping an eye out for your next official release so that I can do a PR to update the Umbraco Examine PDF package too :)

P.S. I wasn't sure if you'd like me to close this ticket... feel free to do so.

readingdancer avatar Aug 07 '23 14:08 readingdancer

@readingdancer thanks a lot for the update, much appreciated! And very glad to see how PdfPig is used.

Let's leave the ticket open for the moment, as there's still an issue behind the scenes.

BobLd avatar Aug 09 '23 07:08 BobLd