PdfDocumentFormatException on GetPage() call

Open cremor opened this issue 4 years ago • 7 comments

I have a PDF that throws a PdfDocumentFormatException on a GetPage() call. This happens both when opening the PDF with default settings and with LenientParsingOff.

UglyToad.PdfPig.Core.PdfDocumentFormatException
  HResult=0x80131500
  Message=Invalid type of toUnicode CMap encountered. Got: 340 0.
  Source=UglyToad.PdfPig
  StackTrace:
   at UglyToad.PdfPig.PdfFonts.Parser.Handlers.Type0FontHandler.Generate(DictionaryToken dictionary) in C:\Daten\Projekte\PdfPig-master\src\UglyToad.PdfPig\PdfFonts\Parser\Handlers\Type0FontHandler.cs:line 82
   at UglyToad.PdfPig.PdfFonts.FontFactory.Get(DictionaryToken dictionary) in C:\Daten\Projekte\PdfPig-master\src\UglyToad.PdfPig\PdfFonts\FontFactory.cs:line 44
   at UglyToad.PdfPig.Content.ResourceStore.LoadFontDictionary(DictionaryToken fontDictionary) in C:\Daten\Projekte\PdfPig-master\src\UglyToad.PdfPig\Content\ResourceStore.cs:line 155
   at UglyToad.PdfPig.Content.ResourceStore.LoadResourceDictionary(DictionaryToken resourceDictionary) in C:\Daten\Projekte\PdfPig-master\src\UglyToad.PdfPig\Content\ResourceStore.cs:line 46
   at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, Boolean clipPaths) in C:\Daten\Projekte\PdfPig-master\src\UglyToad.PdfPig\Parser\PageFactory.cs:line 66
   at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber, Boolean clipPaths) in C:\Daten\Projekte\PdfPig-master\src\UglyToad.PdfPig\Content\Pages.cs:line 66
   at UglyToad.PdfPig.PdfDocument.GetPage(Int32 pageNumber) in C:\Daten\Projekte\PdfPig-master\src\UglyToad.PdfPig\PdfDocument.cs:line 169

Before that, two NullReferenceExceptions are thrown with the following call stacks, but both are handled by PdfPig:

UglyToad.PdfPig.Parser.Parts.DirectObjectFinder.TryGet<UglyToad.PdfPig.Tokens.StreamToken>(UglyToad.PdfPig.Tokens.IToken token, UglyToad.PdfPig.Tokenization.Scanner.IPdfTokenScanner scanner, out UglyToad.PdfPig.Tokens.StreamToken tokenResult) Line 40
UglyToad.PdfPig.PdfFonts.Parser.Handlers.Type0FontHandler.Generate(UglyToad.PdfPig.Tokens.DictionaryToken dictionary) Line 67
UglyToad.PdfPig.Parser.Parts.DirectObjectFinder.TryGet<UglyToad.PdfPig.Tokens.NameToken>(UglyToad.PdfPig.Tokens.IToken token, UglyToad.PdfPig.Tokenization.Scanner.IPdfTokenScanner scanner, out UglyToad.PdfPig.Tokens.NameToken tokenResult) Line 40
UglyToad.PdfPig.PdfFonts.Parser.Handlers.Type0FontHandler.Generate(UglyToad.PdfPig.Tokens.DictionaryToken dictionary) Line 76

I've confirmed the exception with version 0.1.5-alpha002 and with the current master branch (at a3e316958abd40dfb48cf088b9018d513e665d24).

The error only happens on page number 1. The PDF has 77 pages and all other pages work fine with GetPage(). Sadly, I can't share the whole PDF since it contains sensitive data. I could share page 1, but I haven't yet found a tool that can extract the page in a way that still causes the error in the new PDF file.

The original PDF was created with the application "Wondershare PDFelement" by merging multiple PDFs (mostly scans). The PDF was then modified by adding multiple highlights and annotations. Page 1 was automatically created by PDFelement during the merge process. It's a table of contents (called "catalog" by PDFelement).

I don't think that the original PDF is invalid or malformed since all applications I tested - including the PDFBox Debugger app - can show it without errors.

cremor avatar Aug 05 '21 06:08 cremor

I think this is the underlying cause of your problem in #344. Can you open the PDF file in Notepad++ or similar and check what the content is for the entry:

340 0 obj
// something here
endobj

EliotJones avatar Aug 09 '21 19:08 EliotJones

There is no 340 0 obj in the PDF.

If I search just for 340 0 then I only get hits where it is part of 1340 0 or 2340 0.
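A plain text search for 340 0 also matches inside 1340 0 and 2340 0, as noted above. The following is a minimal sketch, not part of PdfPig, that searches the raw bytes for an object header while rejecting matches preceded by another digit. The function name `find_object_headers` and the sample bytes are hypothetical and only serve as illustration.

```python
import re
from typing import List


def find_object_headers(data: bytes, obj_num: int, gen: int = 0) -> List[int]:
    """Return byte offsets of 'obj_num gen obj' headers in raw PDF bytes.

    The negative lookbehind rejects hits such as '1340 0 obj' when
    searching for object 340.
    """
    pattern = re.compile(rb"(?<![0-9])%d\s+%d\s+obj\b" % (obj_num, gen))
    return [match.start() for match in pattern.finditer(data)]


# Hypothetical sample: object 1340 must not be reported as object 340.
sample = b"1340 0 obj\n<<>>\nendobj\n340 0 obj\n<<>>\nendobj\n"
print(find_object_headers(sample, 340))
```

For a real file, the bytes could be read with `open(path, "rb").read()`. An empty result only shows that no uncompressed header exists in the raw bytes.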

cremor avatar Aug 10 '21 05:08 cremor

Agree with Eliot, this is the cause for #334. Will open a PR to tweak the dictionary writing logic to put a NullToken instead of just empty in these scenarios.

plaisted avatar Aug 10 '21 18:08 plaisted

Although #359 was merged and #344 was fixed, this issue is still open. As described in the comments on #359, the PR only changed the PDF writing side of PdfPig, but this issue happens on the PDF reading side.

I just tested the current version 0.1.6-alpha-20220113-5b66e and can confirm that the issue still happens with that.

Please don't see this comment as a "push". This issue is not that important to me. I just wanted to clarify the current status.

cremor avatar Jan 13 '22 12:01 cremor

Thanks for the update. If you still have access to the source file, can you check whether it contains /Type /XRef or /XRef using Notepad++? I'm wondering if it's related to a bug with cross-reference streams.

EliotJones avatar Jan 13 '22 12:01 EliotJones

There are 21 hits for /Type /XRef in the file. But there are no hits for just /XRef.
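The same check can be scripted instead of done by hand in a text editor. This is a rough sketch under stated assumptions: the function name `count_xref_markers` and the sample bytes are hypothetical, and it only scans the raw, uncompressed bytes of the file. It also counts /Type /ObjStm, because in files that use cross-reference streams an object can be stored inside a compressed object stream, in which case its "N 0 obj" header never appears as plain text. Whether that applies to object 340 here is not confirmed by anything in this thread.

```python
import re
from typing import Dict


def count_xref_markers(data: bytes) -> Dict[str, int]:
    """Count cross-reference related markers in raw PDF bytes."""
    patterns = {
        # Cross-reference stream dictionaries (PDF 1.5 and later).
        "xref_streams": rb"/Type\s*/XRef\b",
        # Object streams, which can hold objects in compressed form.
        "object_streams": rb"/Type\s*/ObjStm\b",
        # Classic 'xref' table keyword; the word boundary skips 'startxref'.
        "classic_xref_tables": rb"\bxref\b",
    }
    return {name: len(re.findall(p, data)) for name, p in patterns.items()}


# Hypothetical sample with one cross-reference stream and one object stream.
sample = (
    b"1 0 obj\n<< /Type /XRef /W [1 2 1] >>\nstream\nendstream\nendobj\n"
    b"2 0 obj\n<</Type/ObjStm /N 3>>\nendobj\n"
    b"startxref\n116\n%%EOF\n"
)
print(count_xref_markers(sample))
```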

cremor avatar Jan 13 '22 12:01 cremor

I believe then this might be related to https://github.com/UglyToad/PdfPig/issues/391

EliotJones avatar Jan 13 '22 14:01 EliotJones

Closing this as wontfix since there's no possibility I'll have time in the next few years. If it recurs and you have a sample file, feel free to open a new issue.

EliotJones avatar Dec 11 '22 20:12 EliotJones