PdfPig
PdfDocumentFormatException on GetPage() call
I have a PDF that throws a PdfDocumentFormatException on a GetPage() call. This happens both when opening the PDF with default settings and with LenientParsingOff.
UglyToad.PdfPig.Core.PdfDocumentFormatException
HResult=0x80131500
Message=Invalid type of toUnicode CMap encountered. Got: 340 0.
Source=UglyToad.PdfPig
StackTrace:
at UglyToad.PdfPig.PdfFonts.Parser.Handlers.Type0FontHandler.Generate(DictionaryToken dictionary) in C:\Daten\Projekte\PdfPig-master\src\UglyToad.PdfPig\PdfFonts\Parser\Handlers\Type0FontHandler.cs:line 82
at UglyToad.PdfPig.PdfFonts.FontFactory.Get(DictionaryToken dictionary) in C:\Daten\Projekte\PdfPig-master\src\UglyToad.PdfPig\PdfFonts\FontFactory.cs:line 44
at UglyToad.PdfPig.Content.ResourceStore.LoadFontDictionary(DictionaryToken fontDictionary) in C:\Daten\Projekte\PdfPig-master\src\UglyToad.PdfPig\Content\ResourceStore.cs:line 155
at UglyToad.PdfPig.Content.ResourceStore.LoadResourceDictionary(DictionaryToken resourceDictionary) in C:\Daten\Projekte\PdfPig-master\src\UglyToad.PdfPig\Content\ResourceStore.cs:line 46
at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, Boolean clipPaths) in C:\Daten\Projekte\PdfPig-master\src\UglyToad.PdfPig\Parser\PageFactory.cs:line 66
at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber, Boolean clipPaths) in C:\Daten\Projekte\PdfPig-master\src\UglyToad.PdfPig\Content\Pages.cs:line 66
at UglyToad.PdfPig.PdfDocument.GetPage(Int32 pageNumber) in C:\Daten\Projekte\PdfPig-master\src\UglyToad.PdfPig\PdfDocument.cs:line 169
Before that, two NullReferenceExceptions are thrown with the following call stacks, but both are handled by PdfPig:
UglyToad.PdfPig.Parser.Parts.DirectObjectFinder.TryGet<UglyToad.PdfPig.Tokens.StreamToken>(UglyToad.PdfPig.Tokens.IToken token, UglyToad.PdfPig.Tokenization.Scanner.IPdfTokenScanner scanner, out UglyToad.PdfPig.Tokens.StreamToken tokenResult) Line 40
UglyToad.PdfPig.PdfFonts.Parser.Handlers.Type0FontHandler.Generate(UglyToad.PdfPig.Tokens.DictionaryToken dictionary) Line 67
UglyToad.PdfPig.Parser.Parts.DirectObjectFinder.TryGet<UglyToad.PdfPig.Tokens.NameToken>(UglyToad.PdfPig.Tokens.IToken token, UglyToad.PdfPig.Tokenization.Scanner.IPdfTokenScanner scanner, out UglyToad.PdfPig.Tokens.NameToken tokenResult) Line 40
UglyToad.PdfPig.PdfFonts.Parser.Handlers.Type0FontHandler.Generate(UglyToad.PdfPig.Tokens.DictionaryToken dictionary) Line 76
I've confirmed the exception with version 0.1.5-alpha002 and with the current master branch (at a3e316958abd40dfb48cf088b9018d513e665d24).
The error only happens on page number 1. The PDF has 77 pages and all other pages work fine with GetPage().
Sadly, I can't share the whole PDF since it contains sensitive data. I could share page 1, but I haven't yet found a tool that can extract the page in a way that still causes the error in the new PDF file.
The original PDF was created with the application "Wondershare PDFelement" by merging multiple PDFs (mostly scans). The PDF was then modified by adding multiple highlights and annotations. Page 1 was automatically created by PDFelement during the merge process. It's a table of contents (called "catalog" by PDFelement).
I don't think that the original PDF is invalid or malformed since all applications I tested - including the PDFBox Debugger app - can show it without errors.
I think this is the underlying cause of your problem in #344. Can you open the PDF file in Notepad++ or similar and check what the content is for the entry:
340 0 obj
// something here
endobj
There is no 340 0 obj in the PDF.
If I search just for 340 0 then I only get hits where it is part of 1340 0 or 2340 0.
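The manual search described above can also be scripted. The following is only a minimal sketch, not part of PdfPig: the helper name `find_object_markers` and the sample bytes are made up for illustration. It scans the raw bytes for `340 0 obj` definitions and `340 0 R` references and uses a look-behind so that `1340 0` and `2340 0` are not counted as hits.

```python
import re


def find_object_markers(data: bytes, obj_num: int, gen: int = 0):
    """Return (definition_offsets, reference_offsets) for one object number.

    Hypothetical helper for illustration; it only sees plain-text markers
    in the raw file bytes and does not parse the PDF.
    """
    # (?<!\d) rejects matches that are the tail of a longer number,
    # e.g. the "340 0" inside "1340 0".
    prefix = rb"(?<!\d)%d\s+%d\s+" % (obj_num, gen)
    definitions = [m.start() for m in re.finditer(prefix + rb"obj\b", data)]
    references = [m.start() for m in re.finditer(prefix + rb"R\b", data)]
    return definitions, references


# Made-up sample bytes that mimic the situation in this issue:
# object 340 is referenced but never defined.
sample = (
    b"1 0 obj\n<< /ToUnicode 340 0 R >>\nendobj\n"
    b"1340 0 obj\n<< >>\nendobj\n"
    b"2340 0 R\n"
)
definitions, references = find_object_markers(sample, 340)
print(len(definitions), len(references))
```

For a real file, the bytes would come from something like `open(path, "rb").read()`. A result with references but no definitions would match what was reported here.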
Agree with Eliot, this is the cause for #344. Will open a PR to tweak the dictionary writing logic to put a NullToken instead of leaving the entry empty in these scenarios.
Although #359 was merged and #344 was fixed, this issue is still open. As described in the comments of #359, the PR only changed the PDF writing side of PdfPig, but this issue happens on the PDF reading side.
I just tested the current version 0.1.6-alpha-20220113-5b66e and can confirm that the issue still happens with that.
Please don't see this comment as a "push". This issue is not that important for me. I just wanted to clarify the current status.
Thanks for the update. If you still have access to the source file, can you check whether it contains /Type /XRef or /XRef using Notepad++? I'm wondering if it's related to a bug with cross-reference streams.
There are 21 hits for /Type /XRef in the file. But there are no hits for just /XRef.
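The same check can be run as a script. This is a rough sketch under stated assumptions: `count_xref_markers` is a hypothetical helper, the sample bytes are invented, and the counts are plain text matches rather than a parse of the file. It also counts `/Type /ObjStm`, because in files that use cross-reference streams an object may be stored inside a compressed object stream, where no plain-text `N 0 obj` header exists. That is one possible reason a text search finds no `340 0 obj`; it was not confirmed in this thread.

```python
import re


def count_xref_markers(data: bytes) -> dict:
    """Count plain-text markers related to cross-reference data.

    Hypothetical helper for illustration only.
    """
    return {
        # Cross-reference streams (PDF 1.5 and later).
        "xref_streams": len(re.findall(rb"/Type\s*/XRef\b", data)),
        # Object streams, which hold compressed indirect objects.
        "object_streams": len(re.findall(rb"/Type\s*/ObjStm\b", data)),
        # Classic "xref" table keyword; the look-behind skips "startxref".
        "classic_xref_tables": len(
            re.findall(rb"(?<![A-Za-z])xref(?![A-Za-z])", data)
        ),
    }


# Invented sample bytes containing one marker of each kind.
sample = (
    b"5 0 obj\n<< /Type /XRef /W [1 2 1] >>\nstream\nendstream\nendobj\n"
    b"6 0 obj\n<< /Type/ObjStm /N 3 >>\nendobj\n"
    b"xref\n0 1\ntrailer\nstartxref\n123\n%%EOF\n"
)
print(count_xref_markers(sample))
```

A non-zero `object_streams` count on the real file would suggest that a text editor search cannot rule out the object being present.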
I believe then this might be related to https://github.com/UglyToad/PdfPig/issues/391
Closing this as wontfix since there's no possibility I'll have time in the next few years. If it reoccurs and you have a sample file, feel free to open a new issue.