Missing StructTreeRoot reference in 1.7 PDF

Open CourseAve-JF opened this issue 4 years ago • 4 comments

.NET 4.6.2 C# console app PdfPig 0.1.5 installed via NuGet

I'm working on an accessibility checker tool, which has to locate information in /StructTreeRoot /ParentTree. I have several 1.6 PDFs where this works as expected. However, I have a 1.7 PDF that was generated by saving a Word document as PDF. The /StructTreeRoot element appears to be there, but when I try to resolve the indirect reference, I get an error: 'Could not find the object with reference: 12 0.' Sure enough, PDFAnalyzer shows a similar thing when I view the PDF. The curious thing is that no object 12 exists in the cross-reference table.

When I view the PDF in a text editor, it appears that there is an xref table, along with a couple of xref streams:

xref
0 71
0000000012 65535 f
0000000017 00000 n
0000000166 00000 n
0000000222 00000 n
0000000495 00000 n
0000001778 00000 n
0000001939 00000 n
0000002165 00000 n
0000002218 00000 n
0000002271 00000 n
0000002438 00000 n
0000002670 00000 n
0000000013 65535 f
0000000014 65535 f
0000000015 65535 f

0000000063 65535 f
0000000064 65535 f
0000000065 65535 f
0000000000 65535 f
0000003793 00000 n
0000004050 00000 n
0000004265 00000 n
0000007350 00000 n
0000007395 00000 n
trailer
<</Size 71/Root 1 0 R/Info 11 0 R/ID[<197D8B74949BED4F93394F748EB48C61><197D8B74949BED4F93394F748EB48C61>] >>
startxref
7770
%%EOF
xref
0 0
trailer
<</Size 71/Root 1 0 R/Info 11 0 R/ID[<197D8B74949BED4F93394F748EB48C61><197D8B74949BED4F93394F748EB48C61>] /Prev 7770/XRefStm 7395>>
startxref
9346
%%EOF
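
For anyone following along, here is a minimal Python sketch (not PdfPig code) of how a classic xref section like the one above can be read. The names `XrefEntry` and `parse_xref_section` are made up for illustration, and the sample is only the first 15 entries of the excerpt with the subsection count changed from 71 to 15 so that it parses on its own. It shows that object 12 is present in the table, but only as a free entry.

```python
from typing import Dict, NamedTuple


class XrefEntry(NamedTuple):
    # For an in-use entry this is the byte offset of the object;
    # for a free entry it is the number of the next free object.
    offset: int
    generation: int
    in_use: bool


def parse_xref_section(text: str) -> Dict[int, XrefEntry]:
    """Parse one classic 'xref' section (hypothetical helper, not PdfPig)."""
    tokens = text.split()
    if not tokens or tokens[0] != "xref":
        raise ValueError("expected 'xref' keyword")
    entries: Dict[int, XrefEntry] = {}
    i = 1
    # Each subsection starts with "<first object number> <entry count>".
    while i + 1 < len(tokens) and tokens[i] != "trailer":
        first, count = int(tokens[i]), int(tokens[i + 1])
        i += 2
        for n in range(count):
            offset, gen, kind = tokens[i], tokens[i + 1], tokens[i + 2]
            entries[first + n] = XrefEntry(int(offset), int(gen), kind == "n")
            i += 3
    return entries


# First 15 entries from the issue; the count is adjusted from 71 to 15
# because the middle of the table was elided in the original post.
sample = """xref
0 15
0000000012 65535 f
0000000017 00000 n
0000000166 00000 n
0000000222 00000 n
0000000495 00000 n
0000001778 00000 n
0000001939 00000 n
0000002165 00000 n
0000002218 00000 n
0000002271 00000 n
0000002438 00000 n
0000002670 00000 n
0000000013 65535 f
0000000014 65535 f
0000000015 65535 f
trailer
"""

table = parse_xref_section(sample)
free_objects = sorted(n for n, e in table.items() if not e.in_use)
print(table[12].in_use, free_objects)
```

The second trailer carries both /Prev and /XRefStm, which is what the PDF specification calls a hybrid-reference file: objects that look free in the classic table may have their real entries in the cross-reference stream at the /XRefStm offset.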

Adobe Acrobat seems to open this PDF and parse out the structure information just fine. So is this some sort of edge case in the 1.7 specification, or what? Is there a way to make adjustments so that PdfPig can read the structure info? I can provide the whole doc if need be ... nothing special in it.

CourseAve-JF, Nov 22 '21 23:11

After further digging into PdfPig and the CrossReferenceParser code, here is what I found:

For this PDF, the trailer consists of a combination of cross-reference tables and a cross-reference stream. It is this stream that contains the missing references for the StructTreeRoot, which is why my doc.Structure.GetObject(NameToken.StructTreeRoot) call fails.

Digging deeper, we find that by line 219 of CrossReferenceParser.cs, the table object contains the following relevant data:

table.parts:
[0]: Type = Table, Offset = 9346, ObjectOffsets = 0, Previous = 0
[1]: Type = Stream, Offset = 7395, ObjectOffsets = 69, Previous = -1
[2]: Type = Table, Offset = 7770, ObjectOffsets = 16, Previous = 0

After the table.Build(...) call on line 219, the resulting CrossReferenceTable object only has 16 items in ObjectOffsets.

If I change CrossReferenceParser so that, when it detects a stream part in the table, it no longer adds the table entry followed by the stream entry, but instead adds a single entry that combines the stream's ObjectOffsets with the table entry's Offset and Previous, then the extra 69 ObjectOffsets are available in the resulting CrossReferenceTable along with the other 16 items, and the StructTreeRoot reference can be resolved.
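
A rough Python sketch of this first approach, to make the idea concrete. This is not the PdfPig change itself: `Part`, `fold_streams_into_tables` and `build` are hypothetical names, the `previous` and `xref_stm` values are taken from the trailers quoted earlier rather than from the debugger dump, and the object offsets inside the stream part are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Part:
    kind: str                       # "table" or "stream"
    offset: int                     # where this xref section starts
    previous: Optional[int]         # /Prev offset, if any
    xref_stm: Optional[int] = None  # /XRefStm offset from a table trailer
    object_offsets: Dict[int, int] = field(default_factory=dict)


def fold_streams_into_tables(parts: List[Part]) -> List[Part]:
    """Replace each (table, its /XRefStm stream) pair with one table part."""
    by_offset = {p.offset: p for p in parts}
    absorbed = set()
    merged: List[Part] = []
    for p in parts:
        if p.kind == "table" and p.xref_stm in by_offset:
            stream = by_offset[p.xref_stm]
            combined = dict(stream.object_offsets)
            # The table's own in-use entries win over the stream's.
            combined.update(p.object_offsets)
            merged.append(Part("table", p.offset, p.previous, None, combined))
            absorbed.add(stream.offset)
        else:
            merged.append(p)
    # Drop stream parts that were folded into a table.
    return [p for p in merged if not (p.kind == "stream" and p.offset in absorbed)]


def build(parts: List[Part], start_offset: int) -> Dict[int, int]:
    """Follow the /Prev chain from the newest part; newer entries override older."""
    by_offset = {p.offset: p for p in parts}
    chain: List[Part] = []
    seen = set()
    cur = by_offset.get(start_offset)
    while cur is not None and cur.offset not in seen:
        seen.add(cur.offset)
        chain.append(cur)
        cur = by_offset.get(cur.previous) if cur.previous is not None else None
    resolved: Dict[int, int] = {}
    for p in reversed(chain):  # apply oldest first
        resolved.update(p.object_offsets)
    return resolved


parts = [
    Part("table", 9346, previous=7770, xref_stm=7395),
    # Invented offsets standing in for the 69 entries in the real stream.
    Part("stream", 7395, previous=None, object_offsets={12: 2900, 13: 3100}),
    Part("table", 7770, previous=None, object_offsets={1: 17, 2: 166, 11: 2670}),
]

folded = fold_streams_into_tables(parts)
resolved = build(folded, start_offset=9346)
print(sorted(resolved))
```

With the stream folded into the newest table part, object 12 becomes resolvable while the entries from the older table at 7770 are still present.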

Another alternative I thought of is in CrossReferenceTable.build: if we encounter an entry that has an XRefStm property, resolve it by looking up that Offset in the parts collection and use those ObjectOffsets.
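
The second approach could look roughly like the Python sketch below, which leaves the parts list untouched and resolves /XRefStm while building. Again this is only an illustration under assumptions: `build_offsets` and the dictionary keys are hypothetical, the `previous` and `xref_stm` values come from the trailers quoted earlier, and the stream's object offsets are invented.

```python
from typing import Dict, List


def build_offsets(parts: List[dict], start_offset: int) -> Dict[int, int]:
    """Walk the /Prev chain and consult each table's /XRefStm on the way."""
    by_offset = {p["offset"]: p for p in parts}
    layers: List[Dict[int, int]] = []  # newest first
    seen = set()
    cur = by_offset.get(start_offset)
    while cur is not None and cur["offset"] not in seen:
        seen.add(cur["offset"])
        # Lookup order for one update: the table itself, then its /XRefStm.
        layers.append(cur["object_offsets"])
        stream = by_offset.get(cur.get("xref_stm"))
        if stream is not None:
            layers.append(stream["object_offsets"])
        cur = by_offset.get(cur.get("previous"))
    resolved: Dict[int, int] = {}
    for layer in reversed(layers):  # apply oldest first so newer layers win
        resolved.update(layer)
    return resolved


parts = [
    {"kind": "table", "offset": 9346, "previous": 7770, "xref_stm": 7395,
     "object_offsets": {}},
    # Invented offsets standing in for the 69 entries in the real stream.
    {"kind": "stream", "offset": 7395, "previous": None,
     "object_offsets": {12: 2900, 13: 3100}},
    {"kind": "table", "offset": 7770, "previous": None,
     "object_offsets": {1: 17, 2: 166, 11: 2670}},
]

resolved = build_offsets(parts, start_offset=9346)
print(sorted(resolved))
```

The trade-off between the two is mostly where the special case lives: in the parser when the parts are collected, or in the builder when they are merged.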

Both approaches solve the problem of ensuring that the XRefStm data ends up in the CrossReferenceTable object. However, I'm not sure if I'm overlooking anything, or if these approaches might cause other issues. I'm open to discussion. I also don't know at what point I should open a pull request.

CourseAve-JF, Dec 10 '21 22:12

Hi @CourseAve-JF, sorry for the very late reply. I have been out of action with COVID and travel recently.

I think the first approach should work. If you're able to make that into a PR, I can run it through my local file archive and check that it doesn't cause any regressions. Thanks.

EliotJones, Dec 30 '21 13:12

Hey @EliotJones, no worries on the delayed response. This was for a project at my company. After getting my workaround to solve the problem, which proved the concept of what I could do with your library, I got pulled onto a different project. However, I will be circling back to this at some point, so yes, eventually I should be able to do a PR, and we can see if we can get this code integrated. Thanks for responding ... it'll be nicer to have the fix integrated.

CourseAve-JF, Jan 14 '22 23:01

I'm pretty sure I'm experiencing this issue when working with PDFs created using Word 365. Would really appreciate a fix 😅 I tried to do it myself, but I'm struggling 😞

ahkjeldsen, Feb 06 '22 16:02

I think there might have been a fix here? Not sure. I'll close this for now since I have zero time at the moment, but if it reoccurs, feel free to open an issue with a file that reproduces it.

EliotJones, Dec 11 '22 20:12