Opening an existing PDF file for import ignores two of its pages

Open kenlyon opened this issue 2 years ago • 9 comments

I want to combine two existing PDFs into a new document. When I try this, two of the pages are missing. I can see that the document returned by PdfReader.Open() has only two pages instead of the expected four.

I have followed all the steps relating to producing Issue.zip which contains a working example of the problem. The only step that I couldn't fully follow was "4. Send us the zip file" as I see no email addresses or form upload options anywhere. Please let me know how I'm meant to send the file and I will do so.

Reporting an Issue Here

Expected Behavior

Opening an existing PDF file with four pages for import should result in a document with all four pages.

Actual Behavior

The document only contains the first two pages. In the old version of PDFsharp, this failed with the error "Invalid predictor in array". The new version doesn't crash, but it doesn't include all pages either.

Steps to Reproduce the Behavior

I have Issue.zip all ready to go. Tell me where you want it.

kenlyon avatar Oct 17 '23 17:10 kenlyon

Other folks simply attach the ZIP files to their issue posts on GitHub.

ThomasHoevel avatar Oct 18 '23 06:10 ThomasHoevel

@ThomasHoevel Ah, thanks for the tip. I must have missed that. I was looking for a way to attach a file while writing the bug report.

Here you go: Issue.zip

kenlyon avatar Oct 18 '23 13:10 kenlyon

I downloaded the file.
Nothing obviously wrong with the PDF.

It probably takes a few hours with the debugger to understand why the two pages are missing.

ThomasHoevel avatar Oct 18 '23 14:10 ThomasHoevel

The PDF specification reads:

Together, the combination of an object number and a generation number shall uniquely identify an indirect object.

The file with the issue has several duplicated object IDs. PDFsharp uses one of the objects and ignores the duplicates. By making a different choice, PDFsharp probably could find four pages instead of two.
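A rough way to check a file for duplicated object IDs is to scan the raw bytes for `N G obj` headers. The following is only a sketch under stated assumptions: the helper name `find_duplicate_object_ids` is hypothetical, the regular expression can produce false matches inside binary stream data, and objects stored in compressed object streams are not visible to it at all.

```python
import re
from collections import Counter
from typing import Dict, Tuple

# Matches "<object number> <generation number> obj" in raw PDF bytes.
# Assumption: a plain byte scan is good enough for a quick diagnosis.
OBJ_HEADER = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")


def find_duplicate_object_ids(data: bytes) -> Dict[Tuple[int, int], int]:
    """Return every (object number, generation) that occurs more than once."""
    counts = Counter(
        (int(m.group(1)), int(m.group(2))) for m in OBJ_HEADER.finditer(data)
    )
    return {obj_id: n for obj_id, n in counts.items() if n > 1}


# Tiny hand-made sample: object 2 0 appears twice, as after an incremental update.
sample = (
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 2 >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 4 >>\nendobj\n"
)
duplicates = find_duplicate_object_ids(sample)
print(duplicates)
```

To try it on a real file, pass the result of `open(path, "rb").read()` instead of `sample`.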

But after hours of debugging, I have no clue how to achieve that. So I'm afraid there will be no change in PDFsharp in the near future.

ThomasHoevel avatar Oct 25 '23 07:10 ThomasHoevel

@ThomasHoevel Thanks for looking into this and providing this explanation. I will investigate how our customer is generating these files in the first place to see if we can address it there. If the file does not comply with the PDF specification then I think it's fair enough that you handle it the way you do. I'm grateful that it fails more gracefully than the previous version of PDFsharp.

kenlyon avatar Oct 25 '23 14:10 kenlyon

@kenlyon Thanks for providing the example documents.

As we regularly receive documents from our customers created by tools that take the PDF-spec not too seriously, I'm always on the hunt for "problematic" PDFs, to fix issues before one of our customers complains.

In the case of the provided documents, however, I think PDFsharp is not behaving properly. The spec says in chapter 7.5.6 (Incremental Updates):

...a file that has been updated several times contains several trailers. Because updates are appended to PDF files, 
a file may have several copies of an object with the same object identifier (object number and generation number).

And later:

When a conforming reader reads the file, it shall build its cross-reference information in such a way
that the most recent copy of each object shall be the one accessed from the file.

When reading a PDF, the library reads all trailers from back to front; that is, it reads the last (most recent) trailer and, if it has a /Prev entry, reads the trailer found there and repeats, collecting all object references it finds on the way. When reading the actual objects, it takes the collected references, sorts them by their ObjectID and then reads them. There are some issues with this approach and the provided files (especially document1):

  • The document seems to be incrementally updated 2 times (we now have multiple objects with the same ObjectID)
  • The updated objects are stored in new ObjectStreams
  • When an object is parsed from an ObjectStream, the library keeps the first that was found, ignoring all others (see here)
  • By sorting the objects by their ObjectID, the library actually gives the oldest object preference (added objects typically have larger ObjectIDs)

The last point is actually the inverse of what the spec says.
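What the spec asks for can be sketched as follows. This is a minimal model, not PDFsharp's code: `XrefSection`, `collect_sections` and `build_xref` are hypothetical names, and the offsets are made-up sample values. The chain is walked from the newest section via /Prev, and an entry is only recorded if no newer section has already supplied that object ID.

```python
from typing import Dict, List, Optional, Tuple

ObjectId = Tuple[int, int]  # (object number, generation number)


class XrefSection:
    """Hypothetical stand-in for one cross-reference section plus its trailer."""

    def __init__(self, entries: Dict[ObjectId, int], prev: Optional[int] = None):
        self.entries = entries  # object ID -> byte offset of the object
        self.prev = prev        # offset of the previous section (/Prev), if any


def collect_sections(by_offset: Dict[int, XrefSection], start: int) -> List[XrefSection]:
    """Follow /Prev from the newest section; result is ordered newest first."""
    chain: List[XrefSection] = []
    seen = set()
    offset: Optional[int] = start
    while offset is not None and offset not in seen:  # guard against /Prev loops
        seen.add(offset)
        section = by_offset[offset]
        chain.append(section)
        offset = section.prev
    return chain


def build_xref(chain: List[XrefSection]) -> Dict[ObjectId, int]:
    """Most recent copy wins: keep the first entry seen while going newest to oldest."""
    table: Dict[ObjectId, int] = {}
    for section in chain:
        for obj_id, obj_offset in section.entries.items():
            table.setdefault(obj_id, obj_offset)
    return table


# Sample data: an original file plus two incremental updates that rewrite object 2 0.
sections = {
    900: XrefSection({(1, 0): 15, (2, 0): 80}),
    1800: XrefSection({(2, 0): 1000, (3, 0): 1100}, prev=900),
    2700: XrefSection({(2, 0): 2000, (4, 0): 2100}, prev=1800),
}
xref = build_xref(collect_sections(sections, start=2700))
print(xref[(2, 0)])
```

In this model, object 2 0 resolves to the copy written by the last update, while objects that were never rewritten still resolve to their original offsets.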

With the mentioned pdf, the following happens:

  • read the oldest xref-stream
  • read the object-stream referenced from that xref-stream
  • read the /Pages dictionary
  • read the next-oldest xref-stream
  • read object-stream
  • do not read the newer version of the /Pages dictionary because that object already exists

I was able to fix this in a local branch by simply re-sorting the xref-streams before handling them. (newest first)
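The effect of the processing order can be reproduced with a toy model. This is only an illustration under assumptions, not the actual change in the local branch: `load_keep_first` is a hypothetical helper that mimics "keep the first object found, ignore later ones", and the revision contents are invented sample data.

```python
from typing import Dict, List

# Three revisions of a file, oldest first: the original plus two incremental
# updates. Each maps an object number to its (already parsed) dictionary.
revisions: List[Dict[int, dict]] = [
    {1: {"Type": "Catalog", "Pages": 2}, 2: {"Type": "Pages", "Count": 2}},
    {2: {"Type": "Pages", "Count": 3}},
    {2: {"Type": "Pages", "Count": 4}},
]


def load_keep_first(ordered: List[Dict[int, dict]]) -> Dict[int, dict]:
    """Load objects in the given order; an object that already exists is skipped."""
    objects: Dict[int, dict] = {}
    for revision in ordered:
        for number, obj in revision.items():
            if number not in objects:
                objects[number] = obj
    return objects


# Oldest first: the stale /Pages dictionary is registered and never replaced.
oldest_first = load_keep_first(revisions)

# Newest first: the same keep-first rule now yields the most recent copy.
newest_first = load_keep_first(list(reversed(revisions)))

print(oldest_first[2]["Count"], newest_first[2]["Count"])
```

With the keep-first rule left unchanged, only the order in which the revisions are handled decides whether the old or the new /Pages dictionary ends up in the document.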

@ThomasHoevel I could provide a pull request if you like.

packdat avatar Nov 01 '23 16:11 packdat

@packdat I'll have a look when you provide a PR. I thought the parser was using the newest objects, as tables were read from rear to front, starting with the newest XREF table. But it is not my code and maybe I missed something. Thanks for your efforts.

ThomasHoevel avatar Nov 02 '23 07:11 ThomasHoevel

The fix by packdat should be included in version 6.1.0 coming later this year or next year. Thanks for the feedback. Issue still exists with version 6.0.0.

ThomasHoevel avatar Nov 14 '23 07:11 ThomasHoevel

Check if this issue is related: https://github.com/empira/PDFsharp/issues/62#issue-2028913050

ThomasHoevel avatar Dec 06 '23 16:12 ThomasHoevel