Use latest version of objects from object streams (#1)

Open mjbryant opened this issue 6 years ago • 1 comment

Previously, when an object was parsed from an object stream and it referenced an indirect object, it'd pull the current version of that object at parse time. This means if you have an object stream that declares updated versions of two objects, the first of which references the second, the first object will have the incorrect old value for the second object. For example, if the content of an object stream is something like (formatted for clarity, and with probably incorrect offsets):

1 0 2 40 
<</Count 3 /Kids [2 0 R] /Type /Pages>>
<</Count 3 /Kids [4 0 R 5 0 R 6 0 R] /Parent 1 0 R /Type /Pages>>

The object stream here defines both objects (1, 0) and (2, 0). If this is an incremental update for (2, 0), the previous version of the code would make /Kids for (1, 0) the previous version of (2, 0). This was manifesting in several PDFs we found in the wild as incorrect page counts. The PDFs had added additional pages in incremental updates, and the old /Pages objects with incorrect kids were getting used.
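To make the ordering problem concrete, here is a minimal toy sketch of the two strategies. It does not use pdfrw's internals: `Ref`, `resolve`, `load_eager`, `load_deferred`, and `make_base` are hypothetical names invented for illustration, and the "old" version of (2, 0) with two kids is an assumed starting state, not data from a real PDF.

```python
class Ref:
    """Stand-in for an indirect reference such as `2 0 R` (hypothetical)."""
    def __init__(self, objnum, gen=0):
        self.key = (objnum, gen)


def resolve(values, table):
    """Swap each Ref in a list for whatever `table` holds for it right now."""
    return [table[v.key] if isinstance(v, Ref) else v for v in values]


def load_eager(table, stream_objects):
    """Resolve /Kids while each object is parsed (the old behaviour, roughly)."""
    table = dict(table)
    for key, obj in stream_objects:
        obj = dict(obj)
        # (1, 0) is handled before (2, 0) is stored, so it sees the old (2, 0).
        obj["/Kids"] = resolve(obj["/Kids"], table)
        table[key] = obj
    return table


def load_deferred(table, stream_objects):
    """Store every object from the stream first, then resolve references."""
    table = dict(table)
    for key, obj in stream_objects:
        table[key] = dict(obj)
    for key, _ in stream_objects:
        table[key]["/Kids"] = resolve(table[key]["/Kids"], table)
    return table


def make_base():
    """Assumed pre-update state: (2, 0) lists only two of the three pages."""
    base = {(n, 0): {"/Type": "/Page"} for n in (4, 5, 6)}
    base[(2, 0)] = {"/Type": "/Pages", "/Count": 2,
                    "/Kids": [base[(4, 0)], base[(5, 0)]]}
    return base


# The object stream from the example: new versions of (1, 0) and (2, 0).
update = [
    ((1, 0), {"/Type": "/Pages", "/Count": 3, "/Kids": [Ref(2)]}),
    ((2, 0), {"/Type": "/Pages", "/Count": 3,
              "/Kids": [Ref(4), Ref(5), Ref(6)]}),
]

eager = load_eager(make_base(), update)
deferred = load_deferred(make_base(), update)
print(eager[(1, 0)]["/Kids"][0]["/Count"])     # 2: stale (2, 0)
print(deferred[(1, 0)]["/Kids"][0]["/Count"])  # 3: updated (2, 0)
```

The only difference between the two loaders is when the lookup happens. Deferring it until every object in the stream has been registered is one way to guarantee that references point at the latest versions.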

I've run this branch against all pdfrw tests and they all still pass. This includes roundtrips for lots of existing PDFs, so I'm fairly confident that it's not going to break the status quo. It also fixes several of the PDFs that broke for us on pdfrw master.

mjbryant, Jun 29 '19 21:06

Thank you. I will have some time to look at this late next month.

On Sat, Jun 29, 2019 at 4:06 PM Michael Bryant wrote:

You can view, comment on, or merge this pull request online at:

https://github.com/pmaupin/pdfrw/pull/169

pmaupin, Jun 29 '19 21:06