pypdf icon indicating copy to clipboard operation
pypdf copied to clipboard

BUG: Fix merge of a cropped page

Open pubpub-zz opened this issue 2 years ago • 13 comments

tracked in #636

thee smaller box between cropBox is and trimBox(== mediaBox by default) is used

pubpub-zz avatar May 16 '22 09:05 pubpub-zz

Codecov Report

Merging #879 (9cf5171) into main (759cbc3) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #879   +/-   ##
=======================================
  Coverage   92.15%   92.16%           
=======================================
  Files          24       24           
  Lines        4948     4950    +2     
  Branches     1024     1025    +1     
=======================================
+ Hits         4560     4562    +2     
  Misses        244      244           
  Partials      144      144           
Impacted Files Coverage Δ
PyPDF2/_page.py 92.65% <100.00%> (+0.02%) :arrow_up:

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov[bot] avatar May 16 '22 09:05 codecov[bot]

@MartinThoma , any issue with this PR ?

pubpub-zz avatar May 22 '22 21:05 pubpub-zz

@pubpub-zz I simply didn't have the time to review it so far :sweat_smile:

MartinThoma avatar May 23 '22 05:05 MartinThoma

😳

pubpub-zz avatar May 23 '22 11:05 pubpub-zz

I've just fixed the merge conflicts I've caused. For the review, I still need to do those things:

  1. Read #636 and understand the issue
  2. Inspect a couple of examples of merges where this change makes a difference

This takes some time. Getting the 2.0.0-dev branch back into main has priority to me as this unblocks other changes. Still, I'm confident that this PR will be handled in the next 4 weeks.

MartinThoma avatar May 23 '22 20:05 MartinThoma

I was reading "10.10.1 Page Boundaries":

  • The media box defines the boundaries of the physical medium on which the page is to be printed. It may include any extended area surrounding the finished page for bleed, printing marks, or other such purposes. It may also include areas close to the edges of the medium that cannot be marked because of physical limitations of the output device. Content falling outside this boundary can safely be discarded without affecting the meaning of the PDF file.
  • The crop box defines the region to which the contents of the page are to be clipped (cropped) when displayed or printed. Unlike the other boxes, the crop box has no defined meaning in terms of physical page geometry or intended use; it merely imposes clipping on the page contents. However, in the absence of additional information (such as imposition instructions specified in a JDF or PJTF job ticket), the crop box determines how the page’s contents are to be positioned on the output medium. The default value is the page’s media box.
  • The bleed box (PDF 1.3) defines the region to which the contents of the page should be clipped when output in a production environment. [...] The default value is the page’s crop box.
  • The trim box (PDF 1.3) defines the intended dimensions of the finished page after trimming. It may be smaller than the media box to allow for production-related content, such as printing instructions, cut marks, or color bars. The default value is the page’s crop box.
  • The art box (PDF 1.3) defines the extent of the page’s meaningful content (including potential white space) as intended by the page’s creator. The default value is the page’s crop box.

There is also a pretty helpful image.

MartinThoma avatar May 26 '22 09:05 MartinThoma

This PR changes the behavior of _merge_page.

It changes which rectangle is appened to the path (re operation; TABLE 4.9 Path construction operators).

If the page that is about to be merged has a crop box that is smaller in any dimension than its trim box, the crop box will be added. Otherwise the trim box will be added.

MartinThoma avatar May 26 '22 09:05 MartinThoma

@pubpub-zz Why don't we simply always take the cropbox?

MartinThoma avatar May 26 '22 10:05 MartinThoma

Pdfjam seems to also use the cropbox, but the positioning is different (I have no clue what pdfjam does there; PyPDF2 pins the mediabox to the lower-left corner. PyPDF2 essentially just uses the same coordinate system to overlay):

pdfjam crazyones.pdf git-cropped.pdf --outfile merged-pdfjam.pdf --nup "1x2"

Pdftk also seems to use the cropbox, but the positioning is different from PyPDF2 and pdfjam (pinning the merged pages cropbox to the upper right corner):

pdftk git-cropped.pdf background crazyones.pdf output merged-pdftk.pdf

MartinThoma avatar May 26 '22 10:05 MartinThoma

@pubpub-zz Why don't we simply always take the cropbox?

@MartinThoma Update : after reviewing the code I confirm my proposal : The issue comes from cases where the Cropbox is not defined but trim yes. default values are already taken into account

pubpub-zz avatar May 26 '22 10:05 pubpub-zz

The part that I'm really confused about is what if the crop box and the trim box are completely disjunct? If they don't overlap at all? What result would you expect?

It seems weird to me to switch from trim box to crop box just because one of the dimensions of the crop box is smaller. What is the logic behind that?

MartinThoma avatar May 26 '22 14:05 MartinThoma

By the way, this is how I checked the results:

from PyPDF2 import PdfReader, PdfWriter
from PyPDF2.generic import RectangleObject

reader = PdfReader("box.pdf")
crop_page = reader.pages[0]
print(crop_page.mediabox)
print(crop_page.cropbox)
print(crop_page.trimbox)

writer = PdfWriter()
crop_page.cropbox = RectangleObject((0, 0, 400, 400))
writer.add_page(crop_page)
with open("git-cropped.pdf", "wb") as fp:
    writer.write(fp)

reader = PdfReader("crazyones.pdf")
crazy_page = reader.pages[0]

crazy_page.merge_page(crop_page)

writer = PdfWriter()
writer.add_page(crazy_page)

with open("merged-pypdf2-crop.pdf", "wb") as fp:
    writer.write(fp)

MartinThoma avatar May 26 '22 14:05 MartinThoma

@pubpub-zz I'm sorry that I didn't merge this PR so far. I'm simply really uncertain if this is doing the right thing. The test also currently succeeds (without your change). This makes it hard for me to judge.

Let's make an example:

Scenario 1

page

Assume the gray part is where we actually have content.

The green rectangle is the cropbox and the red one is the trimbox. What should we use for rect and why?

Scenario 2

grow-2

The green rectangle is the cropbox and the red one is the trimbox. What should we use for rect and why?

Scenario 3

grow-2

The green rectangle is the trimbox and the red one is the cropbox. What should we use for rect and why?

MartinThoma avatar Jul 03 '22 15:07 MartinThoma