pypdf
pypdf copied to clipboard
BUG: Fix merge of a cropped page
tracked in #636
thee smaller box between cropBox is and trimBox(== mediaBox by default) is used
Codecov Report
Merging #879 (9cf5171) into main (759cbc3) will increase coverage by
0.00%
. The diff coverage is100.00%
.
@@ Coverage Diff @@
## main #879 +/- ##
=======================================
Coverage 92.15% 92.16%
=======================================
Files 24 24
Lines 4948 4950 +2
Branches 1024 1025 +1
=======================================
+ Hits 4560 4562 +2
Misses 244 244
Partials 144 144
Impacted Files | Coverage Δ | |
---|---|---|
PyPDF2/_page.py | 92.65% <100.00%> (+0.02%) |
:arrow_up: |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
@MartinThoma , any issue with this PR ?
@pubpub-zz I simply didn't have the time to review it so far :sweat_smile:
😳
I've just fixed the merge conflicts I've caused. For the review, I still need to do those things:
- Read #636 and understand the issue
- Inspect a couple of examples of merges where this change makes a difference
This takes some time. Getting the 2.0.0-dev branch back into main has priority to me as this unblocks other changes. Still, I'm confident that this PR will be handled in the next 4 weeks.
I was reading "10.10.1 Page Boundaries":
- The media box defines the boundaries of the physical medium on which the page is to be printed. It may include any extended area surrounding the finished page for bleed, printing marks, or other such purposes. It may also include areas close to the edges of the medium that cannot be marked because of physical limitations of the output device. Content falling outside this boundary can safely be discarded without affecting the meaning of the PDF file.
- The crop box defines the region to which the contents of the page are to be clipped (cropped) when displayed or printed. Unlike the other boxes, the crop box has no defined meaning in terms of physical page geometry or intended use; it merely imposes clipping on the page contents. However, in the absence of additional information (such as imposition instructions specified in a JDF or PJTF job ticket), the crop box determines how the page’s contents are to be positioned on the output medium. The default value is the page’s media box.
- The bleed box (PDF 1.3) defines the region to which the contents of the page should be clipped when output in a production environment. [...] The default value is the page’s crop box.
- The trim box (PDF 1.3) defines the intended dimensions of the finished page after trimming. It may be smaller than the media box to allow for production-related content, such as printing instructions, cut marks, or color bars. The default value is the page’s crop box.
- The art box (PDF 1.3) defines the extent of the page’s meaningful content (including potential white space) as intended by the page’s creator. The default value is the page’s crop box.
There is also a pretty helpful image.
This PR changes the behavior of _merge_page
.
It changes which rectangle is appened to the path (re
operation; TABLE 4.9 Path construction operators).
If the page that is about to be merged has a crop box that is smaller in any dimension than its trim box, the crop box will be added. Otherwise the trim box will be added.
@pubpub-zz Why don't we simply always take the cropbox?
Pdfjam seems to also use the cropbox, but the positioning is different (I have no clue what pdfjam does there; PyPDF2 pins the mediabox to the lower-left corner. PyPDF2 essentially just uses the same coordinate system to overlay):
pdfjam crazyones.pdf git-cropped.pdf --outfile merged-pdfjam.pdf --nup "1x2"
Pdftk also seems to use the cropbox, but the positioning is different from PyPDF2 and pdfjam (pinning the merged pages cropbox to the upper right corner):
pdftk git-cropped.pdf background crazyones.pdf output merged-pdftk.pdf
@pubpub-zz Why don't we simply always take the cropbox?
@MartinThoma Update : after reviewing the code I confirm my proposal : The issue comes from cases where the Cropbox is not defined but trim yes. default values are already taken into account
The part that I'm really confused about is what if the crop box and the trim box are completely disjunct? If they don't overlap at all? What result would you expect?
It seems weird to me to switch from trim box to crop box just because one of the dimensions of the crop box is smaller. What is the logic behind that?
By the way, this is how I checked the results:
from PyPDF2 import PdfReader, PdfWriter
from PyPDF2.generic import RectangleObject
reader = PdfReader("box.pdf")
crop_page = reader.pages[0]
print(crop_page.mediabox)
print(crop_page.cropbox)
print(crop_page.trimbox)
writer = PdfWriter()
crop_page.cropbox = RectangleObject((0, 0, 400, 400))
writer.add_page(crop_page)
with open("git-cropped.pdf", "wb") as fp:
writer.write(fp)
reader = PdfReader("crazyones.pdf")
crazy_page = reader.pages[0]
crazy_page.merge_page(crop_page)
writer = PdfWriter()
writer.add_page(crazy_page)
with open("merged-pypdf2-crop.pdf", "wb") as fp:
writer.write(fp)
@pubpub-zz I'm sorry that I didn't merge this PR so far. I'm simply really uncertain if this is doing the right thing. The test also currently succeeds (without your change). This makes it hard for me to judge.
Let's make an example:
Scenario 1
Assume the gray part is where we actually have content.
The green rectangle is the cropbox and the red one is the trimbox. What should we use for rect
and why?
Scenario 2
The green rectangle is the cropbox and the red one is the trimbox. What should we use for rect
and why?
Scenario 3
The green rectangle is the trimbox and the red one is the cropbox. What should we use for rect
and why?