pypdf
pypdf copied to clipboard
Merging of rotated pages leads to unexpected results
When I try to get width and height of the page, I encounter a problem: width and height are misplaced in SOME (but not all!) of the PDFs. The page looks like completely normal A4 portrait document, but using reader.pages[0].mediabox
returns wrong dimensions [0, 0, 841.68, 595.2]
instead of [0, 0, 595.2, 841.68]
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
# macOS-12.5.1-arm64-arm-64bit
# and also on every system on docker (Ubuntu, CentOS)
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
# 2.10.3
Code + PDF
This is a minimal, complete example that shows the issue:
from PyPDF2 import PdfReader
reader = PdfReader(path)
dimensions = reader.pages[0].mediabox
width = dimensions.width
height = dimensions.height
I removed sensitive data.
Output
![Xnapper-2022-08-26-12 11 02](https://user-images.githubusercontent.com/77989743/186870229-71949850-e32d-437c-bea8-cb29834f6105.png)
When you check all the entries in the page, you have: {'/Type': '/Page', '/Parent': IndirectObject(2, 0, 2357934091920), '/Resources': IndirectObject(4, 0, 2357934091920), '/Contents': IndirectObject(3, 0, 2357934091920), '/MediaBox': [0, 0, 841.68, 595.2], '/Rotate': 270} The page is rotated at display. the MediaBox reported is correct
The page is rotated at display. the MediaBox reported is correct @pubpub-zz
Is there a way to "rotate" it correctly? Because If I merge this PDF with my stamp, it will be misplaced and rotated. But still, the PDF itself is looking as it should, portrait orientation, no any rotation
Because this is what happens now with this PDF
From my point of view you issue is with the stamp file. My first thoughts : a) try to change your paper orientation at a different position (at 'printer' level for example) b) create a small page just to fit the stamp with a non standard file, and then merge it using translations (https://pypdf2.readthedocs.io/en/latest/user/cropping-and-transforming.html) c) use your existing file and find the adequate transformation before merging/rotating
@Nikita7x Thank you for reporting this issue :pray:
I took the freedom to adjust your code so that it uses the new syntax. You might want to check the deprecation warnings (python -Wall yourscript.py
).
I've also adjusted the title of the issue as I think it describes the issue better. Do you think so as well or did I get something wrong? If the new title is ok, maybe you can adjust the example code (I guess your're using something similar to https://pypdf2.readthedocs.io/en/latest/user/add-watermark.html )
I fully understand why the current behavior of PyPDF2 is not as expected. At the moment, PyPDF2 is essentially doing the simplest way. I think we could (and likely should) make PyPDF2 behave as you expect, but we need to make sure that we don't break existing code if we adjust it.
By the way: I love your schematic image for the current stamp behavior :heart_eyes: Did you create them yourself? May we use them in the official docs?
@pubpub-zz We could add an orientation
property to the boxes (mediabox, cropbox, bleedbox, trimbox, artbox). That would simply be the orientation of the page itself. Adding this property would come with almost no computational / memory / maintenance cost, but it might make this error source easier to spot for users. I'm undecided what I think about the idea myself... what do you think about it?
From my point of view you issue is with the stamp file.
I thought so, but it turns out, it's not about the stamp, it's about the first document. What I've tried also is:
- Rotating main document (using
.rotate
,.rotateCounterClockwise
) - Rotating the stamp file (using
.rotate
,.rotateCounterClockwise
) - Rotating main document and the stamp file
The result is always the same.
As far as I understand the origin of these files: they are being put into a scanner, but with wrong rotation. If you had an experience scanning documents, then you probably remember the guides on the scanning plate, that asks to lay paper vertically. So, some people lay paper down horizontally and after the scanning procedure it appears to be landscape. So they rotate it using software. And after that PDF “looks” normal: vertical (portrait) orientation, but on deeper lever it is still a landscape one. And when I use PyPDF2, it grabs information not from “upper level” of this document, but from “deeper level”, what causes these problems. All of this is only my suspicion and deduction.
Did you create them yourself? May we use them in the official docs?
Yeah, I made it myself in Sketch. Of course, feel free to use it everywhere
@MartinThoma Apparently the business I'm working at really relies on PyPDF2 and stamping ability. Maybe I can help somehow to resolve this issue? What can I do so that stamp will be applied in place in any document "rotation"? @pubpub-zz mentioned that
'/Rotate': 270 The page is rotated at display.
So, what can I do with this Rotate parameter? It seems like applying .rotate()
(.rotateClockwise)
rotates it, but after .merge everything is as messed up as it was, just rotated.
@Nikita7x Would you mind sharing the "Rotated PDF" (without the text) and the "Stamp" image in a bit higher resolution with me? I will create some examples + add them to the docs.
I also get confused by this and I need to give it a try. But it's important to note that PDF has two mechanisms for rotating stuff: https://pypdf2.readthedocs.io/en/latest/user/cropping-and-transforming.html#page-rotation
@MartinThoma to make a "Rotated PDF" I simply use .rotate()
function. So you can download original PDF and apply .rotate()
function.
And here comes the Stamp, this is the exact file I use. I just removed private data
stamp.pdf
And here comes the Stamp, this is the exact file I use. I just removed private data
I was talking about the sketch images :-) You could share them just as PNG; I can convert them to PDF and show the effects
Nice! I think I have an issue related to this topic. So far, I understood that when I use .rotate()
only the page display is rotated. That's the reason it doesn't work when merged. But I haven't figured out how to make it work.
It seems as if there is a bug and thus the solution I proposed doesn't work: https://github.com/py-pdf/PyPDF2/issues/1301
Thank you! I wish I could contribute to solve the issue, but my python knowledge is not enough so far. But I really appreciate the effort you and all the contributors are putting in this project.
@MartinThoma how's it going? Do you know maybe how can I fix this issue myself? My work really relies on that functionality. Maybe I need to rotate it differently somehow?
@Nikita7x Under Analysis
@Nikita7x I'm out of ideas for this one (and currently I only have the energy for simple changes in PyPDF2; this one needs more than I can give at the moment)
But you're lucky - @pubpub-zz is the top contributor to PyPDF2 :-)
this are the test files I've used: Stamp files MyStampFin.pdf : vertical page with no /Rotate MyStampFin90a.pdf : vertical displayed, landscape layout with /Rotate =90
Page to be stamped 0004.pdf : Layout page 998719.pdf : Vertical page
the use of page.rotate()
is not adequate as the flag is not taken into account in the merging.
we have to apply transformation before merging. Many points have to be considered for that:
- an error has been found in in transformation.ctm (fixed in #1341)
- be carefull about signs /Rotate is clockwise positive where as transformation are trigonometric (counterclockwise positive)
- rotate transformation center is at 0,0 therefore you need to translate before and after
- the (visual) origin 0,0 is normally on lower/left corner but when /Rotate exists the center moves to upper/left (Rotate=90), upper/right(Rotate =180), lower/right (Rotate=270) and be aware that the axis are turning simultaneously
- when applying a rotation, the cropping area is introduced in content. To prevent that the easiest is to set expand to True
with the following code, I get:
import PyPDF2
reader = PyPDF2.PdfReader("e:/Downloads/MyStampFin.pdf")
st = r.pages[0]
writer = PyPDF2.PdfWriter()
org = PyPDF2.PdfReader("e:/Downloads/0004.pdf")
page = org.pages[0]
st.add_transformation(
PyPDF2.Transformation()
.translate(-float(st.mediabox.getWidth()) / 2, -float(st.mediabox.getHeight()) / 2)
.rotate(90)
.translate(float(st.mediabox.getHeight()) / 2, float(st.mediabox.getWidth()) / 2),
True,
)
page.merge_page(st)
writer.add_page(page)
with open("e:/Downloads/MyStamp0a.pdf", "wb") as fo:
writer.write(fo)
This can also be applied with the stamp with /Rotate=90 (note the rotation with is reversed:
import PyPDF2
reader = PyPDF2.PdfReader("e:/Downloads/MyStampFin90a.pdf")
st = r.pages[0]
writer = PyPDF2.PdfWriter()
org = PyPDF2.PdfReader("e:/Downloads/998719.pdf")
page = org.pages[0]
st.add_transformation(
PyPDF2.Transformation()
.translate(-float(st.mediabox.getWidth()) / 2, -float(st.mediabox.getHeight()) / 2)
.rotate(-90)
.translate(float(st.mediabox.getHeight()) / 2, float(st.mediabox.getWidth()) / 2),
True,
)
page.merge_page(st)
writer.add_page(page)
with open("e:/Downloads/MyStamp90a.pdf", "wb") as fo:
writer.write(fo)
So @Nikita7x you should be able to get a quick solution
@MartinThoma, @MasterOdin In order to improve the API I propose 2 new functions: a) add a page.get_rotate() function : to retrieve the page /Rotate value b) add a page.transfer_rotate_to_content() function : to transfer the /Rotate flag into content (apply the good translation/rotations, change MediaBox/... and reset the translation field
Your opinion
@pubpub-zz thanks a lot for your effort! I'll try this on my examples and see if that will fix it
First of all: VERY nice analysis and explanation @pubpub-zz ! Thank you so much for that :pray:
a) add a page.get_rotate() function : to retrieve the page /Rotate value
I would prefer a rotation
property as this feels overall more consistent / cleaner to me. Especially if we just use the /Rotate
value. We could make that property even writable.
b) add a page.transfer_rotate_to_content() function
That is interesting! I was just wondering if there are other "standardizations" that people typically want to do. That function would return None
, right?
Sounds reasonable to me :+1:
I would prefer a
rotation
property as this feels overall more consistent / cleaner to me. Especially if we just use the/Rotate
value. We could make that property even writable.
I then propose to use rotation with a getter and a setter . .rotate
would be possible with .rotation += 90. Just wanting to avoid slightly different wording for getting and setting.
b) add a page.transfer_rotate_to_content() function
That is interesting! I was just wondering if there are other "standardizations" that people typically want to do. That function would return
None
, right?
Morelickely returning self to allow cascading
Wow! This is going, people. Thank you so much for your time on this problem. The last modification almost solved my issue.
I did this based on @pubpub-zz example:
import PyPDF2
a4_file = PyPDF2.PdfReader('a4_file.pdf')
a4_page = a4_file.pages[0]
blue_file = PyPDF2.PdfReader('blue_file.pdf')
blue_page = blue_file.pages[0]
pink_file = PyPDF2.PdfReader('pink_file.pdf')
pink_page = pink_file.pages[0]
pink_page.add_transformation(
PyPDF2.Transformation()
.translate(-float(pink_page.mediabox.getWidth()) / 2, -float(pink_page.mediabox.getHeight()) / 2)
.rotate(90)
.translate(float(pink_page.mediabox.getHeight()) / 2, float(pink_page.mediabox.getWidth()) / 2)
.translate(float(blue_page.mediabox.getWidth()), 0),
True,
)
a4_page.merge_page(blue_page)
a4_page.merge_page(pink_page)
writer = PyPDF2.PdfWriter()
writer.add_page(a4_page)
with open('result.pdf', 'wb') as final:
writer.write(final)
Considering I have these files:
This is the result I'd like to get:
Insted, this is the output I'm getting with the code above:
Upon further inspection on several PDF editig software, I can see the transformation is happening, but somehow there is a clipping mask on the pink_page
that cuts it. We can observe it on the file wireframe:
At this point, I feel like I'm asking too much. Sorry for that. But, if somehow, someone could take some time on this issue, it would be much appreciated.
Anyway, thank you all for your time on this package. It is awesome, and I love it!
People, I think I found a solution! 😄
At least now I know what the issue is: when transformations are applied, the page boxes '/ArtBox', '/BleedBox', '/CropBox', '/MediaBox', '/TrimBox'
are not being updated acordingly. So you have to do it manually.
Here is an example where I successfully centered an A4 page inside an A3 page on the X axis while keeping it aligned on the botton.
Code:
from PyPDF2 import PdfReader, Transformation, PdfWriter
from PyPDF2.generic import RectangleObject
a3_file = PdfReader('A3.pdf')
a3_page = a3_file.pages[0]
a4_file = PdfReader('A4.pdf')
a4_page = a4_file.pages[0]
# Rotate 'a4_page.pdf' 90 degrees and centers it on 'a3_page.pdf':
a4_page.add_transformation(
Transformation()
.translate(-float(a4_page.mediaBox.getWidth()) / 2, -float(a4_page.mediaBox.getHeight()) / 2)
.rotate(90)
.translate(float(a4_page.mediaBox.getHeight()) / 2, float(a3_page.mediaBox.getLowerLeft_y() + a4_page.mediaBox.getWidth() / 2))
)
# Manually update page boxes:
new_width = a4_page.mediaBox.getHeight()
new_heigth = a4_page.mediaBox.getWidth()
new_y = a3_page.mediaBox.getLowerLeft_y()
a4_page.update({'/ArtBox': RectangleObject([0, new_y, new_width, new_y + new_heigth])})
a4_page.update({'/BleedBox': RectangleObject([0, new_y, new_width, new_y + new_heigth])})
a4_page.update({'/CropBox': RectangleObject([0, new_y, new_width, new_y + new_heigth])})
a4_page.update({'/MediaBox': RectangleObject([0, new_y, new_width, new_y + new_heigth])})
a4_page.update({'/TrimBox': RectangleObject([0, new_y, new_width, new_y + new_heigth])})
a3_page.merge_page(a4_page)
writer = PdfWriter()
writer.add_page(a3_page)
with open('merged_a3.pdf', 'wb') as fo:
writer.write(fo)
@pubpub-zz thank you! Your solution with rotating the stamp file works perfectly!
@pubpub-zz thank you! Your solution with rotating the stamp file works perfectly!
@Nikita7x Have you applied the change from the PR ?
I updated PyPDF2 to a newer version which supports method **transfer_rotation_to_content**
and now I'm getting Exception while trying to use this method on pages with rotation
first_page = original_pdf.getPage(0)
first_page.transfer_rotation_to_content()
transfer_rotation_to_content
pt1 = trsf.apply_on(cast(RectangleObject, self[b]).lower_left)
AttributeError: 'ArrayObject' object has no attribute 'lower_left'
The exact file: rot_file.pdf