pypdf Merging of rotated pages leads to unexpected results

Open Nikita7x opened this issue 1 year ago • 37 comments

When I try to get the width and height of a page, I run into a problem: the width and height are swapped in SOME (but not all!) of the PDFs. The page looks like a completely normal A4 portrait document, but reader.pages[0].mediabox returns the wrong dimensions: [0, 0, 841.68, 595.2] instead of [0, 0, 595.2, 841.68].

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
# macOS-12.5.1-arm64-arm-64bit
# and also on every system on docker (Ubuntu, CentOS)

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
# 2.10.3

Code + PDF

This is a minimal, complete example that shows the issue:

from PyPDF2 import PdfReader

path = "dim.pdf"  # the attached file
reader = PdfReader(path)
dimensions = reader.pages[0].mediabox

width = dimensions.width
height = dimensions.height

dim.pdf

I removed sensitive data.

Output

Aug 26 '22 09:08 Nikita7x

When you check all the entries in the page, you have:

{'/Type': '/Page', '/Parent': IndirectObject(2, 0, 2357934091920), '/Resources': IndirectObject(4, 0, 2357934091920), '/Contents': IndirectObject(3, 0, 2357934091920), '/MediaBox': [0, 0, 841.68, 595.2], '/Rotate': 270}

The page is rotated at display time (/Rotate is 270). The reported MediaBox is correct.
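To make the relationship between /MediaBox and /Rotate concrete, here is a minimal sketch in plain Python. It is not code from this thread, and the helper name displayed_size is hypothetical. It assumes /Rotate is a multiple of 90 and simply swaps width and height for 90 and 270:

```python
def displayed_size(mediabox, rotate=0):
    """Return (width, height) as a viewer would show the page.

    mediabox is [llx, lly, urx, ury]; rotate is the page's /Rotate value.
    Hypothetical helper for illustration, not part of PyPDF2.
    """
    llx, lly, urx, ury = (float(v) for v in mediabox)
    width, height = urx - llx, ury - lly
    if rotate % 360 in (90, 270):
        # A quarter turn at display time swaps the visible sides.
        width, height = height, width
    return width, height


# Values taken from the page dictionary above.
print(displayed_size([0, 0, 841.68, 595.2], 270))
```

With the values reported above, the displayed size comes out as portrait (595.2 wide, 841.68 high), even though the stored MediaBox is landscape.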

Aug 26 '22 10:08 pubpub-zz

The page is rotated at display. The MediaBox reported is correct.

@pubpub-zz Is there a way to "rotate" it correctly? Because if I merge this PDF with my stamp, the stamp ends up misplaced and rotated. Yet the PDF itself looks as it should: portrait orientation, no rotation at all.

Aug 26 '22 11:08 Nikita7x

Because this is what happens now with this PDF: Screenshot 2022-08-26 at 14 44 56

Aug 26 '22 11:08 Nikita7x

From my point of view your issue is with the stamp file. My first thoughts:
a) try to change your paper orientation at a different position (at 'printer' level, for example)
b) create a small page just to fit the stamp with a non-standard file, and then merge it using translations (https://pypdf2.readthedocs.io/en/latest/user/cropping-and-transforming.html)
c) use your existing file and find the adequate transformation before merging/rotating

Aug 27 '22 07:08 pubpub-zz

@Nikita7x Thank you for reporting this issue :pray:

I took the freedom to adjust your code so that it uses the new syntax. You might want to check the deprecation warnings (python -Wall yourscript.py).

I've also adjusted the title of the issue as I think it describes the issue better. Do you think so as well or did I get something wrong? If the new title is ok, maybe you can adjust the example code (I guess you're using something similar to https://pypdf2.readthedocs.io/en/latest/user/add-watermark.html )

I fully understand why the current behavior of PyPDF2 is not as expected. At the moment, PyPDF2 is essentially doing the simplest thing. I think we could (and likely should) make PyPDF2 behave as you expect, but we need to make sure that we don't break existing code if we adjust it.

Aug 27 '22 10:08 MartinThoma

By the way: I love your schematic image for the current stamp behavior :heart_eyes: Did you create them yourself? May we use them in the official docs?

Aug 27 '22 10:08 MartinThoma

@pubpub-zz We could add an orientation property to the boxes (mediabox, cropbox, bleedbox, trimbox, artbox). That would simply be the orientation of the page itself. Adding this property would come with almost no computational / memory / maintenance cost, but it might make this error source easier to spot for users. I'm undecided what I think about the idea myself... what do you think about it?
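A rough idea of what such a property could compute, as a plain-Python sketch. The function name box_orientation and the returned strings are assumptions for illustration; nothing like this is confirmed to exist in PyPDF2. It looks only at the box itself, so a page with /Rotate 90 or 270 would still be displayed the other way round:

```python
def box_orientation(box):
    """Classify a [llx, lly, urx, ury] box by its stored dimensions.

    Hypothetical helper; it ignores the page's /Rotate value on purpose.
    """
    llx, lly, urx, ury = (float(v) for v in box)
    width, height = abs(urx - llx), abs(ury - lly)
    if width > height:
        return "landscape"
    if height > width:
        return "portrait"
    return "square"


# The MediaBox from the reported file is stored as landscape.
print(box_orientation([0, 0, 841.68, 595.2]))
```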

Aug 27 '22 10:08 MartinThoma

From my point of view your issue is with the stamp file.

I thought so, but it turns out, it's not about the stamp, it's about the first document. What I've tried also is:

Rotating main document (using .rotate, .rotateCounterClockwise)
Rotating the stamp file (using .rotate, .rotateCounterClockwise)
Rotating main document and the stamp file

The result is always the same. Screenshot 2022-08-27 at 14 24 11

Aug 27 '22 11:08 Nikita7x

As far as I understand the origin of these files: they are put into a scanner, but with the wrong rotation. If you have ever scanned documents, you probably remember the guides on the scanning plate that ask you to lay the paper vertically. Some people lay the paper down horizontally, so after scanning it appears as landscape. Then they rotate it using software. After that the PDF "looks" normal, with vertical (portrait) orientation, but on a deeper level it is still a landscape one. And when I use PyPDF2, it grabs information not from the "upper level" of this document but from the "deeper level", which causes these problems. All of this is only my suspicion and deduction.

Aug 27 '22 12:08 Nikita7x

Did you create them yourself? May we use them in the official docs?

Yeah, I made it myself in Sketch. Of course, feel free to use it everywhere

Aug 27 '22 12:08 Nikita7x

@MartinThoma Apparently the business I'm working at really relies on PyPDF2 and stamping ability. Maybe I can help somehow to resolve this issue? What can I do so that stamp will be applied in place in any document "rotation"? @pubpub-zz mentioned that

'/Rotate': 270 The page is rotated at display.

So, what can I do with this Rotate parameter? It seems like applying .rotate() (.rotateClockwise) rotates it, but after .merge everything is as messed up as it was, just rotated.

Aug 27 '22 18:08 Nikita7x

@Nikita7x Would you mind sharing the "Rotated PDF" (without the text) and the "Stamp" image in a bit higher resolution with me? I will create some examples + add them to the docs.

I also get confused by this and I need to give it a try. But it's important to note that PDF has two mechanisms for rotating stuff: https://pypdf2.readthedocs.io/en/latest/user/cropping-and-transforming.html#page-rotation

Aug 27 '22 19:08 MartinThoma

@MartinThoma to make a "Rotated PDF" I simply use .rotate() function. So you can download original PDF and apply .rotate() function. And here comes the Stamp, this is the exact file I use. I just removed private data stamp.pdf

Aug 27 '22 19:08 Nikita7x

And here comes the Stamp, this is the exact file I use. I just removed private data

I was talking about the sketch images :-) You could share them just as PNG; I can convert them to PDF and show the effects

Aug 27 '22 20:08 MartinThoma

archive.zip

Aug 27 '22 20:08 Nikita7x

Nice! I think I have an issue related to this topic. So far, I understood that when I use .rotate() only the page display is rotated. That's the reason it doesn't work when merged. But I haven't figured out how to make it work.

Aug 29 '22 12:08 carecavoador

It seems as if there is a bug and thus the solution I proposed doesn't work: https://github.com/py-pdf/PyPDF2/issues/1301

Aug 29 '22 17:08 MartinThoma

Thank you! I wish I could contribute to solve the issue, but my python knowledge is not enough so far. But I really appreciate the effort you and all the contributors are putting in this project.

Aug 29 '22 18:08 carecavoador

@MartinThoma how's it going? Do you maybe know how I can fix this issue myself? My work really relies on that functionality. Maybe I need to rotate it differently somehow?

Sep 06 '22 13:09 Nikita7x

@Nikita7x Under Analysis

Sep 08 '22 17:09 pubpub-zz

@Nikita7x I'm out of ideas for this one (and currently I only have the energy for simple changes in PyPDF2; this one needs more than I can give at the moment)

But you're lucky - @pubpub-zz is the top contributor to PyPDF2 :-)

Sep 08 '22 18:09 MartinThoma

These are the test files I've used:

Stamp files
MyStampFin.pdf : vertical page with no /Rotate
MyStampFin90a.pdf : vertical displayed, landscape layout with /Rotate = 90

Pages to be stamped
0004.pdf : Layout page
998719.pdf : Vertical page

Using page.rotate() is not adequate because the /Rotate flag is not taken into account when merging. We have to apply a transformation before merging. Several points have to be considered for that:

An error has been found in transformation.ctm (fixed in #1341).
Be careful about signs: /Rotate is clockwise positive, whereas transformations are trigonometric (counterclockwise positive).
The center of a rotate transformation is at 0,0, therefore you need to translate before and after.
The (visual) origin 0,0 is normally at the lower/left corner, but when /Rotate exists it moves to upper/left (Rotate=90), upper/right (Rotate=180) or lower/right (Rotate=270). Be aware that the axes turn with it.
When a rotation is applied, the cropping area is introduced into the content. The easiest way to prevent that is to set expand to True.
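The sign convention and the translate, rotate, translate sequence can be sketched in plain Python. This is an illustration only: the function names rotate_about and ccw_angle_for_rotate_flag are hypothetical and nothing here calls PyPDF2.

```python
import math


def rotate_about(point, center, degrees_ccw):
    """Rotate point around center; positive angles are counterclockwise."""
    x, y = point
    cx, cy = center
    theta = math.radians(degrees_ccw)
    # Step 1: translate so that the center sits at the origin.
    dx, dy = x - cx, y - cy
    # Step 2: rotate around the origin.
    rx = dx * math.cos(theta) - dy * math.sin(theta)
    ry = dx * math.sin(theta) + dy * math.cos(theta)
    # Step 3: translate back.
    return rx + cx, ry + cy


def ccw_angle_for_rotate_flag(rotate):
    """Counterclockwise angle that turns content the way /Rotate does.

    /Rotate counts clockwise, so the sign has to be flipped.
    """
    return -rotate


# Lower-left corner of an A4 portrait box, turned like /Rotate = 90
# around the center of the box.
center = (595.2 / 2, 841.68 / 2)
corner = rotate_about((0, 0), center, ccw_angle_for_rotate_flag(90))
print(tuple(round(v, 2) for v in corner))
```

In this sketch the lower/left corner ends up on the upper/left side, which matches the description of where the visual origin moves for Rotate=90.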

With the following code, I get:

import PyPDF2

# Stamp: vertical page with no /Rotate
reader = PyPDF2.PdfReader("e:/Downloads/MyStampFin.pdf")
st = reader.pages[0]
writer = PyPDF2.PdfWriter()
# Page to be stamped
org = PyPDF2.PdfReader("e:/Downloads/0004.pdf")
page = org.pages[0]
# Move the stamp's center to the origin, rotate, then move it back
# (width and height are swapped after a quarter turn).
st.add_transformation(
    PyPDF2.Transformation()
    .translate(-float(st.mediabox.getWidth()) / 2, -float(st.mediabox.getHeight()) / 2)
    .rotate(90)
    .translate(float(st.mediabox.getHeight()) / 2, float(st.mediabox.getWidth()) / 2),
    True,  # expand
)
page.merge_page(st)
writer.add_page(page)
with open("e:/Downloads/MyStamp0a.pdf", "wb") as fo:
    writer.write(fo)

This can also be applied to the stamp with /Rotate=90 (note that the rotation is reversed):

import PyPDF2

# Stamp: displayed vertically, landscape layout with /Rotate = 90
reader = PyPDF2.PdfReader("e:/Downloads/MyStampFin90a.pdf")
st = reader.pages[0]
writer = PyPDF2.PdfWriter()
# Page to be stamped
org = PyPDF2.PdfReader("e:/Downloads/998719.pdf")
page = org.pages[0]
# Same translate, rotate, translate sequence, with the opposite angle.
st.add_transformation(
    PyPDF2.Transformation()
    .translate(-float(st.mediabox.getWidth()) / 2, -float(st.mediabox.getHeight()) / 2)
    .rotate(-90)
    .translate(float(st.mediabox.getHeight()) / 2, float(st.mediabox.getWidth()) / 2),
    True,  # expand
)
page.merge_page(st)
writer.add_page(page)
with open("e:/Downloads/MyStamp90a.pdf", "wb") as fo:
    writer.write(fo)

So @Nikita7x you should be able to get a quick solution

@MartinThoma, @MasterOdin In order to improve the API I propose 2 new functions:
a) add a page.get_rotate() function : to retrieve the page /Rotate value
b) add a page.transfer_rotate_to_content() function : to transfer the /Rotate flag into the content (apply the right translations/rotations, change MediaBox/... and reset the translation field)

Your opinion?

Sep 10 '22 15:09 pubpub-zz

@pubpub-zz thanks a lot for your effort! I'll try this on my examples and see if that will fix it

Sep 13 '22 08:09 Nikita7x

First of all: VERY nice analysis and explanation @pubpub-zz ! Thank you so much for that :pray:

a) add a page.get_rotate() function : to retrieve the page /Rotate value

I would prefer a rotation property as this feels overall more consistent / cleaner to me. Especially if we just use the /Rotate value. We could make that property even writable.

b) add a page.transfer_rotate_to_content() function

That is interesting! I was just wondering if there are other "standardizations" that people typically want to do. That function would return None, right?

Sounds reasonable to me :+1:

Sep 14 '22 04:09 MartinThoma

I would prefer a rotation property as this feels overall more consistent / cleaner to me. Especially if we just use the /Rotate value. We could make that property even writable.

I then propose to use rotation with a getter and a setter. .rotate would then be possible with .rotation += 90. I just want to avoid slightly different wording for getting and setting.

b) add a page.transfer_rotate_to_content() function

That is interesting! I was just wondering if there are other "standardizations" that people typically want to do. That function would return None, right?

More likely returning self, to allow cascading
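The two API ideas discussed here, a rotation property with getter and setter and a method that returns self so calls can be cascaded, can be sketched with a toy class. PageSketch is a hypothetical stand-in, not the PyPDF2 page class, and its transfer_rotate_to_content only swaps the stored size; the real operation would also have to rewrite the content and the boxes.

```python
class PageSketch:
    """Toy stand-in for a page; NOT the PyPDF2 PageObject."""

    def __init__(self, width, height, rotation=0):
        self.width = width
        self.height = height
        self._rotation = rotation % 360

    @property
    def rotation(self):
        """Current /Rotate-like value in degrees (clockwise)."""
        return self._rotation

    @rotation.setter
    def rotation(self, degrees):
        # Assumption of this sketch: only quarter turns are accepted.
        if degrees % 90 != 0:
            raise ValueError("rotation must be a multiple of 90")
        self._rotation = degrees % 360

    def transfer_rotate_to_content(self):
        """Toy version: fold the flag into the page size and reset it."""
        if self._rotation in (90, 270):
            self.width, self.height = self.height, self.width
        self._rotation = 0
        return self  # returning self allows cascading calls


page = PageSketch(841.68, 595.2, rotation=180)
page.rotation += 90  # same wording for reading and writing
fixed = page.transfer_rotate_to_content()
print(fixed.width, fixed.height, fixed.rotation)
```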

Sep 14 '22 06:09 pubpub-zz

Wow! This is going, people. Thank you so much for your time on this problem. The last modification almost solved my issue.

I did this based on @pubpub-zz's example:

import PyPDF2

a4_file = PyPDF2.PdfReader('a4_file.pdf')
a4_page = a4_file.pages[0]

blue_file = PyPDF2.PdfReader('blue_file.pdf')
blue_page = blue_file.pages[0]

pink_file = PyPDF2.PdfReader('pink_file.pdf')
pink_page = pink_file.pages[0]

pink_page.add_transformation(
    PyPDF2.Transformation()
    .translate(-float(pink_page.mediabox.getWidth()) / 2, -float(pink_page.mediabox.getHeight()) / 2)
    .rotate(90)
    .translate(float(pink_page.mediabox.getHeight()) / 2, float(pink_page.mediabox.getWidth()) / 2)
    .translate(float(blue_page.mediabox.getWidth()), 0),
    True,
)
a4_page.merge_page(blue_page)
a4_page.merge_page(pink_page)

writer = PyPDF2.PdfWriter()
writer.add_page(a4_page)

with open('result.pdf', 'wb') as final:
    writer.write(final)

Considering I have these files: a4_file blue_file pink_file

This is the result I'd like to get: expected

Instead, this is the output I'm getting with the code above: result

Upon further inspection in several PDF editing programs, I can see the transformation is happening, but somehow there is a clipping mask on the pink_page that cuts it off. We can observe it in the file wireframe: result_wf

At this point, I feel like I'm asking too much. Sorry for that. But, if somehow, someone could take some time on this issue, it would be much appreciated.

Anyway, thank you all for your time on this package. It is awesome, and I love it!

Sep 14 '22 14:09 carecavoador

People, I think I found a solution! 😄

At least now I know what the issue is: when transformations are applied, the page boxes '/ArtBox', '/BleedBox', '/CropBox', '/MediaBox', '/TrimBox' are not updated accordingly. So you have to do it manually.
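One way to work out what an updated box should be is to push the four corners of the old box through the same matrix and take the bounding rectangle. The sketch below is plain Python with hypothetical helper names (transform_point, transformed_box); it assumes the six-number matrix form (a, b, c, d, e, f) used in PDF and does not call PyPDF2.

```python
def transform_point(ctm, x, y):
    """Apply a PDF-style matrix (a, b, c, d, e, f) to one point."""
    a, b, c, d, e, f = ctm
    return a * x + c * y + e, b * x + d * y + f


def transformed_box(box, ctm):
    """Bounding rectangle of a [llx, lly, urx, ury] box after ctm."""
    llx, lly, urx, ury = (float(v) for v in box)
    corners = [(llx, lly), (urx, lly), (urx, ury), (llx, ury)]
    moved = [transform_point(ctm, x, y) for x, y in corners]
    xs = [p[0] for p in moved]
    ys = [p[1] for p in moved]
    return [min(xs), min(ys), max(xs), max(ys)]


# A quarter turn counterclockwise, then a shift right by the old height
# so that the result sits at the origin again.
quarter_turn = (0, 1, -1, 0, 842, 0)
print(transformed_box([0, 0, 595, 842], quarter_turn))
```

For a 595 by 842 box this gives an 842 by 595 rectangle starting at the origin, which is the kind of value the boxes would need to be set to by hand.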

Here is an example where I successfully centered an A4 page inside an A3 page on the X axis while keeping it aligned at the bottom.

Example files: A3.pdf A4.pdf

Code:

from PyPDF2 import PdfReader, Transformation, PdfWriter
from PyPDF2.generic import RectangleObject

a3_file = PdfReader('A3.pdf')
a3_page = a3_file.pages[0]

a4_file = PdfReader('A4.pdf')
a4_page = a4_file.pages[0]

# Rotate the A4 page 90 degrees and center it on the A3 page:
a4_page.add_transformation(
    Transformation()
    .translate(-float(a4_page.mediaBox.getWidth()) / 2, -float(a4_page.mediaBox.getHeight()) / 2)
    .rotate(90)
    .translate(float(a4_page.mediaBox.getHeight()) / 2, float(a3_page.mediaBox.getLowerLeft_y() + a4_page.mediaBox.getWidth() / 2))
)

# Manually update the page boxes to match the rotated content:
new_width = a4_page.mediaBox.getHeight()
new_height = a4_page.mediaBox.getWidth()
new_y = a3_page.mediaBox.getLowerLeft_y()

# All five boxes get the same rectangle.
for box_name in ('/ArtBox', '/BleedBox', '/CropBox', '/MediaBox', '/TrimBox'):
    a4_page.update({box_name: RectangleObject([0, new_y, new_width, new_y + new_height])})

a3_page.merge_page(a4_page)

writer = PdfWriter()
writer.add_page(a3_page)
with open('merged_a3.pdf', 'wb') as fo:
    writer.write(fo)

Sep 15 '22 16:09 carecavoador

@pubpub-zz thank you! Your solution with rotating the stamp file works perfectly!

Sep 16 '22 08:09 Nikita7x

@pubpub-zz thank you! Your solution with rotating the stamp file works perfectly!

@Nikita7x Have you applied the change from the PR ?

Sep 16 '22 11:09 pubpub-zz

I updated PyPDF2 to a newer version that supports the transfer_rotation_to_content method, and now I'm getting an exception when I call this method on pages with rotation:

first_page = original_pdf.getPage(0)
first_page.transfer_rotation_to_content()

transfer_rotation_to_content

    pt1 = trsf.apply_on(cast(RectangleObject, self[b]).lower_left)
AttributeError: 'ArrayObject' object has no attribute 'lower_left'

The exact file: rot_file.pdf
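The failing line in the traceback calls cast(RectangleObject, ...). In Python, typing.cast performs no conversion at run time, which would be consistent with this error if the stored box is a plain array. The sketch below uses toy classes to show the difference; ArraySketch and RectangleSketch are hypothetical stand-ins, not PyPDF2 types, and this is not the library's actual fix.

```python
from typing import cast


class ArraySketch(list):
    """Toy stand-in for a plain PDF array."""


class RectangleSketch(list):
    """Toy stand-in for a rectangle type with corner helpers."""

    @property
    def lower_left(self):
        return (self[0], self[1])


raw_box = ArraySketch([0, 0, 841.68, 595.2])

# cast() only informs the type checker; the object is returned unchanged.
still_array = cast(RectangleSketch, raw_box)
print(type(still_array).__name__)           # ArraySketch
print(hasattr(still_array, "lower_left"))   # False

# Constructing the rectangle type performs the actual conversion.
real_rect = RectangleSketch(raw_box)
print(real_rect.lower_left)                 # (0, 0)
```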

Sep 19 '22 13:09 Nikita7x
