Pillow RGBA PNG saved as PDF renders incorrectly in some applications

What did you do?

Convert a PNG file (screenshots created on Gnome and on Windows) to PDF.

What did you expect to happen?

The rendered PDF looks correct.

What actually happened?

The output looks correct in MuPDF and Chromium, but incorrect in Evince and pdf.js.

What are your OS, Python and Pillow versions?

OS: OpenSUSE Leap 15.4
Python: 3.9.18
Pillow: 10.3.0

--------------------------------------------------------------------
Pillow 10.3.0
Python 3.9.18 (main, Sep 06 2023, 07:49:32) [GCC]
--------------------------------------------------------------------
Python executable is /home/stefan/tmp/venv1/bin/python3
Environment Python files loaded from /home/stefan/tmp/venv1
System Python files loaded from /usr
--------------------------------------------------------------------
Python Pillow modules loaded from /home/stefan/tmp/venv1/lib64/python3.9/site-packages/PIL
Binary Pillow modules loaded from /home/stefan/tmp/venv1/lib64/python3.9/site-packages/PIL
--------------------------------------------------------------------
--- PIL CORE support ok, compiled for 10.3.0
*** TKINTER support not installed
--- FREETYPE2 support ok, loaded 2.13.2
--- LITTLECMS2 support ok, loaded 2.16
--- WEBP support ok, loaded 1.3.2
--- WEBP Transparency support ok
--- WEBPMUX support ok
--- WEBP Animation support ok
--- JPEG support ok, compiled for libjpeg-turbo 3.0.2
--- OPENJPEG (JPEG2000) support ok, loaded 2.5.2
--- ZLIB (PNG/ZIP) support ok, loaded 1.2.11
--- LIBTIFF support ok, loaded 4.6.0
--- RAQM (Bidirectional Text) support ok, loaded 0.10.1, fribidi 1.0.10, harfbuzz 8.4.0
*** LIBIMAGEQUANT (Quantization method) support not installed
--- XCB (X protocol) support ok
--------------------------------------------------------------------

from PIL import Image

Image.open("image2vu2shjb.png").save("out2.pdf")

Input:

image2vu2shjb

Output: out2.pdf

Rendered output from Evince:

ksnip_20240522-121933

Rendered output from pdf.js:

ksnip_20240522-121949

May 22 '24 10:05 stefan6419846

Could you share the code that you're running to get that output from pdf.js?

May 22 '24 10:05 radarhere

The pdf.js output (as well as the Evince one for Evince) is just a screenshot of what I see when opening the PDF file with Firefox where pdf.js is the default viewer.

May 22 '24 10:05 stefan6419846

Pillow uses JPXDecode when saving RGBA PDFs.

Two issues have been opened with pdf.js about rendering from this filter - https://github.com/mozilla/pdf.js/issues/16782 and https://github.com/mozilla/pdf.js/issues/17416 - so I don't think this is explicitly a Pillow bug.

May 22 '24 13:05 radarhere

Thanks for the research. Given that Evince and Okular show the same behavior, it seems like this is not too uncommon for free tools.

As a consequence, it seems like using Pillow to convert RGBA images to PDF files still has its limitations (although rather on the client side) after having been unsupported previously. Thus I am still left with either pasting the RGBA image onto a white background for the conversion or avoid the color space conversion altogether.

May 22 '24 13:05 stefan6419846

As a heads up out of curiosity: Do we really need the image conversion from PNG to JPEG2000? Shouldn't we be able to just use the original PNG image inside the PDF file?

May 26 '24 18:05 stefan6419846

You can see on page 31 of https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf that JPEG2000 is used in the JPXDecode filter.

JPEG2000 is naturally supported in a way that PNGs are not. It does appear that we could split up the data into different roles and pass that to the PDF, but it's not as simple as JPEG2000 encoding.

May 26 '24 20:05 radarhere

The PdfImage class in pdfrwx embeds images with transparency using PDF image xobject's soft masks — which is how it should be embedded according to PDF specs. (disclaimer: I am the author of pdfrwx; and I should also mention that the interface of the class will substantially change in the upcoming new version, so stay tuned).

May 29 '24 16:05 sl2c

the approach of using JPX encoding is valid and in accordance with PDF spec : the issue is on definitely on pdf.js. does any want to raise up on https://github.com/mozilla/pdf.js/issues/16782 ? 😳

May 29 '24 17:05 pubpub-zz

the approach of using JPX encoding is valid and in accordance with PDF spec : the issue is on definitely on pdf.js. does any want to raise up on mozilla/pdf.js#16782 ? 😳

I am not sure about that:

Opacity and premultiplied opacity channels are associated with specific color channels. There is never more than one opacity channel (of either type) associated with a given color channel. For example, it is possible for one opacity channel to apply to the red samples and another to apply to the green and blue color channels of an RGB image. Note: The method by which the opacity information is to be used is explicitly not specified, although one possible method shows a normal blending mode.

— this is from Adobe PDF Reference v. 1.7 page 88. If this is what you're referring to then the opacity mentioned above is not the same as transparency everywhere else (PNG etc.). This feature is specific to the JPEG2000 and has made its way into the standard simply for compatibility reasons (i.e., to be able to "just embed the JPEG2000 file"). In particular, please pay attention to the note above. This note should discourage any implementer to put anything into the JPEG2000 opacity channels other than in situations where you already have a JPEG2000 file in the first place and just want to keep it in the stream in exactly the state you got it in. And even in those circumstances, one should really just split JPEG2000 into a several image XObjects (one per channel with opacity). Because if you don't then you're at the mercy of the PDF processing application to interpret the opacity however it likes.

May 29 '24 18:05 sl2c

just below :

The use of Opacity/Transparency in JPX is perfectly valid, even more, it provides capability to define 1 transparency information per a secondary channel which is not possible using masks. My understanding of the note is that the way to mix/display/order the channels and transparency. There is no specification about it. I personnally disagree with your proposal to split into multiple XObject

May 29 '24 19:05 pubpub-zz

As I said, if you start with a JPEG2000 file (which possibly has channel-specific opacity) I might see reasons to keep it all inside one /JPXDecode -encoded dictionary stream. However, when starting with any other image format I know of, including PNG, 1) there's no benefit from the possibility of using channel-specific opacity since since we don't have any to begin with; 2) there's the downside of doing something which is not fully described by the spec, as opposed to something that is fully described by it. So, given this, could you explain the logic behind the decision to encode PNGs (or any other image format with transparency besides JPEG2000, for that matter) with /JPXDecode?

Added: PNG specifically can use filters, which have their exact counter-parts in PDF (see PNG predictors in /FlateDecode). It makes all the more sense to encode PNGs as /FlateDecode.

May 29 '24 19:05 sl2c

a) I remind that the issue is identified as an issue not yet solved of pdf.js : I see no reason to change pillow to fix it b) With your approach, I load an image from a JP2 file, then I resize it and produce a pdf : the original format will have be lost so I will not produce a JPXDecode.

May 29 '24 20:05 pubpub-zz

This is not an issue limited to pdf.js - It seems like most libraries/viewers/tools based upon poppler and/or cairo share the same limitation. Thus it depends on the internal implementation of each library.

If Pillow is able to generate PDF files which render correctly with the existing libraries while ideally allowing for clean extraction with pdfimages, mutool extract and pypdf.PdfReader.images for example, this would make the most sense in my opinion.

May 30 '24 07:05 stefan6419846

just reminding the current implementation: https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#pdf

Pillow can write PDF (Acrobat) images. Such images are written as binary PDF 1.4 files. Different encoding methods are used, depending on the image mode. - 1 mode images are saved using TIFF encoding, or JPEG encoding if libtiff support is unavailable - L, RGB and CMYK mode images use JPEG encoding - P mode images use HEX encoding - LA and RGBA mode images use JPEG2000 encoding

Jun 02 '24 07:06 pubpub-zz

I would strongly suggest changing the policy on the last line to using PDF image SMask-s. This is what SMask-s have been intended for in the first place.

I also believe that for the case of "converting a JPEG2000 file to PDF" it should be possible to repackage JPEG2000 files in LA/RGBA modes as two separate files, one for the colors and one for the transparency, without the need to re-encode the pixel data, and then make a PDF image with two dictionary streams, for the image itself and for its SMask, encoded with /JPXDecode filters. But I'm not sure how relevant this issue is to Pillow in particular since looks like Pillow does not keep the original encoded JPEG2000 streams around when opening a JPEG2000 file (see https://github.com/python-pillow/Pillow/discussions/7896).

To reiterate: I believe that PDF files with images containing streams of LA/RGBA images in the /JPXDecode filter should be avoided altogether.

Jun 02 '24 07:06 sl2c

I would strongly suggest changing the policy on the last line to using PDF image SMask-s. This is what SMask-s have been intended for in the first place.

for me pillow deals with image manipulation /conversion and PDF is not an image format specially if you consider that sMask solution as 2 images. What I propose is to see how to implement such a image building in pypdf library.

Jun 02 '24 08:06 pubpub-zz

PDF treats images with SMask-s as one image. Dictionary-wise, SMask is just another entry in the image dictionary. It looks especially simple in pdfrw:

image = IndirectPdfDict()
mask = IndirectPdfDict()

mask.Subtype = PdfName.Image
mask.stream = alpha_stream

image.Subtype = PdfName.Image
image.stream = colors_stream
image.SMask = mask

That's it.

Does Pillow already have a PyPDF dependency for other things? If not, I would suggest taking a look at pdfrw as it is trivial to create low-level objects with it.

Jun 02 '24 08:06 sl2c

I would prefer Pillow just implement the SMask solution itself rather than add an external dependency. I've created #8097. See what you think.

Jun 02 '24 13:06 radarhere

Apart from avoiding a third-party dependency, pdfrw has been unmaintained for years, thus adding a new dependency on it does not really feel future-proof.

Jun 02 '24 19:06 stefan6419846

The PDF.js issue has now been resolved! https://github.com/mozilla/pdf.js/issues/16782

It will still require another release of PDF.js, and then a release of Firefox to include that, but it is a positive development, and I would rather wait for that proper fix than the workaround of my PR.

Jun 04 '24 04:06 radarhere

Although pdf.js might have a fix, it seems like poppler/cairo still has this issue and is widely used as well: https://gitlab.freedesktop.org/poppler/poppler/-/issues/1486

Jun 04 '24 11:06 stefan6419846

poppler being used in Evince, and so is effectively the other software that you reported in the original comment.

I'm not sold on the idea that if a viewer has a bug, then Pillow needs to workaround that - it seems like a slippery slope towards accepting responsibility for the problems of every image viewer out there. Sure, sometimes if a viewer isn't displaying an image correctly, that's a sign that Pillow has made a mistake, but pdf.js accepted the bug and fixed their end, so that isn't the case here.

In the case of the image that you posted here, I don't see any transparency, so you could quite easily workaround this situation by converting the image to RGB.

Jun 05 '24 05:06 radarhere

I am aware that you surely are not responsible for the rendering of other tooling. poppler just tends to be more or less the default library for most PDF viewers as far as I am aware.

Personally, I see a general issue with converting images with an alpha channel to a fixed background, as this would require actual automated content analysis to choose the correct background color (screenshots and most common images tend to work with white indeed), but this is not directly related to this issue.

Jun 05 '24 06:06 stefan6419846

Testing with the latest version of Firefox (129.0.2), it is now rendering the PDF correctly - so the PDF.js fix has made it through.

Aug 22 '24 10:08 radarhere

Regardless of viewer issues, I feel like image transparency should generally be handled via SMasks instead of encoding to JP2. The JP2 alpha in PDF tends to be a niche hack that does not align well with the rest of the format, seeing as all other filters need external handling of transparency. So using SMasks is a lot more flexible and consistent; it allows you to choose any other compatible filter, which might be more suited to the data or the caller's needs.

Sep 01 '24 15:09 mara004

I feel like image transparency should generally be handled via SMasks instead of encoding to JP2

The practical downside to this is that #8097 increases the PDF size of this issue's image from 15,721 bytes to 26,604 bytes.

using SMasks is a lot more flexible and consistent; it allows you to choose any other compatible filter, which might be more suited to the data or the caller's needs.

I'm not following you here. Pillow currently offers a limited set of options when saving a PDF - users aren't able to arbitrarily choose their own filter (without change the image mode, that is). To what end is it helpful if the data is saved in a flexible manner? Are you planning on running other software over Pillow's PDFs, to... I don't know, remove the transparency, and that's easier if the mask is separate?

Sep 02 '24 09:09 radarhere

The practical downside to this is that #8097 increases the PDF size of this image's issue from 15,721 bytes to 26,604 bytes.

providing file sizes without specifying the file doesn't help much; 2) the use of SMask-s vs. encoding transparency in the JPXDecode stream can by itself lead to at most plus/minus a few bytes difference in size; any difference beyond that has to do with the filter you use for the stream — i.e., is due to the implementation and not the design choice.

To what end is it helpful if the data is saved in a flexible manner?

The reason SMasks have been created as a separate structure in the PDF language (as opposed to just providing support in the standard for image streams that support transparency, such as PNG, TIFF and JPEG2000) is indeed that of flexibility. Here's one example: suppose you want to produce a version of a PDF for the web, which is was originally intended for print. You'd might need to convert CMYK images to RGB, and compress those in RGB to reduce the file size. The code that does the CMYK-RGB conversion is independent of the alpha-channel and can work on the colors channel only. Same with the compression code. Moreover, the compression is normally (e.g. in Acrobat) done on colors channel only as compressing the transparency mask can (and often does) introduce visible artifacts.

In summary: there's a reason why SMask-s have been introduced, and before arguing against their utility I may suggest studying the history of the subject first; a chat with ChatGPT might be of use here.

Sep 02 '24 11:09 sl2c

The practical downside to this is that https://github.com/python-pillow/Pillow/pull/8097 increases the PDF size of this image's issue from 15,721 bytes to 26,604 bytes.

providing file sizes without specifying the file doesn't help much

I was referring to the image from the start of this issue, saved with the code from the start of this issue - from PIL import Image; Image.open("image2vu2shjb.png").save("out2.pdf")

the use of SMask-s vs. encoding transparency in the JPXDecode stream can by itself lead to at most plus/minus a few bytes difference in size; any difference beyond that has to do with the filter you use for the stream — i.e., is due to the implementation and not the design choice.

My experience in #8097 is that the SMask is a separate image, and the output is noticeably larger. If you're able to put together an implementation demonstrating that I've missed a simple way to drastically reduce the file size, feel free to prove me wrong.

suppose you want to produce a version of a PDF for the web, which is was originally intended for print. You'd might need to convert CMYK images to RGB, and compress those in RGB to reduce the file size. The code that does the CMYK-RGB conversion is independent of the alpha-channel and can work on the colors channel only

Ok, so the 'flexibility' argument is intended to help Pillow's internal function, and this isn't for the benefit of the end user? Except Pillow doesn't perform conversions like you're describing when writing the PDF images.

Moreover, the compression is normally (e.g. in Acrobat) done on colors channel only as compressing the transparency mask can (and often does) introduce visible artifacts.

If you have an example of an image that Pillow is saving imperfectly, that would be helpful to demonstrate this.

I'm not trying to argue that SMasks are bad - I'm asking what the benefit is. It doesn't seem to simplify the code, and from what I've seen so far, it increases file size, so I'm wondering why they're helpful to either Pillow or the end user. I think we need a reason to change Pillow's current behaviour by using explicit SMasks for LA/RGBA, not a reason not to change it.

Sep 02 '24 20:09 radarhere

The practical downside to this is that https://github.com/python-pillow/Pillow/pull/8097 increases the PDF size of this image's issue from 15,721 bytes to 26,604 bytes.

providing file sizes without specifying the file doesn't help much

I was referring to the image from the start of this issue, saved with https://github.com/python-pillow/Pillow/issues/8074#issue-2310154973 - from PIL import Image; Image.open("image2vu2shjb.png").save("out2.pdf")

A thing to note is that this does not only compare separate vs. inline alpha, but also JPEG vs. JPEG2000, where the latter commonly gives a better quality:compression ratio, so it'd be interesting what part of that difference might just be caused by the other codec? However, if it's true that the increase is mainly due to the separate SMask, then that's a fair point.

That said, if you don't mind the somewhat uncommon JPEG2000, why not use it for L/RGB/CMYK, too? That might be more consistent.

Sep 02 '24 21:09 mara004

That said, if you don't mind the somewhat uncommon JPEG2000, why not use it for L/RGB/CMYK, too? That might be more consistent.

I tried this with

diff --git a/src/PIL/PdfImagePlugin.py b/src/PIL/PdfImagePlugin.py
index e9c20ddc1..2d91bd924 100644
--- a/src/PIL/PdfImagePlugin.py
+++ b/src/PIL/PdfImagePlugin.py
@@ -88,7 +88,7 @@ def _write_image(
         dict_obj["ColorSpace"] = PdfParser.PdfName("DeviceGray")
         procset = "ImageB"  # grayscale
     elif im.mode == "L":
-        decode_filter = "DCTDecode"
+        decode_filter = "JPXDecode"
         # params = f"<< /Predictor 15 /Columns {width-2} >>"
         dict_obj["ColorSpace"] = PdfParser.PdfName("DeviceGray")
         procset = "ImageB"  # grayscale
@@ -116,7 +116,7 @@ def _write_image(
             image_ref = _write_image(smask, filename, existing_pdf, image_refs)[0]
             dict_obj["SMask"] = image_ref
     elif im.mode == "RGB":
-        decode_filter = "DCTDecode"
+        decode_filter = "JPXDecode"
         dict_obj["ColorSpace"] = PdfParser.PdfName("DeviceRGB")
         procset = "ImageC"  # color images
     elif im.mode == "RGBA":

import os
from PIL import Image
im = Image.open("Tests/images/hopper.png")
for mode in ("L", "RGB"):
    im.convert(mode).save(mode+".pdf")
    print(mode, os.stat(mode+".pdf").st_size)

but it actually increased the filesize.

A thing to note is that this does not only compare separate vs. inline alpha, but also JPEG vs. JPEG2000, where the latter commonly gives a better quality:compression ratio, so it'd be interesting what part of that difference might just be caused by the other codec?

Testing the image from the start of this issue with #8097 but still with JPXDecode,

diff --git a/src/PIL/PdfImagePlugin.py b/src/PIL/PdfImagePlugin.py
index 355eb55fb..1a2561bbc 100644
--- a/src/PIL/PdfImagePlugin.py
+++ b/src/PIL/PdfImagePlugin.py
@@ -106,7 +106,7 @@ def _write_image(im, filename, existing_pdf, image_refs):
         if "transparency" in im.info:
             smask = im.convert("LA")
     elif im.mode in ("RGB", "RGBA"):
-        filter = "DCTDecode"
+        filter = "JPXDecode"
         dict_obj["ColorSpace"] = PdfParser.PdfName("DeviceRGB")
         procset = "ImageC"  # color images
         if im.mode == "RGBA":
@@ -145,6 +145,9 @@ def _write_image(im, filename, existing_pdf, image_refs):
             # use a single strip
             strip_size=math.ceil(width / 8) * height,
         )
+    elif filter == "JPXDecode":
+        del dict_obj["BitsPerComponent"]
+        Image.SAVE["JPEG2000"](im, op, filename)
     elif filter == "DCTDecode":
         Image.SAVE["JPEG"](im, op, filename)
     else:

I get 25,595 bytes - so the increase is indeed primarily due to the SMask, not the change in filter.

Sep 02 '24 22:09 radarhere