Support for PANTONE colors
Explanation
I recently stumbled upon a PDF file which would emit
Color /PANTONE 2935 C converted to Gray. Please share PDF with pypdf dev team
Apparently, we currently do not support Pantone colors, although I am not sure about the general licensing situation here.
Resources
Doing some research, I stumbled upon the following public data sources:
- Conversion chart to CMYK at https://www.risinghillmarketing.com/wp-content/uploads/2016/03/pantone-color-bridge-cmyk-pc.pdf
- https://codebeautify.org/pantone-to-cmyk-converter with a hard-coded mapping at https://codebeautify.org/dist/9.2/js/pantonJS.js, but without any source or license hint
- http://www.excaliburcreations.com/pantone.html which can only be used with explicit permission. The values appear to be off compared to the PDF and JavaScript code above
- https://github.com/adonald/Pantone-CMYK-RGB-Hex/blob/master/pantone_CMYK_RGB_Hex.json based upon the previous website (the repository states the MIT license, but I have some doubts about this)
I don't know how to manage this: the color definitions seem to be protected as well. See page 3: https://www.pantone.com/media/downloads/customer-service/Pantone_Color_Services_Terms_and_Conditions.pdf
As it should not really hurt, I just sent them a message asking about FOSS usage. Depending on whether we receive a (positive) response, we should be able to either implement this or emit a clear message for this case to indicate that there is nothing we can do about this specific case.
As it should not really hurt, I just sent them a message asking about FOSS usage. Depending on whether we receive a (positive) response, we should be able to either implement this or emit a clear message for this case to indicate that there is nothing we can do about this specific case.
Or implement a way to pass a colors lookup table
We will have to rely on the lookup table if we want to add support for it - my request to them just got denied.
By the way: For some reason, the previous comment has been added twice, thus I deleted the duplicate.
@stefan6419846 This seems to be a general problem of getting color values for the special color spaces (such as Separation and DeviceN) from their alternate color space definition, which is always part of them and would probably solve the issue. So it could help to implement a query for the alternate color space, or even a color transform object that can handle PDF colors of the special color spaces (which PANTONE colors are) and convert them to RGB (or any other base color space) to understand the color intent (is it a "green" or a "red"?) in that situation. However, this requires implementing function types 0 to 4 according to the PDF spec, which includes PostScript-based functions, plus an implementation to handle ICC profiles, which is, to be honest, quite a large project. I am currently thinking about an implementation of color spaces in general. Are there any plans to do this, or is something planned/in progress?
Thanks for your comment. I am sorry, but somehow I cannot completely follow you here, possibly due to my limited general knowledge of color spaces, where you clearly have some advantages with your background.
I would prefer to keep the different aspects which might interact with this separate where possible. This means:
- This specific issue/ticket is only about PANTONE colors when extracting images (as far as I understand). The goal is to find a suitable/future-proof way to allow the user to provide a mapping of PANTONE colors to their nearest "known" color space.
- The different function types from the PDF specification would be nice to have in general (there are other use cases as well), but discussions should go into a separate issue.
- For other unsupported color spaces or full ICC support, we would need a separate issue as well. As you mention, this might become very complex rather fast for something which most of the regular users never have to deal with.
PANTONE colors are used in /Separation and /DeviceN color spaces as they are intended to be "ink" channels. That means that they MUST have (according to the PDF spec) an alternate color space that is one of the base color spaces (DeviceGray, DeviceRGB, DeviceCMYK, ICCBased (which can be Gray, RGB or CMYK as well), Lab and CalRGB (actually the ancestor of Lab)).
To make it short: it is always possible to get for instance a RGB representation of a PANTONE ink, what actually is needed here, I think.
As color spaces are fundamentals of PDF they should be represented by a high level object in pypdf. The application of their properties (color transformations) might be out of scope, because pypdf is a library and not an application ...
To make it short: it is always possible to get for instance a RGB representation of a PANTONE ink, what actually is needed here, I think.
We have to be careful about copyright: https://en.wikipedia.org/wiki/Pantone#Intellectual_property (see also https://en.wikipedia.org/wiki/Pantone). I'm not sure whether we are allowed to hardcode a table mapping the color names to RGB.
I remember some discussion about adding a lookup file users could reference for these cases
As color spaces are fundamentals of PDF they should be represented by a high level object in pypdf. The application of their properties (color transformations) might be out of scope, because pypdf is a library and not an application ...
Do I understand this section correctly, that in your opinion pypdf should not have to care about the color space, but just forward this to the user?
This already is the case for basic access. The actual issue with the PDF file given in the initial report is that extracting images attempts to apply all required transformations like alpha masks and generates a Pillow image from it, requiring conversions for cases where Pillow does not support the color space or where further processing is required. For the image in question to be extracted in a (mostly) correct manner, we would have to tell pypdf what "PANTONE 2935 C" maps to.
Due to how PANTONE is licensed and the usual other color spaces only providing an approximation (being somehow discouraged by PANTONE), such a mapping has to be supplied externally and cannot safely be shipped with pypdf itself.
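An externally supplied mapping of this kind could look roughly like the following. This is only a sketch: `resolve_colorant` and `user_table` are hypothetical names, not existing pypdf API, and the CMYK tuple is a made-up placeholder rather than a real PANTONE value.

```python
from typing import Dict, Optional, Tuple

CMYK = Tuple[float, float, float, float]


def _normalize(name: str) -> str:
    """Drop a leading slash, collapse whitespace and ignore case."""
    return " ".join(name.lstrip("/").split()).upper()


def resolve_colorant(
    name: str, user_table: Optional[Dict[str, CMYK]] = None
) -> Optional[CMYK]:
    """Return the user-supplied CMYK value for a colorant, or None if unknown."""
    if not user_table:
        return None
    normalized = {_normalize(key): value for key, value in user_table.items()}
    return normalized.get(_normalize(name))


# Placeholder entry for illustration only; NOT an official PANTONE value.
user_table = {"PANTONE 2935 C": (1.0, 0.5, 0.0, 0.0)}

print(resolve_colorant("/PANTONE 2935 C", user_table))
print(resolve_colorant("/PANTONE 123 C", user_table))
```

Returning `None` for unknown names leaves the caller free to fall back to another strategy or to emit a clear warning.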
As color spaces are fundamentals of PDF they should be represented by a high level object in pypdf. The application of their properties (color transformations) might be out of scope, because pypdf is a library and not an application ...
Do I understand this section correctly, that in your opinion pypdf should not have to care about the color space, but just forward this to the user?
Currently there is no "color space" object that is built from the color space arrays or dictionaries. It could be useful to have such an entity for the following reasons:
- get the current number of color components and their names (e.g. one component, name=PANTONE 2935 C)
- get, for arbitrary color component names (none of the standard RGB, CMYK, ...), e.g. PANTONE 2935 C, the alternate color space that is always inside the color space array / dictionary in the PDF file, to understand the intended color value
The alternate color space is a default fallback built into the PDF specification to handle exactly the cases described here: I have no clue about PANTONE colors and never licensed them, but I need to render the PDF and show it as RGB, CMYK, ... So this problem has been foreseen by the PDF specification and has been solved with the alternate color space, which must be provided by the PDF writer.
So I see an advantage in wrapping all color spaces into a common "color space" entity that has common methods to ask for the fundamental properties (number of components, base color space, component names, alternate color space and values). It could even contain a convenience function that returns, for any color component of a PANTONE ink in the range 0...1, an appropriate color value in the alternate color space. In my opinion, this is something that is in scope of a library.
To be clear: this is the intended use of the PDF specification and there are no licenses from PANTONE needed.
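To make the idea of such an entity more concrete, here is a minimal sketch. `SpecialColorSpace` and `from_array` are hypothetical names, not existing pypdf classes, and the input is assumed to be an already resolved color space array given as plain Python lists and strings.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class SpecialColorSpace:
    """Hypothetical wrapper around a Separation or DeviceN color space array."""

    family: str
    component_names: List[str]
    alternate: Any        # name or array of the alternate color space
    tint_transform: Any   # function object mapping tints to the alternate space

    @property
    def n_components(self) -> int:
        return len(self.component_names)

    @classmethod
    def from_array(cls, array: List[Any]) -> "SpecialColorSpace":
        family = str(array[0]).lstrip("/")
        if family == "Separation":
            names = [array[1]]       # a single colorant name
        elif family == "DeviceN":
            names = list(array[1])   # an array of colorant names
        else:
            raise ValueError(f"not a special color space: {family}")
        return cls(family, [str(n).lstrip("/") for n in names], array[2], array[3])


cs = SpecialColorSpace.from_array(
    ["/DeviceN", ["/PANTONE 2935 C"], "/DeviceCMYK", None]
)
print(cs.n_components, cs.component_names, cs.alternate)
```

The tint transform is kept opaque here; evaluating it is a separate concern.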
Sorry, but I am still having trouble understanding what you intend here and how a separate data structure for color spaces will help. If the color space is required outside of the internal image processing code, you can always access the corresponding (lower-level) PDF objects directly.
If the general issue is just about using a provided fallback color space if PANTONE is specified, this is something which I see as a viable solution for this issue, although someone who has licensed PANTONE might still want to use the actual PANTONE color mapping instead of the fallback.
Let's take the example PDF file from above: The PANTONE color space is used here:
438 0 obj
[/DeviceN[/PANTONE#202935#20C]/DeviceCMYK 488 0 R 494 0 R]
endobj
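As a side note on the name `/PANTONE#202935#20C`: in PDF name objects, `#` followed by two hexadecimal digits encodes a single character, so `#20` is a space and the name reads "PANTONE 2935 C". A minimal sketch of the decoding follows; `decode_pdf_name` is a hypothetical helper, and it only handles single-byte (ASCII/Latin-1) escapes.

```python
import re


def decode_pdf_name(raw: str) -> str:
    """Replace each #xx hex escape in a PDF name with the character it encodes."""
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda match: chr(int(match.group(1), 16)),
        raw,
    )


print(decode_pdf_name("/PANTONE#202935#20C"))
```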
It is referenced from the following object:
847 0 obj
<</BitsPerComponent 8/ColorSpace 438 0 R/Filter/DCTDecode/Height 380/Intent/RelativeColorimetric/Length 27387/Metadata 844 0 R/Name/X/SMask 846 0 R/Subtype/Image/Type/XObject/Width 490>>stream
...
endstream
Looking at the /DeviceCMYK references, we have a function:
488 0 obj
<</Domain[0.0 1.0]/Filter/FlateDecode/FunctionType 4/Length 123/Range[0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0]>>stream
...
endstream
Additionally, we have another reference which finally resolves to a Separation color space with another function:
494 0 obj
<</Colorants 489 0 R/Subtype/NChannel>>
endobj
489 0 obj
<</PANTONE#202935#20C 465 0 R>>
endobj
465 0 obj
[/Separation/PANTONE#202935#20C/DeviceCMYK<</C0[0.0 0.0 0.0 0.0]/C1[1.0 0.641566 0.0 0.0]/Domain[0 1]/FunctionType 2/N 1.0/Range[0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0]>>]
endobj
If I understand the discussion correctly, if these functions were implemented in pypdf, we would not need direct support for PANTONE in pypdf, as we could calculate the desired conversion values with the provided functions?
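For the Separation color space in object 465 above, the function is of type 2 (exponential interpolation), which the PDF specification defines as y_j = C0_j + x^N * (C1_j - C0_j). A minimal sketch of evaluating it with the values from that object; `eval_type2` is a hypothetical helper, not existing pypdf API.

```python
from typing import List, Sequence, Tuple


def eval_type2(
    x: float,
    c0: Sequence[float],
    c1: Sequence[float],
    n: float,
    domain: Tuple[float, float] = (0.0, 1.0),
) -> List[float]:
    """Evaluate a PDF type 2 function for a single input value x."""
    # Clip the input to the declared /Domain first.
    x = min(max(x, domain[0]), domain[1])
    return [a + (x ** n) * (b - a) for a, b in zip(c0, c1)]


# Values copied from object 465 of the example file.
C0 = [0.0, 0.0, 0.0, 0.0]
C1 = [1.0, 0.641566, 0.0, 0.0]

full_tint = eval_type2(1.0, C0, C1, 1.0)  # 100 % ink
half_tint = eval_type2(0.5, C0, C1, 1.0)  # 50 % ink
print(full_tint, half_tint)
```

The result is a CMYK tuple in the alternate color space. The type 4 function in object 488 is a PostScript calculator function and would need a small interpreter instead.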
Sorry, but I am still having trouble understanding what you intend here and how a separate data structure for color spaces will help. If the color space is required outside of the internal image processing code, you can always access the corresponding (lower-level) PDF objects directly.
That's right. As I understand it, aspects such as text extraction and page modifications are the focus ... Nevertheless, the construction and handling of "colors" in PDF in general is not simple, so structured access to and usage of colors could help.
If the general issue is just about using a provided fallback color space if PANTONE is specified, this is something which I see as a viable solution for this issue, although someone who has licensed PANTONE might still want to use the actual PANTONE color mapping instead of the fallback.
Absolutely. This is what the alternate color space is about, and that's its intended use. Understanding the intended color space (e.g. a DeviceN with a PANTONE colorant) will not be harmed by such a strategy (e.g. a method with a parameter "use_alternate_color_space=False" could solve this).
If I understand the discussion correctly, if these functions were implemented in pypdf, we would not need direct support for PANTONE in pypdf, as we could calculate the desired conversion values with the provided functions?
To answer that, I want to look at it from a more general perspective: according to the PDF spec (ISO 32000), a PDF is always self-contained concerning colors and their representation, in the sense that interpreting and drawing a PDF page is always possible (monitor preview, printing). This is implemented by the concept of alternate color spaces, which always fall back to the widely agreed models of Gray, RGB and CMYK color spaces.
PDF covers two aspects with its color space concept:
- what is the color (or, often more precisely, ink) intent and
- what does it look like in one of the well-known color spaces (Gray, RGB, CMYK) = alternate color space
So if we look at the special case of PANTONE, that is the origin of this issue ( :-) ) we have to consider again two aspects of how PANTONE is used today:
- PANTONE as a design color (e.g. a nice warm green) used in art work creation
- PANTONE as a physical ink, printed on an industrial press in an inking unit, used in PDF data created by prepress software
Both aspects are valid and are actually not hard-coded in the PDF itself, but are a matter of interpretation by the user (PDF reader software).
Short: yes, we do NOT need anything from PANTONE and are able to use the intended PDF fallback for it (=alternate color space)
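Once a color is available in the alternate DeviceCMYK space, an approximate RGB preview can be derived from it. The sketch below uses the simple DeviceCMYK to DeviceRGB conversion described in the PDF specification (red = 1 - min(1, cyan + black), and likewise for green and blue). It is not color managed, so treat the result as a rough indication of the color intent only; `cmyk_to_rgb` is a hypothetical helper.

```python
from typing import Tuple


def cmyk_to_rgb(c: float, m: float, y: float, k: float) -> Tuple[float, float, float]:
    """Naive DeviceCMYK -> DeviceRGB conversion, all components in 0..1."""
    return (
        1.0 - min(1.0, c + k),
        1.0 - min(1.0, m + k),
        1.0 - min(1.0, y + k),
    )


# CMYK value of the full tint, taken from the /C1 entry of object 465.
r, g, b = cmyk_to_rgb(1.0, 0.641566, 0.0, 0.0)
print(tuple(round(v * 255) for v in (r, g, b)))
```

For accurate output, an ICC-based transform would be needed instead of this formula.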
Nevertheless, the construction and handling of "colors" in PDF in general is not simple, so structured access to and usage of colors could help.
This is a separate feature request with corresponding design choices to take if this can provide enough benefit for pypdf users.
Hi @stefan6419846, I did an alternative implementation to extract XObject images and create PIL images from them. So far, it covers Separation and DeviceN color spaces with an alternate CMYK color space only, and solves the PANTONE problem by returning the PIL images in the alternate color space (for the example PDF file given above). Please have a look at this commit: https://github.com/py-pdf/pypdf/commit/8b0481cd915150431619b23551f01ed784b776f4 You can get one of the PANTONE images with this code:
from pypdf import PdfReader

reader = PdfReader("05983-DE-AmigoRule.pdf")
# "/Im33" is one of the images that use the PANTONE color space
img = reader.pages[0].images["/Im33"]
img.image.save(r"*your path here*" + img.name)
Note: the XObjectWrapper class is a great mess at the moment, and I just put in the use case from the file with the PANTONE issue. It must be rewritten properly to cover all encodings and color spaces ... The code in file _color.py contains a wrapper to apply functions as discussed in #3393
I will keep going to get to a complete implementation of the XObjectWrapper class. However, my time is limited at the moment, so I do not know when I will have a first version that covers all encodings and color spaces.
Please let me know if the new concept could be something to replace the code in _xobj_image_helpers.py in the future ...
Your link is not working, but I found the corresponding branch and diff at https://github.com/py-pdf/pypdf/compare/main...henningkoertelgmg:pypdf:MAINT-extract-img
As for implementation details, I cannot give a proper review for now, as it is much easier to do in a PR anyway. Ideally, we would avoid duplicating functionality. Since _xobject_image_helpers is internal API anyway, changing something there to make it more usable and understandable should not be an issue. As always, please consider splitting large changes into smaller ones to simplify reviews.
From my side, there is no need to enforce such changes in the next days - just take your time. Planning the next major release of pypdf and some upcoming holiday have the higher priority from my side anyway.
Your link is not working, but I found the corresponding branch and diff at main...henningkoertelgmg:pypdf:MAINT-extract-img
Sorry, link is fixed now ...