
Color "convenience operators" should (per spec) also set color space

Open jsvine opened this issue 2 years ago • 1 comments

This bug report concerns the following portion of Section 4.5 of the PDF reference: "Color values are interpreted according to the current color space, another parameter of the graphics state. A PDF content stream first selects a color space by invoking the CS operator (for the stroking color) or the cs operator (for the non-stroking color). It then selects color values within that color space with the SC operator (stroking) or the sc operator (nonstroking). There are also convenience operators—G, g, RG, rg, K, and k—that select both a color space and a color value within it in a single step." (Emphasis added.)

Many PDFs use those "convenience operators," but when pdfminer.six handles them it sets only the color value, not the color space: https://github.com/pdfminer/pdfminer.six/blob/43c8fc8557528463c99598049b7005ae96ab8084/pdfminer/pdfinterp.py#L652-L690

As a result, the value of .ncs for many LTChars is incorrect, not reflecting the current color space for the character. (Knowing the color space is necessary to correctly interpret the color value; for instance, both Lab and DeviceRGB have color values of length 3.)
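To illustrate why the color space matters, here is a minimal sketch of a plausibility check that compares a color value's length against the number of components its color space expects. The `COMPONENTS` table and the helper names are hypothetical and cover only a few color spaces; none of this is part of pdfminer.six.

```python
# Number of color components per color space (small, non-exhaustive subset).
# Lab and DeviceRGB both take 3 components, so length alone cannot tell
# them apart; the color space name is needed to interpret the value.
COMPONENTS = {
    "DeviceGray": 1,
    "DeviceRGB": 3,
    "DeviceCMYK": 4,
    "Lab": 3,
}


def as_tuple(value):
    """Normalize a scalar or sequence color value to a tuple of floats."""
    if isinstance(value, (int, float)):
        return (float(value),)
    return tuple(float(v) for v in value)


def is_plausible(space_name, value):
    """Return True if the value has the component count the space expects."""
    expected = COMPONENTS.get(space_name)
    if expected is None:
        return True  # unknown color space: this sketch cannot judge it
    return len(as_tuple(value)) == expected
```

A check like this flags a three-component value paired with DeviceGray, but it cannot distinguish a Lab value from a DeviceRGB one, which is exactly why the recorded color space has to be right.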

This script — colortest.py.txt — demonstrates the issue using this repository's samples/ directory as a corpus, producing these results when run from the repository's root directory: results.txt

Interpreting a sample of the results:

>>> contrib/issue-00352-hash-twos-complement.pdf
DeviceGray x 0.0: 3033
DeviceGray x (0.0, 0.0, 1.0): 165

... indicates that, in contrib/issue-00352-hash-twos-complement.pdf, 3,033 characters have a color space of DeviceGray and a color value of 0.0 (which seems reasonable) while 165 characters have a color space of DeviceGray and color value of (0.0, 0.0, 1.0) (which seems clearly incorrect).
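A tally like the one above could be produced roughly as follows. This is a sketch, not the attached colortest.py.txt (which is not reproduced here): the function names are hypothetical, and it assumes each LTChar exposes `.ncs` (with a `.name`) and `.graphicstate.ncolor`, and that pdfminer.six is installed for the PDF-reading parts.

```python
from collections import Counter


def iter_chars(layout_obj):
    """Recursively yield LTChar objects from a pdfminer layout tree."""
    from pdfminer.layout import LTChar  # lazy import; needs pdfminer.six

    if isinstance(layout_obj, LTChar):
        yield layout_obj
    elif hasattr(layout_obj, "__iter__"):
        for child in layout_obj:
            yield from iter_chars(child)


def tally_colors(chars):
    """Count (non-stroking color space name, non-stroking color) pairs."""
    counts = Counter()
    for char in chars:
        space = getattr(char.ncs, "name", None)
        color = char.graphicstate.ncolor
        if isinstance(color, list):
            color = tuple(color)  # lists are unhashable; normalize them
        counts[(space, color)] += 1
    return counts


def tally_pdf(path):
    """Tally color space / color pairs for every character in one PDF."""
    from pdfminer.high_level import extract_pages  # lazy import

    pages = extract_pages(path)
    return tally_colors(c for page in pages for c in iter_chars(page))
```

Calling `tally_pdf("samples/contrib/issue-00352-hash-twos-complement.pdf")` from the repository root and printing the resulting counter should give a breakdown in the same spirit as the results shown above, though the exact formatting will differ.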

I believe that having the "convenience operator" implementations (do_RG(...), etc.) also set the color space would fix the issue. A quick test suggests that it would, but I'm not familiar enough with the full scope of color handling in the PDF reference or in this repository to claim this definitively.
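The idea can be sketched with a toy interpreter in which each convenience operator sets both the color space and the color value, as the quoted passage of the PDF reference describes. The `ToyGraphicState` and `ToyInterpreter` classes are hypothetical stand-ins written for illustration; they are not pdfminer.six's actual classes, and color spaces are represented here by plain strings.

```python
from dataclasses import dataclass


@dataclass
class ToyGraphicState:
    # Per the PDF reference, both color spaces start as DeviceGray (black).
    scs: str = "DeviceGray"    # stroking color space
    scolor: object = 0.0       # stroking color value
    ncs: str = "DeviceGray"    # non-stroking color space
    ncolor: object = 0.0       # non-stroking color value


class ToyInterpreter:
    def __init__(self):
        self.gs = ToyGraphicState()

    # Stroking convenience operators: set space AND value together.
    def do_G(self, gray):
        self.gs.scs = "DeviceGray"
        self.gs.scolor = float(gray)

    def do_RG(self, r, g, b):
        self.gs.scs = "DeviceRGB"
        self.gs.scolor = (float(r), float(g), float(b))

    def do_K(self, c, m, y, k):
        self.gs.scs = "DeviceCMYK"
        self.gs.scolor = (float(c), float(m), float(y), float(k))

    # Non-stroking counterparts.
    def do_g(self, gray):
        self.gs.ncs = "DeviceGray"
        self.gs.ncolor = float(gray)

    def do_rg(self, r, g, b):
        self.gs.ncs = "DeviceRGB"
        self.gs.ncolor = (float(r), float(g), float(b))

    def do_k(self, c, m, y, k):
        self.gs.ncs = "DeviceCMYK"
        self.gs.ncolor = (float(c), float(m), float(y), float(k))
```

With this shape, an `rg` operator leaves the state as DeviceRGB with a three-component value, so the mismatched DeviceGray plus `(0.0, 0.0, 1.0)` pairing seen in the results could not arise from it.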

If, however, this makes sense to the repository maintainers, I could file a preliminary PR and we could take it from there.

jsvine avatar Jun 29 '22 21:06 jsvine

Makes total sense to me!

Thanks for the thorough research!

A PR would be more than welcome!

pietermarsman avatar Aug 08 '22 20:08 pietermarsman

> A PR would be more than welcome!

Super, now filed!: https://github.com/pdfminer/pdfminer.six/pull/794

jsvine avatar Aug 15 '22 04:08 jsvine

Closed by #794

pietermarsman avatar Aug 18 '22 18:08 pietermarsman