pdfcpu
DeviceN images are not extracted
Grayscale images in this PDF are not extracted. I think the problem may be that the images don't define a filter, and this code: https://github.com/pdfcpu/pdfcpu/blob/04634d3a66a522775f01ba93bcbbd740915bd62d/pkg/pdfcpu/extract.go#L386-L388 skips the image without warning.
Low priority issue for me - but thought that the code above should at least generate a warning if skipping images.
Here's the example.pdf. It was generated by the adobe suite, which may be part of the problem.
$ pdfcpu version
pdfcpu: v0.6.0 dev
commit: 04634d3a (2024-01-25T20:46:43Z)
base : go1.21.4
$ pdfcpu images list example.pdf
pages: all
example.pdf
2 images available(16.3 MB)
Page Obj# │ Id │ Type SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │ Size │ Filters
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━
1 15 │ Im2 │ image │ 2400 │ 3554 │ DeviceGray 1 8 │ 8.1 MB │
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━
2 28 │ Im3 │ image │ 2400 │ 3554 │ DeviceGray 1 8 │ 8.1 MB │
$ pdfcpu extract -m image example.pdf images/
extracting images from example.pdf into images
optimizing...
$ ls -l images
total 0
Interestingly, when I open the PDF in macOS Preview, edit it (e.g. delete a page), and then save it again, this seems to add filter metadata (and change the color space 🤷), which allows the images to be extracted.
pdfcpu images list example-edited.pdf
Page Obj# │ Id │ Type SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │ Size │ Filters
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━
1 9 │ Im1 │ image │ 2400 │ 3554 │ ICCBased 1 8 * │ 667 KB │ FlateDecode
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━
2 20 │ Im2 │ image │ 2400 │ 3554 │ ICCBased 1 8 * │ 1.5 MB │ FlateDecode
I'm seeing a second example of exiting without processing the image or warning here: https://github.com/pdfcpu/pdfcpu/blob/98cb73bc8ea9a3dd6f4395a3c3b18bd5f13bb892/pkg/pdfcpu/image.go#L239-L241
I ran into this with a different PDF that has a DeviceN colorspace image:
pdfcpu images list example2-DeviceN.pdf
1 images available(3.4 MB)
Page Obj# │ Id │ Type SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │ Size │ Filters
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━
1 13 │ Im0 │ image │ 2400 │ 3554 │ DeviceN 6 8 │ 3.4 MB │ FlateDecode
Yeah, that does not surprise me at all, Apple magic... Thanks, I'll take a look.
Could you provide a sample for the DeviceN colorspace image issue? 🙏🏻 The intention behind ignoring these was to avoid interrupting an ongoing image extraction and to postpone the implementation until samples were available, so that a particular decoding could also be tested.
Yes - here's the DeviceN file example2-DeviceN.pdf
To explain a bit more about the color space of this image - it's "CMYK" + two spot colors, so 6 channels in all (im.comp=6). Each channel contains a grayscale image.
The case statement here: https://github.com/pdfcpu/pdfcpu/blob/043541b532e5ff4a3fa214443059dcf7e8fc51ea/pkg/pdfcpu/writeImage.go#L775-L777 probably should have a default with a warning.
I suspect I will need to make a custom renderDevice for this type of thing. Does that sound right?
I think so. I am busy in another corner; meanwhile, if you want to take a stab, go for it.
Your first example contains uncompressed images; the latest commit is a fix for this.
The second example is tricky, since it involves some PostScript processing in order to map the 6 color components to the alternative CMYK colorspace.
The latest commit contains an incomplete fix, in the sense that it at least renders a gray image for your example 2. So processing of DeviceN colorspaces with more than 4 components remains open.
At some point I need to return to this; right now I am tied up with other issues.
Thanks for handling the uncompressed case.
I can tackle the DeviceN colorspaces with more than 4 components in a new PR - I already have some code for this. The tricky part there is going to be that there are multiple output files per PDFImage, which will probably need a type change for RenderImage() to return []io.Reader, along with all of its sub-functions.
I will help out with the overall design of this once you have the rendering part working somehow. I believe this is going to be tricky though, because what we'd actually need is a PostScript interpreter for PostScript functions (type 4), or did I miss anything?
This image parsing code:
https://github.com/adamgreenhall/pdfcpu/blob/1f162698f29345bf8f886b29ad6cb28b001b6cbd/pkg/pdfcpu/writeImage.go#L414-L440
is properly extracting the 6 grayscale images in the channels of the example2-DeviceN.pdf file.
But clearly the organization of where to write the files needs to change. Ideas on how to do that? My initial thought was that RenderImage() returns []io.Reader, along with all of its sub-functions, but that's going to be a lot of changes - all for this one unusual case.
Awesome! Let me take a look.
How did you figure out the necessary decoding for this? Did you take all of the following into account? Looks like your solution is hardcoded sort of..
11: offset= 5708 generation=0 types.Array
[DeviceN [Cyan Magenta Yellow Black coral light teal] DeviceCMYK (24 0 R) (25 0 R)]
15: offset= 3535694 generation=0 types.Array
[Separation coral DeviceRGB
<<
<C0, [1.00 1.00 1.00]>
<C1, [1.00 0.56 0.57]>
<Domain, [0 1]>
<FunctionType, 2>
<N, 1.00>
<Range, [0.00 1.00 0.00 1.00 0.00 1.00]>
>>
]
19: offset= 3535850 generation=0 types.Array
[Separation light teal DeviceRGB
<<
<C0, [1.00 1.00 1.00]>
<C1, [0.00 0.62 0.65]>
<Domain, [0 1]>
<FunctionType, 2>
<N, 1.00>
<Range, [0.00 1.00 0.00 1.00 0.00 1.00]>
>>
]
24: offset= 3537685 generation=0 types.StreamDict
<<
<Domain, [0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00]>
<Filter, FlateDecode>
<FunctionType, 4>
<Length, 185>
<Range, [0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00]>
>>
25: offset= 3538049 generation=0 types.Dict subType=NChannel
<<
<Colorants, (26 0 R)>
<Process, (27 0 R)>
<Subtype, NChannel>
>>
26: offset= 3538119 generation=0 types.Dict
<<
<coral, (15 0 R)>
<light teal, (19 0 R)>
>>
27: offset= 3538173 generation=0 types.Dict
<<
<ColorSpace, DeviceCMYK>
<Components, [Cyan Magenta Yellow Black]>
>>
Your code is working on the assumption that any DeviceN color space using more than 4 components is a CMYK plus Spot To MultiGray image.
I am unsure if we can commit to this - can we?
Agree that this can't be committed as written. To get it to a place where we could merge, I think we'd want:
- a way to detect this CMYK+Spot type of image (rather than assuming that CMYK with >4 channels is automatically it). I'm not sure how to do this. Possibly the DeviceN colorspace plus the other Separation info in the PDF (matching the extra color channel names) could be a way to decide?
- a way to name/write multiple files per PDFImage that makes sense. I have an idea on this, but it's a little messy.
Do ^ those two make sense?
As for the encoding, I knew what the gray images should look like, and I tried <x,y,c> ordering options until the outputs looked right. I don't know that there is a spec for these InDesign generated PDFs.