DeviceN images are not extracted

Open adamgreenhall opened this issue 1 year ago • 13 comments

Grayscale images in this PDF are not extracted. I think the problem may be that the images don't define a filter, and this code: https://github.com/pdfcpu/pdfcpu/blob/04634d3a66a522775f01ba93bcbbd740915bd62d/pkg/pdfcpu/extract.go#L386-L388 skips the image without a warning.

Low priority issue for me - but thought that the code above should at least generate a warning if skipping images.

Here's the example.pdf. It was generated by the Adobe suite, which may be part of the problem.

$ pdfcpu version
pdfcpu: v0.6.0 dev
commit: 04634d3a (2024-01-25T20:46:43Z)
base  : go1.21.4

$ pdfcpu images list  example.pdf    
pages: all

example.pdf    
2 images available(16.3 MB)
Page Obj# │ Id  │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━
   1   15 │ Im2 │ image                  │  2400 │   3554 │ DeviceGray    1   8        │ 8.1 MB │ 
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━
   2   28 │ Im3 │ image                  │  2400 │   3554 │ DeviceGray    1   8        │ 8.1 MB │ 

$ pdfcpu extract -m image example.pdf images/
extracting images from example.pdf into images
optimizing...

$ ls -l images
total 0

Interestingly, when I open the PDF in macOS Preview, edit it (e.g. delete a page) and then save it again, this seems to add filter metadata (and change the color space 🤷 ), which allows the images to be extracted.

pdfcpu images list  example-edited.pdf
Page Obj# │ Id  │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━
   1    9 │ Im1 │ image                  │  2400 │   3554 │   ICCBased    1   8    *   │ 667 KB │ FlateDecode
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━
   2   20 │ Im2 │ image                  │  2400 │   3554 │   ICCBased    1   8    *   │ 1.5 MB │ FlateDecode

adamgreenhall avatar Jan 27 '24 17:01 adamgreenhall

seeing a second example of exiting without processing the image or warning here: https://github.com/pdfcpu/pdfcpu/blob/98cb73bc8ea9a3dd6f4395a3c3b18bd5f13bb892/pkg/pdfcpu/image.go#L239-L241

ran into this with a different PDF, with a DeviceN colorspace image:

pdfcpu images list  example2-DeviceN.pdf

1 images available(3.4 MB)
Page Obj# │ Id  │ Type  SoftMask ImgMask │ Width │ Height │ ColorSpace Comp bpc Interp │   Size │ Filters
━━━━━━━━━━┿━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━
   1   13 │ Im0 │ image                  │  2400 │   3554 │    DeviceN    6   8        │ 3.4 MB │ FlateDecode

adamgreenhall avatar Jan 27 '24 18:01 adamgreenhall

Yeah, that does not surprise me at all, Apple magic.. Thanks, I'll take a look.

hhrutter avatar Jan 27 '24 18:01 hhrutter

Could you provide a sample for the DeviceN colorspace img issue? 🙏🏻 The intention for ignoring these was not to interrupt any ongoing image extraction and to postpone the implementation until samples were available, in order to also test a particular decoding.

hhrutter avatar Jan 27 '24 19:01 hhrutter

Yes - here's the DeviceN file example2-DeviceN.pdf

adamgreenhall avatar Jan 27 '24 19:01 adamgreenhall

To explain a bit more about the color space of this image - it's "CMYK" + two spot colors, so 6 channels in all (im.comp=6). Each channel contains a grayscale image.

The case statement here: https://github.com/pdfcpu/pdfcpu/blob/043541b532e5ff4a3fa214443059dcf7e8fc51ea/pkg/pdfcpu/writeImage.go#L775-L777 probably should have a default with a warning

I suspect I will need to make a custom renderDevice for this type of thing. Does that sound right?
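
A rough sketch of what "each channel contains a grayscale image" could mean in code. It assumes 8 bits per component and pixel-interleaved samples (all components of a pixel stored consecutively, rows top to bottom); `splitChannels` is a hypothetical helper, not a pdfcpu function.

```go
package main

import (
	"fmt"
	"image"
)

// splitChannels de-interleaves 8-bit samples into one grayscale image per
// color component. Assumption for this sketch: data is laid out as
// c0 c1 ... c(comp-1) for each pixel, row by row.
func splitChannels(data []byte, width, height, comp int) ([]*image.Gray, error) {
	if width <= 0 || height <= 0 || comp <= 0 {
		return nil, fmt.Errorf("invalid dimensions %dx%d with %d components", width, height, comp)
	}
	if need := width * height * comp; len(data) < need {
		return nil, fmt.Errorf("need %d bytes, got %d", need, len(data))
	}
	planes := make([]*image.Gray, comp)
	for c := range planes {
		planes[c] = image.NewGray(image.Rect(0, 0, width, height))
	}
	for i := 0; i < width*height; i++ {
		for c := 0; c < comp; c++ {
			// Tint values are copied unchanged. For subtractive colorants a
			// value of 0 means "no ink", so a viewer may prefer 255-v.
			planes[c].Pix[i] = data[i*comp+c]
		}
	}
	return planes, nil
}

func main() {
	// Two pixels, six components each (e.g. C, M, Y, K plus two spot colors).
	data := []byte{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
	planes, err := splitChannels(data, 2, 1, 6)
	if err != nil {
		panic(err)
	}
	for c, p := range planes {
		fmt.Println("channel", c, p.Pix)
	}
}
```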

adamgreenhall avatar Jan 27 '24 19:01 adamgreenhall

I think so.. I am busy in another corner; meanwhile, if you want to take a stab, go for it.

hhrutter avatar Jan 27 '24 19:01 hhrutter

Your first example contains uncompressed images, the latest commit is a fix for this.

The second example is tricky, since it involves some postscript processing in order to map the 6 color components to the alternative CMYK colorspace.

The latest commit contains an incomplete fix, in the sense that it at least renders a gray image for your example 2. So processing of DeviceN colorspaces with more than 4 components remains open.

At some point I need to return to this, right now I am tied up with other issues

hhrutter avatar Jan 31 '24 08:01 hhrutter

Thanks for handling the uncompressed case.

I can tackle the DeviceN colorspaces with more than 4 components in a new PR - I already have some code for this. The tricky part there is going to be that there are multiple output files per PDFImage - which will probably need a type change for RenderImage() to return []io.Reader - along with all of its sub-functions.

adamgreenhall avatar Jan 31 '24 14:01 adamgreenhall

I will help out with the overall design of this once you have the rendering part working somehow. I believe this is going to be tricky though, because what we'd actually need is a PostScript interpreter for PostScript functions (type 4), or did I miss anything?

hhrutter avatar Jan 31 '24 17:01 hhrutter

this image parsing code:

https://github.com/adamgreenhall/pdfcpu/blob/1f162698f29345bf8f886b29ad6cb28b001b6cbd/pkg/pdfcpu/writeImage.go#L414-L440

is properly extracting the 6 grayscale images in the channels of the example2-DeviceN.pdf file.

But clearly the organization of where to write the files needs to change. Ideas on how to do that? My initial thought was that RenderImage() returns []io.Reader - along with all of its sub-functions, but that's going to be a lot of changes - all for this one unusual case.
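
One possible shape for the multi-output idea, sketched under assumptions: `namedOutput`, `renderChannels`, and `outputFileName` are hypothetical names, and nothing here reflects how pdfcpu's RenderImage() is actually structured. An ordinary image would be a slice of length one with an empty suffix.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// namedOutput pairs a rendered stream with a file name suffix.
// Hypothetical type for this sketch; not part of pdfcpu.
type namedOutput struct {
	Suffix string // "" for ordinary images, the colorant name for a separation
	R      io.Reader
}

// outputFileName builds names such as "Im0.png" or "Im0_light-teal.png".
func outputFileName(id, suffix, ext string) string {
	if suffix == "" {
		return id + "." + ext
	}
	return id + "_" + strings.ReplaceAll(suffix, " ", "-") + "." + ext
}

// renderChannels wraps each already-decoded plane in a reader tagged with
// its colorant name, so the caller can write one file per channel.
func renderChannels(colorants []string, planes [][]byte) ([]namedOutput, error) {
	if len(colorants) != len(planes) {
		return nil, fmt.Errorf("%d colorants but %d planes", len(colorants), len(planes))
	}
	outs := make([]namedOutput, len(planes))
	for i, p := range planes {
		outs[i] = namedOutput{Suffix: colorants[i], R: bytes.NewReader(p)}
	}
	return outs, nil
}

func main() {
	outs, err := renderChannels([]string{"Black", "light teal"}, [][]byte{{0x00}, {0xff}})
	if err != nil {
		panic(err)
	}
	for _, o := range outs {
		fmt.Println(outputFileName("Im0", o.Suffix, "png"))
	}
}
```

Carrying the suffix alongside the reader keeps file naming out of the decoder, which may limit how many signatures would need to change.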

adamgreenhall avatar Feb 02 '24 15:02 adamgreenhall

Awesome! Let me take a look.

hhrutter avatar Feb 02 '24 16:02 hhrutter

How did you figure out the necessary decoding for this? Did you take all of the following into account? Looks like your solution is hardcoded sort of..

 11:   offset=    5708 generation=0 types.Array
[DeviceN [Cyan Magenta Yellow Black coral light teal] DeviceCMYK (24 0 R) (25 0 R)]

 15:   offset= 3535694 generation=0 types.Array
[Separation coral DeviceRGB
	<<
		<C0, [1.00 1.00 1.00]>
		<C1, [1.00 0.56 0.57]>
		<Domain, [0 1]>
		<FunctionType, 2>
		<N, 1.00>
		<Range, [0.00 1.00 0.00 1.00 0.00 1.00]>
	>>
	]

19:   offset= 3535850 generation=0 types.Array
[Separation light teal DeviceRGB
	<<
		<C0, [1.00 1.00 1.00]>
		<C1, [0.00 0.62 0.65]>
		<Domain, [0 1]>
		<FunctionType, 2>
		<N, 1.00>
		<Range, [0.00 1.00 0.00 1.00 0.00 1.00]>
	>>
	]

24:   offset= 3537685 generation=0 types.StreamDict
<<
	<Domain, [0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00]>
	<Filter, FlateDecode>
	<FunctionType, 4>
	<Length, 185>
	<Range, [0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00]>
>>

25:   offset= 3538049 generation=0 types.Dict subType=NChannel
<<
	<Colorants, (26 0 R)>
	<Process, (27 0 R)>
	<Subtype, NChannel>
>>
   26:   offset= 3538119 generation=0 types.Dict
<<
	<coral, (15 0 R)>
	<light teal, (19 0 R)>
>>
   27:   offset= 3538173 generation=0 types.Dict
<<
	<ColorSpace, DeviceCMYK>
	<Components, [Cyan Magenta Yellow Black]>
>>

Your code is working on the assumption that any DeviceN color space using more than 4 components is a CMYK plus Spot To MultiGray image.

I am unsure if we can commit to this - can we?

hhrutter avatar Feb 06 '24 01:02 hhrutter

Agree that this can't be committed as written. To get it to a place where we could merge, I think we'd want:

  1. a way to detect this CMYK+Spot type of image (rather than assuming that any DeviceN image with >4 channels is automatically it). I'm not sure how to do this. Possibly the DeviceN colorspace plus the other Separation info in the PDF (matching the extra color channel names) could be a way to decide?
  2. a way to name/write multiple files per PDFImage that makes sense. I have an idea on this, but it's a little messy.

Do ^ those two make sense?
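
A sketch of how point 1 might be checked, assuming the DeviceN attributes have already been parsed into a plain struct. `deviceN` and `isCMYKPlusSpot` are hypothetical names, and the rule itself (NChannel subtype, DeviceCMYK process space, every extra name listed under Colorants) is just one heuristic based on the object dump earlier in the thread.

```go
package main

import "fmt"

// deviceN holds the parts of a DeviceN color space used for detection.
// Hypothetical struct for this sketch; not a pdfcpu type.
type deviceN struct {
	Names             []string        // component names of the DeviceN array
	Subtype           string          // attributes dict: Subtype
	ProcessColorSpace string          // attributes dict: Process/ColorSpace
	ProcessComponents []string        // attributes dict: Process/Components
	Colorants         map[string]bool // keys of the Colorants dict
}

// isCMYKPlusSpot reports whether cs looks like four CMYK process components
// plus one or more spot colors that are all declared under Colorants.
func isCMYKPlusSpot(cs deviceN) bool {
	if cs.Subtype != "NChannel" || cs.ProcessColorSpace != "DeviceCMYK" {
		return false
	}
	if len(cs.ProcessComponents) != 4 {
		return false
	}
	process := make(map[string]bool, 4)
	for _, n := range cs.ProcessComponents {
		process[n] = true
	}
	processSeen, spots := 0, 0
	for _, n := range cs.Names {
		switch {
		case process[n]:
			processSeen++
		case cs.Colorants[n]:
			spots++
		default:
			return false // a name that is neither process nor a declared spot color
		}
	}
	return processSeen == 4 && spots >= 1
}

func main() {
	cs := deviceN{
		Names:             []string{"Cyan", "Magenta", "Yellow", "Black", "coral", "light teal"},
		Subtype:           "NChannel",
		ProcessColorSpace: "DeviceCMYK",
		ProcessComponents: []string{"Cyan", "Magenta", "Yellow", "Black"},
		Colorants:         map[string]bool{"coral": true, "light teal": true},
	}
	fmt.Println(isCMYKPlusSpot(cs))
}
```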

As for the encoding, I knew what the gray images should look like, and I tried <x,y,c> ordering options until the outputs looked right. I don't know that there is a spec for these InDesign generated PDFs.

adamgreenhall avatar Feb 08 '24 22:02 adamgreenhall