Support DCT Filter

Open aagubanov opened this issue 2 years ago • 6 comments

This is an enhancement request. The DCT (Discrete Cosine Transform) filter is not supported, and as a result some embedded JPEG images cannot be extracted.

aagubanov avatar Jan 08 '23 16:01 aagubanov

@aagubanov , @EliotJones,

I have this well underway.

The DCT decoding itself is well developed (for 1 of the 4 modes). However, a lot of varied and complex post-decode work has yet to be addressed, including:

1. translating to the (8 or so) “final” colorspaces
2. (sub)sampling
3. stretch
4. masks (alpha/transparency)
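To make the (sub)sampling item concrete: when chroma planes are stored at reduced resolution, they have to be brought back to full size before color conversion. A minimal sketch, assuming planes are plain lists of rows and using nearest-neighbour replication (the function name is hypothetical and this is not PdfPig code; a real decoder may use smoother interpolation):

```python
from typing import List


def upsample_nearest(plane: List[List[int]], h_factor: int, v_factor: int) -> List[List[int]]:
    """Replicate each sample h_factor times horizontally and v_factor times vertically."""
    if h_factor < 1 or v_factor < 1:
        raise ValueError("sampling factors must be >= 1")
    out: List[List[int]] = []
    for row in plane:
        # Widen the row by repeating each sample.
        wide = [value for value in row for _ in range(h_factor)]
        # Repeat the widened row to restore the vertical resolution.
        for _ in range(v_factor):
            out.append(list(wide))
    return out
```

For a 4:2:0 image, each chroma plane would be upsampled with `h_factor=2, v_factor=2` before it is combined with the luma plane.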

Beyond implementation, there will be testing.

The test matrix is large, with many options and combinations. Beyond testing against “cold” test PDFs hand-built from scratch, there is the job of finding “in the wild” examples of “good” and “bad” implementations.

Relates to #484: image export from the PDF provided there starts with the DCT filter, before the ColorSeparation colorspace is used.

The ColorSeparation colorspace itself makes use of a “tint function”, which can be any of 4 function types:

- Type 0: Sampled function
- Type 2: Exponential interpolation function
- Type 3: Stitching function
- Type 4: PostScript calculator function
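As an illustration of the simplest of these, a Type 2 (exponential interpolation) function maps a single tint value x to `C0 + x^N * (C1 - C0)` per output component. A minimal sketch, assuming a one-input function with a `[0, 1]` domain (the function and parameter names are hypothetical and not PdfPig's API):

```python
from typing import List, Sequence, Tuple


def eval_exponential(
    x: float,
    c0: Sequence[float],
    c1: Sequence[float],
    n: float,
    domain: Tuple[float, float] = (0.0, 1.0),
) -> List[float]:
    """Evaluate y_j = C0_j + x^N * (C1_j - C0_j) for each output component j."""
    if len(c0) != len(c1):
        raise ValueError("C0 and C1 must have the same length")
    # Inputs outside the domain are clipped to it.
    x = min(max(x, domain[0]), domain[1])
    xn = x ** n
    return [a + xn * (b - a) for a, b in zip(c0, c1)]
```

For a separation whose alternate space is DeviceCMYK, `c0` would typically be the "no ink" color and `c1` the full-tint color, so a tint of 0.5 with `n = 1` lands halfway between them.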

These are also well underway; however, testing them will again be significant work.

I have found 11 (public) “in the wild” example PDFs using the separation colorspace [it's rare].

DCT (Discrete Cosine Transform) decoding is based on ITU-T T.81, section 4.5, which defines four distinct modes of operation with various coding processes: 1. sequential DCT-based, 2. progressive DCT-based, 3. lossless, and 4. hierarchical. Currently only mode 1 is implemented. The PDF spec calls for mode 2 support (though it is unlikely to be needed in practice for most PDFs). The decoder supports 8-bit grayscale and YCbCr images. Translation to the RGB colorspace is done; other colorspaces require work. Restart markers are supported.
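For anyone triaging sample PDFs, the mode of an embedded JPEG stream can be read from its frame header marker. A rough sketch, assuming the stream starts at SOI and using the T.81 marker codes (SOF0 to SOF3, DHP); the function name is hypothetical, and the remaining SOF variants are lumped together for brevity:

```python
import struct
from typing import Optional

# Start-of-frame markers for the Huffman-coded, non-differential processes.
SOF_MODES = {
    0xC0: "baseline sequential DCT",
    0xC1: "extended sequential DCT",
    0xC2: "progressive DCT",
    0xC3: "lossless",
}


def detect_dct_mode(data: bytes) -> Optional[str]:
    """Walk the marker segments before the first scan and classify the frame."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream: missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker in (0xD9, 0xDA):  # EOI or SOS reached without a frame header
            return None
        if marker == 0xDE:  # DHP announces a hierarchical progression
            return "hierarchical"
        if marker in SOF_MODES:
            return SOF_MODES[marker]
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            return "other (differential or arithmetic-coded) frame"
        # Any other segment: skip it using its big-endian length field.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + length
    return None
```

Running this over a corpus would show how many "in the wild" images actually need more than mode 1.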

Adobe Technical Note #5116 details additional decode handling (from inside a PDF), including support for the APP14 "Adobe" application segment, which carries a hint for the colorspace transform. The default is to use the YCC-to-RGB [color] transform. Byte 11 signals the color transform:
// 0 = none (RGB or CMYK)
// 1 = YCbCr
// 2 = YCCK
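A sketch of how the hint and the default transform could fit together. It assumes the segment payload (after the 2-byte length) is `Adobe`, a 2-byte version, two 2-byte flag fields and then the transform byte, with values 0 = no transform, 1 = YCbCr, 2 = YCCK, which is how the flag is commonly documented (for example in libjpeg) and should be checked against the technical note. The conversion uses the usual JFIF full-range coefficients; all names are hypothetical:

```python
from typing import Optional, Tuple


def parse_adobe_transform(payload: bytes) -> Optional[int]:
    """Return the transform byte of an APP14 'Adobe' payload, or None if absent."""
    if len(payload) < 12 or not payload.startswith(b"Adobe"):
        return None
    return payload[11]


def choose_conversion(components: int, transform: Optional[int]) -> str:
    """Pick a post-decode conversion from the component count and the hint."""
    if components == 1:
        return "gray"
    if components == 3:
        # Without a hint, the default is the YCbCr-to-RGB transform.
        return "rgb" if transform == 0 else "ycbcr-to-rgb"
    if components == 4:
        return "ycck-to-cmyk" if transform == 2 else "cmyk"
    raise ValueError(f"unsupported component count: {components}")


def _clamp(value: float) -> int:
    return min(255, max(0, int(round(value))))


def ycbcr_to_rgb(y: int, cb: int, cr: int) -> Tuple[int, int, int]:
    """Convert one full-range 8-bit YCbCr sample to RGB (JFIF coefficients)."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return _clamp(r), _clamp(g), _clamp(b)
```

The CMYK and YCCK branches only name the conversion; implementing them is part of the "other colorspaces" work.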

8-bit only (16-bit or other bit depths require down/up sampling to 8-bit; yet to be implemented).
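The down/up sampling itself is simple arithmetic. A minimal sketch, assuming unsigned samples and linear rescaling with rounding (the function name is hypothetical):

```python
def to_8bit(sample: int, bits: int) -> int:
    """Linearly rescale an unsigned sample of the given bit depth to 0..255."""
    if bits < 1:
        raise ValueError("bit depth must be >= 1")
    max_in = (1 << bits) - 1
    if not 0 <= sample <= max_in:
        raise ValueError("sample out of range for the given bit depth")
    # Integer arithmetic with round-half-up avoids floating-point drift.
    return (sample * 255 + max_in // 2) // max_in
```

The same function covers both directions: `bits=16` or `bits=12` scales down, while `bits=1` or `bits=4` scales up.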

Once all post-decode image processing is implemented, the final step will be translating the DIB (Device Independent Bitmap) to PNG for final export from the library.

So it's coming, but not soon.
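For reference, that last step needs very little machinery. A minimal sketch of an uncompressed-filter PNG writer, assuming the pixels are already top-down, 8-bit RGB triples (a real DIB is often bottom-up BGR and would need reordering first; the function names are hypothetical):

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(tag: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(tag + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + tag + payload + struct.pack(">I", crc)


def rgb_to_png(width: int, height: int, rgb: bytes) -> bytes:
    """Encode top-down 8-bit RGB pixel data as a PNG byte string."""
    stride = width * 3
    if len(rgb) != stride * height:
        raise ValueError("pixel buffer does not match the image dimensions")
    # Each scanline is prefixed with filter type 0 (None).
    raw = b"".join(b"\x00" + rgb[y * stride:(y + 1) * stride] for y in range(height))
    # IHDR: width, height, bit depth 8, colour type 2 (truecolour), defaults.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw))
        + _chunk(b"IEND", b"")
    )
```

Masks (alpha/transparency) would change the colour type and add a fourth byte per pixel, which this sketch leaves out.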

fnatzke avatar Jan 09 '23 04:01 fnatzke

@fnatzke I've implemented the 4 function types in https://github.com/UglyToad/PdfPig/pull/557 and the separation colorspace now loads the actual function.

Also, you can find a lot of strange PDFs (I'm pretty sure you'll find PDFs using the separation colorspace) here: https://github.com/pdf-association/pdf-corpora#safedocs-issue-tracker-corpus

BobLd avatar Mar 04 '23 14:03 BobLd

Well, that's a surprise. Think I've recovered. Is there a way we can coordinate contributions?

I've published your link and the URLs of the PDFs I've found (about 30,000, ~50GB) in https://github.com/UglyToad/PdfPig/discussions/302.

For the URLs I've found, I've provided a download command line in the discussion. Hope it can help someone.

fnatzke avatar Mar 22 '23 01:03 fnatzke

@fnatzke I'm going to create a discussion where we can coordinate

BobLd avatar Mar 22 '23 19:03 BobLd

@BobLd do you happen to know the current state of this? Is DCT support now complete?

EliotJones avatar May 21 '23 11:05 EliotJones

@EliotJones as far as I know it's not. Not sure if @fnatzke is still working on that or not

BobLd avatar May 21 '23 14:05 BobLd