pdfminer.six

Fix: image extraction #853, #888, #1117 and some other stuff

Open lambdalemon opened this issue 7 months ago • 3 comments

Pull request

  • Fixes https://github.com/pdfminer/pdfminer.six/issues/888.
  • Fixes https://github.com/pdfminer/pdfminer.six/issues/1117.
  • Add support for images with Indexed Colorspace, which fixes https://github.com/pdfminer/pdfminer.six/issues/853
  • Removes an extraneous inversion of greyscale images introduced in https://github.com/pdfminer/pdfminer.six/pull/827. That inversion was added for the image in https://github.com/pdfminer/pdfminer.six/issues/795, but it only appeared correct because that image has an Indexed Colorspace that maps [0, 181] to [white, black].
  • CMYK JPEG images are saved as is instead of being converted to RGB. The conversion was added back in 2012 in https://github.com/pdfminer/pdfminer.six/commit/6413eb7de4a0d9e96d0605d4c0d8f1680a8ad0ca. Support for CMYK JPEG has improved since then, so it shouldn't be needed?
  • Other CMYK images are saved in TIFF to avoid lossy compression.
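
To illustrate the Indexed Colorspace point, here is a minimal sketch of a palette lookup, assuming 8-bit indices and a lookup table packed as consecutive base-colorspace samples. The names `expand_indexed`, `lookup` and `n_components` are hypothetical and are not pdfminer.six's API; the two-entry palette is made up for illustration. It shows why raw indices can look inverted: when index 0 maps to white, treating the indices as grey levels gives a negative, so the fix is a palette lookup rather than an unconditional invert.

```python
def expand_indexed(indices: bytes, lookup: bytes, n_components: int) -> bytes:
    """Map 8-bit palette indices to samples in the base colorspace.

    Assumption: `lookup` holds (hival + 1) entries of `n_components`
    bytes each, e.g. 1 for DeviceGray, 3 for DeviceRGB, 4 for DeviceCMYK.
    """
    if n_components < 1 or len(lookup) % n_components:
        raise ValueError("lookup length must be a multiple of n_components")
    hival = len(lookup) // n_components - 1
    out = bytearray()
    for index in indices:
        if index > hival:
            raise ValueError(f"index {index} exceeds hival {hival}")
        start = index * n_components
        out += lookup[start:start + n_components]
    return bytes(out)


# Hypothetical two-entry grey palette: index 0 is white, index 1 is black.
grey_lookup = bytes([255, 0])
pixels = expand_indexed(bytes([0, 1, 0]), grey_lookup, n_components=1)
```

The same function covers an Indexed DeviceCMYK image by passing `n_components=4`; the output would then be CMYK samples that still need a CMYK-capable container such as TIFF.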

How Has This Been Tested?

Looking at extracted images

Checklist

  • [x] I have read CONTRIBUTING.md.
  • [x] I have added a concise human-readable description of the change to CHANGELOG.md.
  • [x] I have tested that this fix is effective or that this feature works.
  • [x] I have added docstrings to newly created methods and classes.
  • [x] I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

lambdalemon avatar May 16 '25 11:05 lambdalemon

Example PDF containing images with an Indexed DeviceCMYK Colorspace: https://arxiv.org/pdf/2308.04079

lambdalemon avatar May 17 '25 07:05 lambdalemon

Wow, thanks! That looks like some good improvements.

But can you split the MR per issue? That will allow me to review it, verify (or at least try to) that it has no unintended consequences, and keep the commit history clean.

pietermarsman avatar May 26 '25 16:05 pietermarsman

Pillow is not necessary to save JPEG2000 to files. I'm not sure where the pdfminer.six team got this idea, but it has probably cost the world a lot of CPU cycles over the years (much like the insistence on parsing/unparsing JBIG2 noted in #654).
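
As a rough sketch of that point, assuming `data` holds the undecoded bytes of a JPXDecode stream, the bytes can be written to disk unchanged. The helper names `jpx_extension` and `save_jpx_stream` are hypothetical, not part of pdfminer.six; the only logic is sniffing the signature to pick a file extension, and the fallback extension is an arbitrary choice.

```python
from pathlib import Path

# JP2 signature box (12 bytes) and the start-of-codestream marker of a
# raw JPEG 2000 codestream.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")
J2K_SOC = bytes.fromhex("ff4f")


def jpx_extension(data: bytes) -> str:
    """Pick a file extension from the first bytes of the stream."""
    if data.startswith(JP2_SIGNATURE):
        return ".jp2"
    if data.startswith(J2K_SOC):
        return ".j2k"
    return ".jp2"  # assumption: default when the signature is unrecognised


def save_jpx_stream(data: bytes, stem: str) -> Path:
    """Write the stream bytes as they are, without decoding or re-encoding."""
    path = Path(stem + jpx_extension(data))
    path.write_bytes(data)
    return path
```

No image library is involved, so no pixel data is decoded.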

dhdaines avatar Jun 17 '25 18:06 dhdaines

Closing this because of no response.

Code can be used as inspiration if the issues are picked up in the future.

pietermarsman avatar Nov 07 '25 20:11 pietermarsman