Fix: image extraction #853, #888, #1117 and some other stuff
Pull request
- Fixes https://github.com/pdfminer/pdfminer.six/issues/888.
- Fixes https://github.com/pdfminer/pdfminer.six/issues/1117.
- Adds support for images with an Indexed colorspace, which fixes https://github.com/pdfminer/pdfminer.six/issues/853.
- Removes an extraneous inversion of greyscale images introduced in https://github.com/pdfminer/pdfminer.six/pull/827. That inversion was added for the image in https://github.com/pdfminer/pdfminer.six/issues/795, but it only appeared correct because that image has an Indexed colorspace that maps [0, 181] to [white, black].
- CMYK JPEG images are now saved as-is instead of being converted to RGB. The conversion was added back in 2012 in https://github.com/pdfminer/pdfminer.six/commit/6413eb7de4a0d9e96d0605d4c0d8f1680a8ad0ca. Support for CMYK JPEG has improved since then, so the conversion shouldn't be needed anymore?
- Other CMYK images are saved in TIFF to avoid lossy compression.
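
To illustrate the Indexed colorspace point, here is a minimal sketch of a palette lookup. It is not the code from this PR: `expand_indexed` is a hypothetical helper, and it assumes 8 bits per index and an already-decoded palette. It also shows why inverting the samples can look right by accident when the palette happens to run from white to black.

```python
def expand_indexed(indices: bytes, palette: bytes, n_components: int) -> bytes:
    """Replace each 8-bit palette index with its base-colorspace components.

    Hypothetical helper for illustration; assumes one index per byte.
    """
    out = bytearray()
    for i in indices:
        start = i * n_components
        entry = palette[start:start + n_components]
        if len(entry) != n_components:
            raise ValueError(f"index {i} is outside the palette")
        out += entry
    return bytes(out)


# Two-entry grey palette: index 0 -> white (255), index 1 -> black (0).
grey_palette = bytes([255, 0])
grey_pixels = expand_indexed(bytes([0, 1, 1, 0]), grey_palette, 1)

# The same lookup works for an Indexed DeviceCMYK image (4 components per entry).
cmyk_palette = bytes([0, 0, 0, 0, 0, 0, 0, 255])
cmyk_pixels = expand_indexed(bytes([1, 0]), cmyk_palette, 4)
```

With a palette like `grey_palette`, treating the raw indices as grey levels and inverting them gives roughly the right picture, but only for that palette. Doing the lookup is correct for any palette.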
How Has This Been Tested?
By visually inspecting the extracted images.
Checklist
- [x] I have read CONTRIBUTING.md.
- [x] I have added a concise human-readable description of the change to CHANGELOG.md.
- [x] I have tested that this fix is effective or that this feature works.
- [x] I have added docstrings to newly created methods and classes.
- [x] I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.
Example PDF with images that use an Indexed DeviceCMYK colorspace: https://arxiv.org/pdf/2308.04079
Wow, thanks! That looks like some good improvements.
But could you split the MR per issue? That would allow me to review it, verify (or at least try to) that it has no unintended consequences, and keep the commit history clean.
Pillow is not necessary to save JPEG2000 images to files. I'm not sure where the pdfminer.six team got this idea, but it has probably cost the world a lot of CPU cycles over the years (much like the insistence on parsing/unparsing JBIG2 noted in #654).
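
As a rough sketch of what that could look like, the still-encoded stream bytes can be written straight to disk. This assumes the caller already has the undecoded image stream. `jpeg2000_extension` and `save_jpeg2000` are hypothetical names, not pdfminer.six API. Note that any colorspace override in the PDF image dictionary is not reflected in the saved file.

```python
from pathlib import Path

# 12-byte signature box that opens a JP2 file.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")
# SOC marker that opens a bare JPEG 2000 codestream.
J2K_SOC = b"\xff\x4f"


def jpeg2000_extension(data: bytes) -> str:
    """Pick a file extension from the first bytes of the stream."""
    if data.startswith(JP2_SIGNATURE):
        return ".jp2"
    if data.startswith(J2K_SOC):
        return ".j2k"
    raise ValueError("data does not look like JPEG 2000")


def save_jpeg2000(data: bytes, stem: Path) -> Path:
    """Write the encoded bytes unchanged; no decoding or re-encoding."""
    path = stem.parent / (stem.name + jpeg2000_extension(data))
    path.write_bytes(data)
    return path
```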
Closing this because of no response.
Code can be used as inspiration if the issues are picked up in the future.