pdf-toolbox Extracting images and diagrams from XObjects

Open coderfromhere opened this issue 4 years ago • 3 comments

Hi!

Would you be open to extending XObjects with an API that deals only with extracting non-text data (images, in my particular case)? Something along the lines of https://blog.idrsolutions.com/how-images-are-stored-in-pdf/

Hopefully I might be able to contribute it as part of my working hours; I'm exploring my options at the moment :)

Jul 29 '21 11:07 coderfromhere

@coderfromhere Sure thing! A few issues to consider:

The API for extraction should be designed carefully. It should be easy to use and, at the same time, should not impose arbitrary restrictions.
The dependency footprint should not grow too much. You might need libraries for different types of images, right?

Jul 29 '21 11:07 Yuras

So far I was just thinking of streaming bitmaps/arbitrary bytes (without a full-featured format recogniser) into arbitrary locations, either through conduit or, since I see it is already among the dependencies, io-streams.

Jul 29 '21 11:07 coderfromhere

So it's a low-level API to extract raw image data + metadata. Sounds reasonable to me! I'd prefer io-streams since it is already a dependency.

Jul 29 '21 12:07 Yuras
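
For illustration, here is a minimal sketch of the shape such a low-level "raw bytes + metadata" API could take on top of io-streams. Everything named here (`ImageInfo`, `RawImage`, `mockImage`, `saveRaw`) is hypothetical and not part of pdf-toolbox; only the `System.IO.Streams` functions are real. The metadata fields are assumed to mirror the usual image XObject dictionary entries (Width, Height, BitsPerComponent, ColorSpace, Filter), and `mockImage` stands in for the actual extraction from a PDF.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Data.ByteString   (ByteString)
import qualified Data.ByteString   as BS
import           System.IO.Streams (InputStream)
import qualified System.IO.Streams as Streams

-- Hypothetical metadata record, modelled on the image XObject dictionary.
data ImageInfo = ImageInfo
  { imgWidth            :: Int
  , imgHeight           :: Int
  , imgBitsPerComponent :: Int
  , imgColorSpace       :: ByteString    -- e.g. "DeviceRGB"
  , imgFilters          :: [ByteString]  -- e.g. ["DCTDecode"]; left undecoded
  } deriving (Show, Eq)

-- Hypothetical result type: metadata plus the undecoded stream content.
data RawImage = RawImage
  { rawInfo  :: ImageInfo
  , rawBytes :: InputStream ByteString
  }

-- Stand-in for real extraction: a 2x2, 8-bit grayscale image in two chunks.
mockImage :: IO RawImage
mockImage = do
  let info = ImageInfo 2 2 8 "DeviceGray" []
  body <- Streams.fromList [BS.pack [0, 255], BS.pack [255, 0]]
  return (RawImage info body)

-- Drain the raw bytes into a file; no format recognition is attempted.
saveRaw :: FilePath -> RawImage -> IO ()
saveRaw path img =
  Streams.withFileAsOutput path (Streams.connect (rawBytes img))

main :: IO ()
main = do
  img <- mockImage
  print (rawInfo img)
  -- The stream is single-pass: consume it here or via saveRaw, not both.
  chunks <- Streams.toList (rawBytes img)
  print (BS.length (BS.concat chunks))
```

Under these assumptions the caller decides where the bytes go (a file via `saveRaw`, or any other `OutputStream`), and the library needs no image-format dependencies because decoding is left to the consumer based on `imgFilters`.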
