pdf-toolbox

Extracting images and diagrams from XObjects

Open coderfromhere opened this issue 4 years ago • 3 comments

Hi!

Would you be open to extending XObjects with an API that deals only with extracting non-text data (images, in my particular case)? Something along the lines of https://blog.idrsolutions.com/how-images-are-stored-in-pdf/

Hopefully I'll be able to contribute it as part of my working hours; I'm exploring my options at the moment :)

coderfromhere avatar Jul 29 '21 11:07 coderfromhere

@coderfromhere Sure thing! A few issues to consider:

  • The API for extraction should be designed carefully. It should be easy to use and, at the same time, not impose arbitrary restrictions.
  • The dependency footprint should not grow too much. You might need libraries for different types of images, right?

Yuras avatar Jul 29 '21 11:07 Yuras

So far I've just been thinking of streaming bitmaps/random bytes (without a full-featured format recogniser) into arbitrary locations, either through conduit or through io-streams, which I see is already among the dependencies.

coderfromhere avatar Jul 29 '21 11:07 coderfromhere

So it's a low-level API to extract raw image data + metadata. Sounds reasonable to me! I'd prefer io-streams since it's already a dependency.

Yuras avatar Jul 29 '21 12:07 Yuras
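
For illustration only, the "raw image data + metadata over io-streams" idea discussed above could be sketched roughly as follows. Nothing here is pdf-toolbox API: `ImageMeta`, `exampleMeta`, `copyRawImage`, and `saveRawImage` are hypothetical names, and the sketch assumes the caller has already obtained an `InputStream ByteString` for the image stream by some other means. The record fields mirror standard entries of a PDF image XObject dictionary (`/Width`, `/Height`, `/BitsPerComponent`, `/ColorSpace`, `/Filter`), with the color space simplified to a plain name.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import System.IO.Streams (InputStream, OutputStream)
import qualified System.IO.Streams as Streams

-- | Hypothetical metadata record (not part of pdf-toolbox).
-- Fields follow the usual entries of an image XObject dictionary.
data ImageMeta = ImageMeta
  { imageWidth            :: Int
  , imageHeight           :: Int
  , imageBitsPerComponent :: Maybe Int        -- not always present
  , imageColorSpace       :: Maybe ByteString -- simplification: name only
  , imageFilters          :: [ByteString]     -- e.g. ["DCTDecode"]
  } deriving (Eq, Show)

-- | Example value, roughly what a JPEG-encoded RGB image might report.
exampleMeta :: ImageMeta
exampleMeta = ImageMeta
  { imageWidth            = 640
  , imageHeight           = 480
  , imageBitsPerComponent = Just 8
  , imageColorSpace       = Just "DeviceRGB"
  , imageFilters          = ["DCTDecode"]
  }

-- | Copy every chunk from the source to the sink without interpreting it,
-- and return the number of bytes copied. End-of-stream is NOT written to
-- the sink, so the caller decides when to close it.
copyRawImage :: InputStream ByteString -> OutputStream ByteString -> IO Int
copyRawImage src sink = go 0
  where
    go n = do
      mchunk <- Streams.read src
      case mchunk of
        Nothing    -> return n
        Just chunk -> do
          Streams.write (Just chunk) sink
          let n' = n + BS.length chunk
          n' `seq` go n'

-- | Convenience wrapper: dump the raw bytes into a file.
saveRawImage :: FilePath -> InputStream ByteString -> IO Int
saveRawImage path src = Streams.withFileAsOutput path (copyRawImage src)
```

Because no format recognition is attempted, the bytes written are whatever the stream yields; a consumer would look at `imageFilters` to decide how to interpret them (for example, data encoded with `DCTDecode` is JPEG). This keeps the dependency footprint at io-streams and bytestring.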