change request: give me an option to not load image contents into memory when calling PdfDocument.GetPages

Open HertBp opened this issue 3 months ago • 1 comment

First off, thank you for all the hard work. The problem I am encountering is the following: we use Azure Batch with cheap nodes to process PDFs in bulk, and these cheap nodes have little RAM. Usually this works fine. However, some publishers insist on cramming their PDFs with high-resolution images. I am not interested in the contents of these images, only their size and location. But it looks like the contents are loaded into RAM anyway when I call PdfDocument.GetPages. This not only takes a lot of CPU, but also causes System.OutOfMemoryException errors, which means we cannot parse the PDF (and have just wasted a bunch of processing time). The root cause is publishers putting these stupidly detailed pictures in their PDFs, but we have no power to stop them. So if I could tell PdfPig not to load the pictures into memory, just their PdfRectangle, it would save a huge amount of money.

edit: if this already exists, please point me to it, I could not find it

Sep 26 '25 08:09 HertBp

> edit: if this already exists, please point me to it, I could not find it

There's some discussion about having flags to disable some of the processing over in https://github.com/UglyToad/PdfPig/issues/980#issuecomment-3092634566

Sep 29 '25 14:09 Numpsy