
Mixed dpi images per pdf page - configurable dpi default and/or mixed mode?

Open clowtown opened this issue 6 years ago • 1 comments

Has anybody else run into the issue where a single PDF page can contain several images at different resolutions when evaluated with poppler? The problem is that 300 dpi is enforced by default, whereas I need 150 dpi, which gives better (proven) OCR results. Other, smaller tables need an even lower dpi to be processed properly. See below. I'm considering a potentially slower solution: running the processes per page and per image, then stitching the final results together for tesseract to process.

[image]

clowtown, Aug 31 '17 19:08

I've decided (for now) to let the user configure an optional minimum dpi threshold as well as multiple image types (image, stencil, etc.). In addition, for the time being I'm taking the largest x/y dpi values across all image types, rather than using frequency to determine the best resolution. Check out my fork on /clowtown/
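
For anyone following along, here is a rough sketch of that selection rule. It is not the code from the fork. It assumes the resolutions come from the text printed by poppler's `pdfimages -list`, with x-ppi and y-ppi in the 13th and 14th columns. The names `parse_pdfimages_list` and `pick_dpi` are made up for illustration, the sample listing is invented, and treating the minimum threshold as "ignore images below it" is only one possible reading.

```python
def parse_pdfimages_list(listing):
    """Parse `pdfimages -list` text into dicts with page, type, x_ppi, y_ppi."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with an integer page number; skip header and dashes.
        if len(fields) < 14 or not fields[0].isdigit():
            continue
        rows.append({
            "page": int(fields[0]),
            "type": fields[2],
            "x_ppi": float(fields[12]),
            "y_ppi": float(fields[13]),
        })
    return rows


def pick_dpi(rows, types=("image", "stencil"), min_dpi=None):
    """Return {page: dpi}, using the largest x/y ppi among the allowed types.

    Assumption: images whose best ppi is below min_dpi are ignored.
    """
    per_page = {}
    for row in rows:
        if row["type"] not in types:
            continue
        best = max(row["x_ppi"], row["y_ppi"])
        if min_dpi is not None and best < min_dpi:
            continue
        per_page[row["page"]] = max(per_page.get(row["page"], 0), best)
    return per_page


# Invented listing in the shape of `pdfimages -list` output.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1240  1754  gray    1   8  jpeg   no         7  0   150   150  212K  10%
   1     1 stencil   600   200  -       1   1  ccitt  no         8  0    72    72 3012B  20%
   2     0 image    2480  3508  rgb     3   8  jpeg   no        12  0   300   301  1.1M  4.3%
   2     1 smask    2480  3508  gray    1   8  image  no        12  0   600   600   90K  1.0%
"""

if __name__ == "__main__":
    print(pick_dpi(parse_pdfimages_list(SAMPLE), min_dpi=100))
```

Taking the maximum is simple but lets one high-resolution image set the dpi for a whole page; a frequency-based choice would need the same parsed rows, only a different reduction step.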

clowtown, Sep 05 '17 14:09