
Suggestions to improve ClearScan

Open raffaem opened this issue 1 year ago • 1 comments

Thanks for your hard work and for creating such a great tool! I've got some questions:

It seems that higher-quality ClearScan (with a higher --clearscan-upscaling-factor) increases the file size a lot. It also messes up the figures.

Is it possible to apply OCR first and run ClearScan only on the areas that contain text? I'm not sure how this would be done, but if I understand correctly, Acrobat's ClearScan creates a kind of font to reduce the size of the final file. Is there any way to implement this kind of feature?

Thanks a lot!

Originally posted by @c0rychu in https://github.com/raffaem/pdfsak/discussions/12

raffaem avatar Jul 17 '22 05:07 raffaem

Hi,

thanks for the suggestions.

The best way to mimic Adobe and the new fonts it creates is probably to compress the resulting PDF with JBIG2.

There is an open source encoder here.

The problems are: (1) the encoder seems abandoned (last commit dated 2019); (2) you need to compile it from source; (3) ImageMagick doesn't seem to support JBIG2 compression for PDF files.

Regarding excluding images from ClearScan, I'm afraid it would be very very difficult.

PDFsak operates differently from Adobe.

The steps PDFsak takes to mimic ClearScan are the following:

  1. The PDF is converted into an image
  2. The image is passed to potrace
  3. The traced result is converted back into PDFs, which are then merged
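
The three steps above can be sketched roughly as follows. This is only an illustration, not PDFsak's actual code: it assumes `pdftoppm` and `pdfunite` (from Poppler) and `potrace` are installed, and the function names, the `page` file prefix, and the 300 dpi default are all made up for the example.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf: Path, prefix: Path, dpi: int) -> list[str]:
    # Step 1: pdftoppm renders each page to a 1-bit PBM file (-mono)
    # at the given resolution (-r), named <prefix>-<page number>.pbm.
    return ["pdftoppm", "-mono", "-r", str(dpi), str(pdf), str(prefix)]


def trace_cmd(bitmap: Path, out_pdf: Path, dpi: int) -> list[str]:
    # Step 2: potrace vectorizes the bitmap. "-b pdf" selects the PDF
    # backend; "-r" tells potrace the bitmap resolution so the traced
    # page keeps its physical size.
    return ["potrace", "-b", "pdf", "-r", str(dpi),
            "-o", str(out_pdf), str(bitmap)]


def merge_cmd(pages: list[Path], merged: Path) -> list[str]:
    # Step 3: pdfunite concatenates the per-page PDFs (output goes last).
    return ["pdfunite", *[str(p) for p in pages], str(merged)]


def clearscan_sketch(pdf: Path, workdir: Path, merged: Path,
                     dpi: int = 300) -> None:
    """Hypothetical end-to-end run of the three steps."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, workdir / "page", dpi), check=True)

    traced = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for bitmap in sorted(workdir.glob("page-*.pbm")):
        out_pdf = bitmap.with_suffix(".pdf")
        subprocess.run(trace_cmd(bitmap, out_pdf, dpi), check=True)
        traced.append(out_pdf)

    subprocess.run(merge_cmd(traced, merged), check=True)
```

In a pipeline shaped like this, step 1 flattens the whole page into a single bitmap, so any embedded figures are traced together with the text. That is why excluding them is hard.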

I currently don't have a clear idea how we can exclude existing images from this process.

raffaem avatar Jul 17 '22 05:07 raffaem