OCR makes B&W PDF files too big

Open NextTherapist opened this issue 1 year ago • 6 comments

Describe the bug OCR makes B&W PDF files incomprehensibly big.

To Reproduce Steps to reproduce the behavior:

  1. Scan a single-sided DIN A4 sheet in B&W with some B&W text on it: a typical letter with a few lines of 12pt black text. Then save it as PDF/A-2b without running OCR first. You get a wonderfully small file of about 20-30 KB.

  2. Take the same scan, activate OCR (German), and save it again as PDF/A-2b. Now you get a file of about 70 KB, although the recognized text accounts for only about 3-4 KB (difference measured with other OCR software).

So something in these OCRed files seems to be wrong. Perhaps after OCR the image is no longer saved with CCITT compression but as greyscale? I have no way to check or control that.
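One rough way to test the compression guess is to look at which stream filters the two PDFs declare. The following is only a minimal sketch using the Python standard library: the function name `stream_filters` is made up for this illustration, and it simply searches the raw bytes for `/Filter` entries, so it misses any dictionaries stored inside compressed object streams and reports only the first filter of a filter array.

```python
import re

# Heuristic only: matches "/Filter /Name" or "/Filter [/Name ..." in raw PDF bytes.
# Dictionaries packed into compressed object streams are not visible this way.
FILTER_RE = re.compile(rb"/Filter\s*(?:\[\s*)?/([A-Za-z0-9]+)")


def stream_filters(pdf_bytes: bytes) -> dict:
    """Count the first-listed filter name of every /Filter entry found."""
    counts = {}
    for match in FILTER_RE.finditer(pdf_bytes):
        name = match.group(1).decode("ascii")
        counts[name] = counts.get(name, 0) + 1
    return counts
```

Running it on both files (for example `stream_filters(open("scan.pdf", "rb").read())`, with a hypothetical file name) shows whether `CCITTFaxDecode` still appears after OCR or has been replaced by something like `FlateDecode` or `DCTDecode`.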

Expected behavior The file made in step 2 should be at most about 34 KB, not 70 KB.

Desktop (please complete the following information):

  • OS: Windows 10
  • Version: 7.4.0 32 bit

NextTherapist avatar Apr 04 '24 15:04 NextTherapist

The extra size is from embedding the font used to render the text, which is required by the PDF-A standard.

cyanfish avatar Apr 04 '24 18:04 cyanfish

I made some tests:

NAPS2 embeds a font whenever the file contains OCR text, independent of the PDF version! Files with OCR contain a subset of Times New Roman; files without OCR do not, and that includes PDF/A-2b files without OCR.
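A check like this can be approximated without a PDF editor. The sketch below is an assumption-laden illustration, not how NAPS2 or any PDF tool does it: `font_summary` is a hypothetical helper that scans the raw bytes for `/BaseFont` names and `/FontFile`, `/FontFile2`, or `/FontFile3` references, so it will miss fonts whose dictionaries sit inside compressed object streams. A name that starts with six uppercase letters and a `+` is the PDF convention for a subset font.

```python
import re

# Heuristic only: works on font dictionaries that are stored uncompressed.
BASEFONT_RE = re.compile(rb"/BaseFont\s*/([^\s/<>\[\]()]+)")
FONTFILE_RE = re.compile(rb"/FontFile[23]?\b")
SUBSET_RE = re.compile(r"[A-Z]{6}\+.+")


def font_summary(pdf_bytes: bytes):
    """Return (sorted font names, subset font names, count of embedded font file references)."""
    names = sorted({m.group(1).decode("latin-1") for m in BASEFONT_RE.finditer(pdf_bytes)})
    subsets = [n for n in names if SUBSET_RE.fullmatch(n)]
    embedded_refs = len(FONTFILE_RE.findall(pdf_bytes))
    return names, subsets, embedded_refs
```

An empty name list together with zero font file references suggests a pure raster PDF; a subset name plus a font file reference matches the behaviour described above for files with OCR.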

It should not be necessary to embed a font just because there is an invisible OCR text layer in the file.

And of course it would not be necessary to embed a font just because the file is PDF/A. As long as the file content is only a raster image, no font is needed to display the content reliably, so there is no reason to embed one. But as I said, NAPS2 gets this right: there is no font in the scanned, OCR-free PDF/A; the font comes from OCR.

NextTherapist avatar Apr 05 '24 07:04 NextTherapist

Some OCR software uses a "fake" font instead of embedding a real font, but (a) that means the character measurements are off, which can cause alignment issues, and (b) that can cause various compatibility problems.

In theory it could be possible to provide an option to use a fake font like that, but I'm probably not going to do that.

cyanfish avatar Apr 05 '24 16:04 cyanfish

I have now compared the OCR results of NAPS2 and PDF24, since both are based on Tesseract.

NAPS2 with OCR.pdf PDF24 with OCR.pdf

The PDF24 file is 65 KB smaller, and its alignment does not seem any less accurate to me. It has "GlyphLessFont" embedded, which is perhaps what you meant.

NextTherapist avatar Apr 08 '24 09:04 NextTherapist

Perhaps the file sizes are bigger because NAPS2 uses PDFium for PDF generation instead of Ghostscript?

NextTherapist avatar Apr 16 '24 14:04 NextTherapist

I did a test with the "NAPS2 with OCR.pdf" file from above and optimized it with PDF XChange Editor, which mainly means it removed fonts. The result is a file of only 156 KB, very similar to the PDF24 file. NAPS2.with.OCR_Optimized_A2b.pdf

I wanted to test whether an embedded font is necessary at all for OCR, and yes, one font is still embedded: it is called "Untitled Truetype (CID) Identity-H", and the precision of the OCR positions seems to be fine.

It would be great if NAPS2 could make such a small file by itself.

NextTherapist avatar Apr 25 '24 15:04 NextTherapist