JPG parsed significantly better than PDF

Open • pfletcherhill opened this issue 9 years ago • 1 comment

When parsing the same file as a PDF instead of a JPG, I got far worse results. Is there an obvious reason for this difference?

pfletcherhill, Apr 20 '15 17:04

I stumbled upon the same issue. Because tesseract cannot process PDF files, they have to be converted to images first (e.g. TIFF). The gem does this, but at low resolution and color depth. The conversion needs a higher resolution (300 dpi) and a higher color depth (8 bit) to give good results.

With ImageMagick's command-line tool convert, the options -density 300 -depth 8 do the job.
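As a rough sketch of that conversion step from Ruby, the snippet below only builds the convert argument list using the -density and -depth values mentioned here. The helper name pdf_to_tiff_command and the file names are hypothetical, and this is not how the gem itself performs the conversion.

```ruby
require "shellwords"

# Build the ImageMagick argument list for rasterizing a PDF before OCR.
# The -density and -depth defaults are the values from this thread;
# the helper name and file paths are made up for illustration.
def pdf_to_tiff_command(pdf_path, tiff_path, density: 300, depth: 8)
  # -density is placed before the input file so that it applies
  # while the PDF is being rasterized.
  ["convert", "-density", density.to_s, "-depth", depth.to_s, pdf_path, tiff_path]
end

cmd = pdf_to_tiff_command("scan.pdf", "scan.tiff")
puts cmd.shelljoin

# To actually run the conversion (assumes ImageMagick is installed
# with PDF support, which normally relies on Ghostscript):
#   system(*cmd) or raise "convert failed"
```

The resulting TIFF can then be handed to the gem in place of the original PDF. Passing the arguments as an array to system avoids shell quoting problems with unusual file names.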

More details here: http://kiirani.com/2013/03/22/tesseract-pdf.html

ledermann, Sep 23 '15 09:09