rtesseract JPG parsed significantly better than PDF

Open pfletcherhill opened this issue 9 years ago • 1 comment

When parsing the same file as a PDF instead of a JPG, I got far worse results. Is there an obvious reason for this difference?

Apr 20 '15 17:04 pfletcherhill

I ran into the same issue. Tesseract cannot read PDF files directly, so they have to be converted to images first (e.g. TIFF). The gem does this conversion, but at a low resolution and color depth. To get good OCR results, the conversion needs a higher resolution (300 dpi) and a higher color depth (8 bit).

With ImageMagick's convert command-line tool, the options -density 300 -depth 8 do the job.

More details here: http://kiirani.com/2013/03/22/tesseract-pdf.html
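
As a rough sketch of the workaround above, the conversion can be done from Ruby by shelling out to ImageMagick before handing the image to the gem. The helper names pdf_to_tiff_command and convert_pdf_to_tiff are made up for illustration and are not part of rtesseract; the sketch also assumes that convert is on the PATH (newer ImageMagick releases ship it as magick) and that Ghostscript is installed so ImageMagick can read PDFs.

```ruby
# Hypothetical helpers, not part of the rtesseract gem.

# Build the ImageMagick command for rasterizing a PDF before OCR.
# -density comes before the input file so it applies while the PDF is read;
# -depth comes before the output file so it applies to the written image.
def pdf_to_tiff_command(pdf_path, tiff_path, density: 300, depth: 8)
  ["convert", "-density", density.to_s, pdf_path, "-depth", depth.to_s, tiff_path]
end

# Run the conversion and return the TIFF path.
# Raises if convert is missing or exits with a non-zero status.
def convert_pdf_to_tiff(pdf_path, tiff_path)
  ok = system(*pdf_to_tiff_command(pdf_path, tiff_path))
  raise "convert failed for #{pdf_path}" unless ok
  tiff_path
end
```

The resulting file can then be passed to the gem as usual, for example RTesseract.new(convert_pdf_to_tiff("scan.pdf", "scan.tiff")).to_s, where the file names are placeholders.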

Sep 23 '15 09:09 ledermann
