rtesseract
JPG parsed significantly better than PDF
When parsing the same file as a PDF instead of a JPG, I got far worse results. Is there an obvious reason for this difference?
I ran into the same issue. Because tesseract cannot process PDF files, they have to be converted to images first (e.g. TIFF). The gem does this, but uses a low resolution and color depth. The conversion needs to be done with a higher resolution (300 dpi) and a higher color depth (8 bit).
Using the command line tool convert from ImageMagick, the params -density 300 -depth 8 do the job.
More details here: http://kiirani.com/2013/03/22/tesseract-pdf.html
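For anyone who wants to do the conversion from Ruby before handing the file to the gem, here is a minimal sketch of the approach described above. The helper names (pdf_to_image_command, convert_pdf_for_ocr) and the file names are made up for illustration and are not part of rtesseract; it assumes ImageMagick's convert is on the PATH and can read PDFs (which usually also requires Ghostscript).

```ruby
require "open3"

# Hypothetical helper: builds the ImageMagick argument list for rasterizing
# a PDF with the settings suggested above (300 dpi, 8 bit).
def pdf_to_image_command(pdf_path, image_path, density: 300, depth: 8)
  # -density is placed before the input file so that it applies while the
  # PDF is being rasterized; -depth applies to the image that is written.
  ["convert", "-density", density.to_s, pdf_path, "-depth", depth.to_s, image_path]
end

# Hypothetical helper: runs the conversion and returns the image path.
# Assumes the convert binary is installed and available on the PATH.
def convert_pdf_for_ocr(pdf_path, image_path)
  cmd = pdf_to_image_command(pdf_path, image_path)
  _stdout, stderr, status = Open3.capture3(*cmd)
  raise "convert failed: #{stderr}" unless status.success?
  image_path
end
```

The resulting image can then be passed to the gem in place of the PDF, for example convert_pdf_for_ocr("scan.pdf", "scan.tiff") followed by running rtesseract on scan.tiff. The arguments are passed as an array rather than a single string so that file names with spaces do not need shell quoting.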