pdfocr

This package does not work for me...

Open sch82812121 opened this issue 10 years ago • 6 comments

Somehow, it does not seem to work with my directory layout (Ubuntu 10.04). This seems to be a tesseract-related issue (Cuneiform seems to work)...

pdfocr -i beleg0059.pdf -o b59.pdf
Input file is /home/samba-shares/family/scans/beleg0059.pdf
Output file is /home/samba-shares/family/scans/b59.pdf
Using working dir /tmp/d20131230-26500-1fddng
Getting info from PDF file
Warning: no info dictionary found
NumberOfPages: 4
Converting 4 pages

Extracting page 1
Converting page 1 to ppm
Running OCR on page 1
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/1.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `1.hocr.html': No such file or directory
Error while running OCR on page 1

Extracting page 2
Converting page 2 to ppm
Running OCR on page 2
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/2.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `2.hocr.html': No such file or directory
Error while running OCR on page 2

Extracting page 3
Converting page 3 to ppm
Running OCR on page 3
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/3.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `3.hocr.html': No such file or directory
Error while running OCR on page 3

Extracting page 4
Converting page 4 to ppm
Running OCR on page 4
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/4.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `4.hocr.html': No such file or directory
Error while running OCR on page 4

Merging together PDF files
/tmp/d20131230-26500-1fddng/-new.pdf not found as file or resource.
Error: Failed to open PDF file: /tmp/d20131230-26500-1fddng/-new.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.
Updating PDF info for /home/samba-shares/family/scans/b59.pdf
/tmp/d20131230-26500-1fddng/merged.pdf not found as file or resource.
Error: Failed to open PDF file: /tmp/d20131230-26500-1fddng/merged.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.
Cleaning up temporary files

sch82812121 Dec 30 '13 09:12

Hi, and thanks for sharing your code!

I have the same issue as @sch82812121, with a log such as:

Converting 4 pages
==========
Extracting page 1
Converting page 1 to ppm
Running OCR on page 1
Tesseract Open Source OCR Engine v3.03 with Leptonica
mv: cannot stat ‘1.hocr.html’: No such file or directory
Error while running OCR on page 1
==========

and so on for each of the 4 pages. Do you have any idea where this could come from? Best, Mahé

perrette May 13 '14 14:05

I can confirm this bug on Ubuntu 14.04.

johanovic May 29 '14 09:05

I'm on Ubuntu 14.04 and seeing the same error.

ashwin Jun 09 '14 08:06

Same on my Ubuntu 14.04.

xylo Jun 29 '14 17:06

Hi!

Tesseract version 3.03 no longer uses the .html extension for hOCR files; it uses .hocr instead. To fix this, you can edit the source code in pdfocr.rb, line 336, to look like this:

sh "tesseract", "-l", language, basefn+'.ppm', basefn, "hocr"

and remove or comment out the next line:

sh "mv", basefn+'.hocr.hocr', basefn+'.hocr'

However, an even better solution is found here: https://github.com/snowboard975/pdfocr/commit/4d274c918346cb56a7a766cef566e6fb4b11171e

So long, hank
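For anyone who needs the script to run against both older and newer Tesseract releases, the rename step could be made tolerant of either extension instead of hard-coding one. The following is only a minimal sketch, not code from pdfocr or from the linked commit. It assumes tesseract was invoked with an output base of basefn+'.hocr', so that the result is named either basefn+'.hocr.html' (older releases) or basefn+'.hocr.hocr' (3.03). The helper name normalize_hocr_output is hypothetical.

```ruby
require "fileutils"

# Hypothetical helper, not part of pdfocr.
# Assumption: tesseract was run with the output base <basefn>.hocr, so its
# hOCR result is either <basefn>.hocr.html or <basefn>.hocr.hocr.
# Renames whichever file exists to <basefn>.hocr and returns that path,
# or returns nil if neither candidate file is present.
def normalize_hocr_output(basefn)
  target = basefn + ".hocr"
  candidates = [target + ".html", target + ".hocr"]
  found = candidates.find { |path| File.exist?(path) }
  return nil if found.nil?

  FileUtils.mv(found, target)
  target
end
```

In pdfocr.rb this could stand in for the sh "mv", ... line, with a nil return treated the same way as the existing "Error while running OCR on page N" case.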

hankschwie Jul 27 '14 16:07

Same on my Ubuntu 14.04.

I wish this issue had a more relevant title... This is an important fix.

mmcraedhcu Nov 05 '14 22:11