archive-hocr-tools
Non-integer confidences cause error parsing
If a word confidence is not a whole number, parsing it raises a ValueError at line 186 of parse.py:
Traceback (most recent call last):
File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/bin/recode_pdf", line 302, in <module>
res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/./internetarchivepdf/recode.py", line 640, in recode
create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/./internetarchivepdf/recode.py", line 210, in create_tess_textonly_pdf
word_data = hocr_page_to_word_data(hocr_page, font_scaler)
File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/env/lib/python3.9/site-packages/hocr/parse.py", line 186, in hocr_page_to_word_data
conf = int(m.group(1).split()[0])
ValueError: invalid literal for int() with base 10: '0.988'
The offending code is:
conf = int(m.group(1).split()[0])
You can reproduce this in any Python interpreter:
> python3
Python 3.11.6 (main, Oct 2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> print(int("99"))
99
>>> print(int("99.9"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '99.9'
The solution is to convert to float() first:
conf = int(float(m.group(1).split()[0]))
>>> print(int(float("99.9")))
99
>>> print(int(float("99")))
99
Alternatively, the confidence could be kept as a float instead of being truncated to an integer.
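For illustration, keeping the value as a float could look roughly like the sketch below. The regex and the function name parse_word_confidence are my own assumptions for the example, not the actual code in hocr/parse.py; only the .split()[0] step mirrors the line from the traceback.

```python
import re

# Hypothetical pattern for this example; the real one in hocr/parse.py may differ.
WCONF_RE = re.compile(r"x_wconf\s+([^;]+)")


def parse_word_confidence(title):
    """Return the x_wconf value from an hOCR title string as a float, or None."""
    m = WCONF_RE.search(title)
    if m is None:
        return None
    # float() accepts both "99" and "0.988", so neither input raises.
    return float(m.group(1).split()[0])


print(parse_word_confidence("bbox 10 20 30 40; x_wconf 99"))
print(parse_word_confidence("bbox 10 20 30 40; x_wconf 0.988"))
```

One thing to keep in mind with the int(float(...)) variant: int() truncates toward zero, so the '0.988' from the traceback above would become 0. If the producer of the hOCR reports confidences on a 0 to 1 scale, that loses all of the information, which is another argument for keeping the float.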