archive-hocr-tools

Non-integer confidences cause error parsing

Open whikloj opened this issue 1 year ago • 1 comments

If a confidence value is not a whole number, parsing it raises a ValueError at line 186 of parse.py:

Traceback (most recent call last):
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/bin/recode_pdf", line 302, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/./internetarchivepdf/recode.py", line 640, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/./internetarchivepdf/recode.py", line 210, in create_tess_textonly_pdf
    word_data = hocr_page_to_word_data(hocr_page, font_scaler)
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/env/lib/python3.9/site-packages/hocr/parse.py", line 186, in hocr_page_to_word_data
    conf = int(m.group(1).split()[0])
ValueError: invalid literal for int() with base 10: '0.988'

The offending code is:

conf = int(m.group(1).split()[0])

You can reproduce this in any Python interpreter:

> python3 
Python 3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> print(int("99"))
99
>>> print(int("99.9"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '99.9'

One solution is to convert to float() first:

conf = int(float(m.group(1).split()[0]))
>>> print(int(float("99.9")))
99
>>> print(int(float("99")))
99

Alternatively, the confidence could be kept as a float instead of being converted to an integer.
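To compare the two options, here is a minimal sketch. The `parse_wconf` helper and the `WCONF_RE` pattern are hypothetical and only illustrate the idea; the regex actually used in hocr/parse.py may differ. Note that int() truncates toward zero, so a confidence written as a fraction, like the '0.988' from the traceback, becomes 0 when forced to an integer, which may be a reason to prefer keeping the float.

```python
import re

# Hypothetical pattern for the x_wconf property in an hOCR title attribute;
# the regex actually used in hocr/parse.py may differ.
WCONF_RE = re.compile(r"x_wconf\s+([^;]+)")


def parse_wconf(title, as_int=True):
    """Return the x_wconf value from an hOCR title string, or None if absent."""
    m = WCONF_RE.search(title)
    if m is None:
        return None
    # Going through float() accepts both "95" and "0.988".
    value = float(m.group(1).split()[0])
    # int() truncates toward zero: 99.9 -> 99, 0.988 -> 0.
    return int(value) if as_int else value


print(parse_wconf("bbox 0 0 10 10; x_wconf 95"))                   # 95
print(parse_wconf("bbox 0 0 10 10; x_wconf 0.988"))                # 0
print(parse_wconf("bbox 0 0 10 10; x_wconf 0.988", as_int=False))  # 0.988
```

Whether the integer or the float is the better choice depends on how downstream code uses the value, which this sketch does not assume.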

whikloj, Feb 01 '24 21:02