videocr
ValueError: invalid literal for int() with base 10
ValueError: 'invalid literal for int() with base 10: '"₪ץ'' (several different words get caught here)
function get_subtitles in api.py at line 11
v.run_ocr(lang, time_start, time_end, conf_threshold, use_fullframe)
function run_ocr in video.py at line 52
for i, data in enumerate(it_ocr)
function
Same problem here with Python 3.8 and tesseract-ocr-w64-v5.0.0-alpha.20201127.
I also had this same problem; it seemed to happen only when reading Hebrew. Could it be a right-to-left thing?
@HarryRudolph Yes, I think it's a problem with right-to-left languages.
Error Log :
ValueError: invalid literal for int() with base 10: 'ارره'
That ارره is a Persian word, so it seems there is a problem with RTL languages.
Code :
print(get_subtitles('video.mp4', lang='fas', sim_threshold=70, conf_threshold=65))
Debug output from models.py after adding a print of word_data:
    word_data = l.split()
    print(word_data)  # <-- this line added
    if len(word_data) < 12:
These are the last lines printed before the error:
['4', '1', '1', '1', '2', '0', '607', '76', '111', '74', '-1']
['5', '1', '1', '1', '2', '1', '607', '76', '111', '97', '20', '4']
['4', '1', '1', '1', '3', '0', '217', '169', '486', '71', '-1']
['5', '1', '1', '1', '3', '1', '306', '162', '212', '78', '1', 'لارنج']
['5', '1', '1', '1', '3', '2', '191', '189', '100', '51', '0', 'ارره', '\u200f']
The program crashes when word_data has 13 columns instead of 12. So I added a skip for rows with more than 12 columns:
    if len(word_data) > 12:
        continue
The program then runs to the end, but the final result is only half a line.
In models.py, replace line 32:
    block_num, conf = int(block_num), int(conf)
with:
    block_num, conf = int(block_num), int(float(conf))
The issue is that conf is a string holding a float value, which int() is not able to convert. With float(conf), the string is first converted into a float, which int() can then convert to an int.
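A minimal sketch of that difference, outside of videocr. The sample value is an assumption for illustration, not taken from a real Tesseract run:

```python
conf = "96.276688"  # assumed sample value, not taken from a real run

try:
    int(conf)
    direct_ok = True
except ValueError:
    direct_ok = False  # int() rejects strings that contain a decimal point

# Parse as float first, then truncate toward zero.
converted = int(float(conf))
print(direct_ok, converted)
```

Note that int() truncates rather than rounds, so a confidence of 96.9 would also become 96.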
@PlaylistsTrance Your solution leads to this error:
block_num, conf = int(block_num), int(float(conf))
ValueError: could not convert string to float: 'שם'
It seems that for some reason the OCRed text is being stored in conf? I am assuming this is incorrect and that conf should be storing an integer/float representing percentage confidence.
Maybe the assignment on line 31 of models.py is getting confused by the right-to-left text?
_, _, block_num, *_, conf, text = word_data
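To check this guess, the same unpacking can be run on the two rows from the debug output above. This is only a sketch; the function name unpack is made up here and is not part of videocr:

```python
def unpack(word_data):
    # Same star-unpacking as in models.py: the last two items become conf and text.
    _, _, block_num, *_, conf, text = word_data
    return block_num, conf, text

row_12 = ['5', '1', '1', '1', '3', '1', '306', '162', '212', '78', '1', 'لارنج']
row_13 = ['5', '1', '1', '1', '3', '2', '191', '189', '100', '51', '0', 'ارره', '\u200f']

print(unpack(row_12))  # conf is the numeric string '1'
print(unpack(row_13))  # conf is now the word, text is the trailing '\u200f'
```

With 13 items the word lands in conf, so int(conf) fails no matter how the number itself is formatted.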
@HarryRudolph I've checked the parameters returned by Tesseract and it seems there are just these two problems:
Problem 1:
For RTL languages we get one more parameter that indicates RTL (the trailing '\u200f' in the log above), so some word_data rows have 13 parameters instead of 12.
Adding the lines below after
    if len(word_data) < 12:  # no word is predicted
        continue
will solve this:
    if len(word_data) == 13:
        _, _, block_num, *_, conf, text, _ = word_data
    else:
        _, _, block_num, *_, conf, text = word_data
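An alternative that does not depend on the row being exactly 12 or 13 items long is to read the columns by position. This is a rough sketch, assuming the usual Tesseract TSV column order (block_num at index 2, conf at index 10, text from index 11 on); parse_word_row and BIDI_MARKS are hypothetical names, not videocr code:

```python
# Hypothetical helper, not part of videocr: parse one whitespace-split
# row of Tesseract TSV output by column position.
BIDI_MARKS = "\u200e\u200f"  # left-to-right and right-to-left marks

def parse_word_row(line):
    fields = line.split()
    if len(fields) < 12:  # no word is predicted
        return None
    block_num = int(fields[2])
    conf = int(float(fields[10]))  # conf may be printed as a float
    # Keep every token after conf, dropping ones made only of bidi marks.
    words = [tok for tok in fields[11:] if tok.strip(BIDI_MARKS)]
    return block_num, conf, " ".join(words)

print(parse_word_row("5 1 1 1 3 2 191 189 100 51 0 ارره \u200f"))
```

Only the two directional marks are dropped, so other invisible characters that matter in Persian text, such as the zero-width non-joiner, are left alone.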
Problem 2:
Some lines have a confidence value that is a float formatted as a string, which triggers the error invalid literal for int() with base 10.
To solve this I added a helper (is_float) after __init__ that checks whether conf can be parsed as a float:
    def is_float(value):
        # True if value can be parsed as a float, e.g. "96.5" or "-1"
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False
Then replace
    block_num, conf = int(block_num), int(conf)
with the code below:
    if is_float(conf):
        block_num, conf = int(block_num), int(float(conf))
    else:
        block_num, conf = int(block_num), int(conf)
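As far as I can tell, the else branch still raises ValueError, because any string that float() rejects is also rejected by int(). A possible single-helper variant is sketched below; parse_conf is a hypothetical name, and returning None for non-numeric input is an assumption about what the caller wants:

```python
def parse_conf(conf):
    """Return conf as an int, or None if it is not a numeric string."""
    try:
        return int(float(conf))
    except ValueError:
        return None

print(parse_conf("65"), parse_conf("96.276688"), parse_conf("שם"))
```

The caller would then have to decide what to do with None, for example skip the row.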
Result: The program runs without any error, but I've only tested this with Arabic/Persian, and Tesseract doesn't seem to OCR them well, so the result is not what I want. Please test it on other languages such as Hebrew and report back.