TextRecognitionDataGenerator
run.py crashes (OSError: unknown file format) when generating dataset for new non-Latin language
I added support for Hebrew with some .ttf fonts and a dictionary. The adjusted run.py and data_generator.py files run and work until they crash. When I put run.py in a for loop, it sometimes works flawlessly (and generates images) and sometimes crashes. Any thoughts?
(text_rec_task) galmoore@Gals-MacBook-Pro:~/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator$ python run_script.py
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 56.40it/s]
Missing modules for handwritten text generation.
args count10
100%|█████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 102.17it/s]
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 67.58it/s]
Missing modules for handwritten text generation.
args count10
100%|█████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 108.27it/s]
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 70.82it/s]
Missing modules for handwritten text generation.
args count10
0%| | 0/10 [00:00<?, ?it/s]multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/data_generator.py", line 23, in generate_from_tuple
cls.generate(*t)
File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/data_generator.py", line 42, in generate
image = computer_text_generator.generate(text, font, text_color, size, orientation, space_width, fit)
File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/computer_text_generator.py", line 7, in generate
return _generate_horizontal_text(text, font, text_color, font_size, space_width, fit)
File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/computer_text_generator.py", line 14, in _generate_horizontal_text
image_font = ImageFont.truetype(font=font, size=font_size)
File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/site-packages/PIL/ImageFont.py", line 280, in truetype
return FreeTypeFont(font, size, index, encoding, layout_engine)
File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/site-packages/PIL/ImageFont.py", line 145, in init
layout_engine=layout_engine)
OSError: unknown file format
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "run.py", line 376, in
main()
File "run.py", line 364, in main
), total=args.count):
File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1005, in iter
for obj in iterable:
File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/multiprocessing/pool.py", line 735, in next
raise value
OSError: unknown file format
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 70.12it/s]
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 44.38it/s]
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 37.21it/s]
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 53.60it/s]
Can you send me the font + dictionary that you are using? I'd be glad to troubleshoot it on my side if I have a test case.
Thanks, much appreciated. You can grab the dictionary + font here (as well as the data_generator.py and run.py scripts I am using). Look forward to hearing from you! https://github.com/GalMoore/text_generation_task
Hey @Belval any chance reproducing this?
Not yet sorry, I just finished my midterm exams so I should be able to test it shortly.
It seems to work on my side with the font and dict you provided?
python3 run.py -c 10 -w 1 -l he
Although the filenames are inverted which would have to be fixed. I think I'll add it to the official repo, this is pretty nice.
So I might have found something: in the he.txt file, there is an empty line which causes run.py to panic because the image has no width. I'll fix that.
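A minimal sketch of the kind of guard described above, assuming the dictionary is a plain UTF-8 text file with one word per line. The function name load_words is hypothetical and is not part of run.py; the idea is only to drop blank lines before any word reaches the image generator.

```python
from pathlib import Path


def load_words(dict_path):
    """Return the non-empty words from a one-word-per-line dictionary file.

    Hypothetical helper for illustration, not the project's own loader.
    """
    words = []
    with Path(dict_path).open(encoding="utf-8") as handle:
        for line in handle:
            word = line.strip()
            # Skip blank or whitespace-only lines: they would be rendered
            # as an image with no width.
            if word:
                words.append(word)
    return words
```

Calling load_words("dicts/he.txt") (path assumed from the -l he option) would then return only usable entries.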
As for the OSError you encounter, I am unable to reproduce it on Ubuntu 18.04 with Python 3.7. I generated over 100 000 samples with multithreading activated (to see if it was a concurrency issue) and I didn't hit anything.
What is your OS? What is your Python version?
Thanks, I've tried on macOS Mojave in an Anaconda environment with Python 3.6.8 and on Ubuntu 18.04 with a similar setup. This is how it looks when I run run.py with the args you used above:
python3 run.py -c 50 -w 1 -l he
Missing modules for handwritten text generation.
args count50
16%|███████ | 8/50 [00:00<00:00, 78.26it/s]multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/data_generator.py", line 23, in generate_from_tuple
cls.generate(*t)
File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/data_generator.py", line 42, in generate
image = computer_text_generator.generate(text, font, text_color, size, orientation, space_width, fit)
File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/computer_text_generator.py", line 7, in generate
return _generate_horizontal_text(text, font, text_color, font_size, space_width, fit)
File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/computer_text_generator.py", line 14, in _generate_horizontal_text
image_font = ImageFont.truetype(font=font, size=font_size)
File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/site-packages/PIL/ImageFont.py", line 280, in truetype
return FreeTypeFont(font, size, index, encoding, layout_engine)
File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/site-packages/PIL/ImageFont.py", line 145, in init
layout_engine=layout_engine)
OSError: unknown file format
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "run.py", line 376, in
I added support for Korean with some .ttf fonts and a dictionary. The adjusted run.py and data_generator.py files also run and work until they crash. Looking forward to hearing from you.
https://github.com/parksunwoo/TextRecognitionDataGenerator/tree/master/TextRecognitionDataGenerator/fonts/ko/
https://github.com/parksunwoo/TextRecognitionDataGenerator/tree/master/TextRecognitionDataGenerator/dicts/ko.txt
This seems to be a very nasty bug with Pillow unfortunately. I'm looking into ways to either fix or circumvent the issue. You can follow the fix progression here: https://github.com/python-pillow/Pillow/issues/3066
It's a temporary fix, but it's quick to create the data,
while true; do python run.py -w 5 -f 64 -l ko; sleep 1; done;
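For anyone who prefers a bounded version of the loop above, here is a rough Python sketch. It assumes that a crashed run exits with a non-zero status (which is what an uncaught Python exception normally produces). The function name run_with_retries and its parameters are made up for illustration. Unlike the shell loop, which keeps generating until interrupted, this stops after the first successful run.

```python
import subprocess
import sys
import time


def run_with_retries(command, max_attempts=5, delay_seconds=1.0):
    """Run `command` until it exits with status 0 or attempts run out.

    Returns the 1-based number of the attempt that succeeded, or None if
    every attempt failed. Hypothetical helper, not part of the project.
    """
    for attempt in range(1, max_attempts + 1):
        completed = subprocess.run(command)
        if completed.returncode == 0:
            return attempt
        # Pause before the next attempt, but not after the last one.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return None
```

For example, run_with_retries([sys.executable, "run.py", "-w", "5", "-f", "64", "-l", "ko"]) would retry the same command as the shell loop up to five times. This only works around the crash; it does not address the underlying cause.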
This command just succeeded for me:
python run.py -w 5 -f 64
However, when I run the command below on Windows,
python run.py -w 5 -f 64 -l ko
I get:
2020-08-07 16:52:00.595597: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-08-07 16:52:00.604293: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
0%| | 0/1000 [00:00<?, ?it/s]
2020-08-07 16:52:06.775092: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-08-07 16:52:06.784047: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Users\owner\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\data_generator.py", line 22, in generate_from_tuple
cls.generate(*t)
File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\data_generator.py", line 40, in generate
image = computer_text_generator.generate(text, font, text_color, size, orientation, space_width, fit)
File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\computer_text_generator.py", line 7, in generate
return _generate_horizontal_text(text, font, text_color, font_size, space_width, fit)
File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\computer_text_generator.py", line 14, in _generate_horizontal_text
image_font = ImageFont.truetype(font=font, size=font_size)
File "C:\Users\owner\AppData\Local\Programs\Python\Python36\lib\site-packages\PIL\ImageFont.py", line 642, in truetype
return freetype(font)
File "C:\Users\owner\AppData\Local\Programs\Python\Python36\lib\site-packages\PIL\ImageFont.py", line 639, in freetype
return FreeTypeFont(font, size, index, encoding, layout_engine)
File "C:\Users\owner\AppData\Local\Programs\Python\Python36\lib\site-packages\PIL\ImageFont.py", line 188, in __init__
font, size, index, encoding, layout_engine=layout_engine
OSError: unknown file format
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "run.py", line 374, in <module>
main()
File "run.py", line 362, in main
), total=args.count):
File "C:\Users\owner\AppData\Local\Programs\Python\Python36\lib\site-packages\tqdm\std.py", line 1104, in __iter__
for obj in iterable:
File "C:\Users\owner\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 735, in next
raise value
OSError: unknown file format
0%|
Does the OSError occur because I am running this code on Windows? Should I run it on Linux instead?
It's a temporary fix, but it's quick to create the data,
while true; do python run.py -w 5 -f 64 -l ko; sleep 1; done;
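Korean (translated): Could you tell me how to fix this? What do I need to do to run that on Windows? Do you mean that the error occurs but the run occasionally succeeds with a small probability, so you run it in a while loop?
I also wrote some code of my own in a Jupyter notebook and ran it as follows: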
import time
while True:
    !run.py -w 5 -f 64 -l ko
    time.sleep(1)
The OSError: unknown file format does not occur, but I get an import cv2 error. Would installing the cv2 library solve this?
Traceback (most recent call last):
File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\run.py", line 14, in <module>
from data_generator import FakeTextDataGenerator
File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\data_generator.py", line 7, in <module>
import background_generator
File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\background_generator.py", line 1, in <module>
import cv2
ModuleNotFoundError: No module named 'cv2'
Hi, I just ran into the same issue. It may be caused by an invalid .ttf file. Please check the .ttf files in the font folder: test with only one .ttf file, and if that does not work, download a trustworthy .ttf file and replace it.
ref : https://github.com/parksunwoo/ocr_kor/issues/11 (korean)
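One quick way to check a font folder, sketched below, is to look at the first four bytes of each file. TrueType and OpenType files start with a small set of well-known signatures, so a file that starts with anything else is a good candidate for the "unknown file format" error. This is only a header check, assuming the fonts live in a flat folder. The function name find_suspect_fonts is made up for illustration, and a file that passes can still be damaged further in, so actually loading it with ImageFont.truetype remains the real test.

```python
from pathlib import Path

# First four bytes of common font containers.
KNOWN_FONT_SIGNATURES = {
    b"\x00\x01\x00\x00",  # TrueType outlines (.ttf)
    b"OTTO",              # OpenType with CFF outlines (.otf)
    b"true",              # legacy Apple TrueType
    b"ttcf",              # TrueType collection (.ttc)
}


def find_suspect_fonts(font_dir):
    """Return sorted names of files in font_dir without a known font signature.

    Hypothetical helper for illustration; it only inspects the file header.
    """
    suspects = []
    for path in sorted(Path(font_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            header = handle.read(4)
        if header not in KNOWN_FONT_SIGNATURES:
            suspects.append(path.name)
    return suspects
```

Running find_suspect_fonts("fonts/ko") (folder name assumed) lists the files worth replacing or removing. Any non-font file sitting in the folder is flagged as well.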
The included fonts do not seem to be valid. I am overdue for testing all the fonts in the repo and making sure that they all work. I'll try to find some bandwidth to do that.
In your Korean message, you seem to be missing the cv2 dependency. You will want to install it.
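To see which dependencies are missing before launching run.py, a small check along these lines may help. It is a sketch, and the function name missing_modules is made up for illustration. It uses importlib.util.find_spec from the standard library, which returns None when a top-level module cannot be found.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]
```

For example, missing_modules(["cv2", "PIL", "tqdm"]) covers the third-party modules that appear in the tracebacks in this thread. If cv2 is reported, note that on PyPI it is provided by the opencv-python package.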