TextRecognitionDataGenerator run.py crashes (OSError: unknown file format) when generating dataset for new non-Latin language

I added support for Hebrew with some .ttf fonts and a dictionary. The adjusted run.py and data_generator.py files run and work until they crash. When I call run.py in a for loop, it sometimes works flawlessly (and generates images) and sometimes crashes. Any thoughts?

(text_rec_task) galmoore@Gals-MacBook-Pro:~/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator$ python run_script.py
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 56.40it/s]
Missing modules for handwritten text generation.
args count10
100%|█████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 102.17it/s]
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 67.58it/s]
Missing modules for handwritten text generation.
args count10
100%|█████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 108.27it/s]
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 70.82it/s]
Missing modules for handwritten text generation.
args count10
0%| | 0/10 [00:00<?, ?it/s]multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/data_generator.py", line 23, in generate_from_tuple
    cls.generate(*t)
  File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/data_generator.py", line 42, in generate
    image = computer_text_generator.generate(text, font, text_color, size, orientation, space_width, fit)
  File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/computer_text_generator.py", line 7, in generate
    return _generate_horizontal_text(text, font, text_color, font_size, space_width, fit)
  File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/computer_text_generator.py", line 14, in _generate_horizontal_text
    image_font = ImageFont.truetype(font=font, size=font_size)
  File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/site-packages/PIL/ImageFont.py", line 280, in truetype
    return FreeTypeFont(font, size, index, encoding, layout_engine)
  File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/site-packages/PIL/ImageFont.py", line 145, in __init__
    layout_engine=layout_engine)
OSError: unknown file format
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run.py", line 376, in <module>
    main()
  File "run.py", line 364, in main
    ), total=args.count):
  File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1005, in __iter__
    for obj in iterable:
  File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
OSError: unknown file format
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 70.12it/s]
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 44.38it/s]
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 37.21it/s]
Missing modules for handwritten text generation.
args count10
100%|██████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 53.60it/s]

Jun 10 '19 22:06 GalMoore

Can you send me the font + dictionary that you are using? I'd be glad to troubleshoot it on my side if I have a test case.

Jun 11 '19 02:06 Belval

Thanks, much appreciated. You can grab the dictionary + font here (as well as the data_generator.py and run.py scripts I am using). Look forward to hearing from you! https://github.com/GalMoore/text_generation_task

Jun 11 '19 06:06 GalMoore

Hey @Belval, any chance of reproducing this?

Jun 13 '19 07:06 GalMoore

Not yet, sorry. I just finished my midterm exams, so I should be able to test it shortly.

Jun 13 '19 11:06 Belval

It seems to work on my side with the font and dict you provided?

python3 run.py -c 10 -w 1 -l he

חצאית_5 כסף_7 מוח_2 מתוק_3 סטודנט_6 צפון_4 קינואה_8 רב_1 שכונה_9 שניצל_0

The filenames are inverted, though, which would have to be fixed. I think I'll add it to the official repo; this is pretty nice.

Jun 13 '19 20:06 Belval

So I might have found something: the he.txt file contains an empty line, which causes run.py to panic because the resulting image has no width. I'll fix that.
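A minimal sketch of one way to guard against empty dictionary lines, assuming the dictionary is a UTF-8 text file with one entry per line. `load_dict_words` is a hypothetical helper for illustration, not a function from the repository:

```python
from pathlib import Path


def load_dict_words(dict_path):
    """Return the non-empty lines of a dictionary file, stripped of whitespace.

    Hypothetical helper: assumes one entry per line, UTF-8 encoded.
    """
    text = Path(dict_path).read_text(encoding="utf-8")
    # Skip lines that are empty or whitespace-only, since an empty string
    # would be rendered as an image with no width.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Filtering at load time keeps the rest of the pipeline unchanged, since empty entries never reach the image generator.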

As for the OSError you encounter, I am unable to reproduce it on Ubuntu 18.04 with Python 3.7. I generated over 100 000 samples with multithreading activated (to see if it was a concurrency issue) and I didn't hit anything.

What is your OS? What is your Python version?

Jun 14 '19 13:06 Belval

Thanks, I've tried on macOS Mojave in an Anaconda environment with Python 3.6.8 and on Ubuntu 18.04 with a similar setup. This is how it looks when I run run.py with the args you used above:

python3 run.py -c 50 -w 1 -l he
Missing modules for handwritten text generation.
args count50
16%|███████ | 8/50 [00:00<00:00, 78.26it/s]multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/data_generator.py", line 23, in generate_from_tuple
    cls.generate(*t)
  File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/data_generator.py", line 42, in generate
    image = computer_text_generator.generate(text, font, text_color, size, orientation, space_width, fit)
  File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/computer_text_generator.py", line 7, in generate
    return _generate_horizontal_text(text, font, text_color, font_size, space_width, fit)
  File "/Users/galmoore/anaconda/envs/text_rec_task/TextRecognitionDataGenerator/TextRecognitionDataGenerator/computer_text_generator.py", line 14, in _generate_horizontal_text
    image_font = ImageFont.truetype(font=font, size=font_size)
  File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/site-packages/PIL/ImageFont.py", line 280, in truetype
    return FreeTypeFont(font, size, index, encoding, layout_engine)
  File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/site-packages/PIL/ImageFont.py", line 145, in __init__
    layout_engine=layout_engine)
OSError: unknown file format
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run.py", line 376, in <module>
    main()
  File "run.py", line 364, in main
    ), total=args.count):
  File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1005, in __iter__
    for obj in iterable:
  File "/Users/galmoore/anaconda/envs/text_rec_task/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
OSError: unknown file format

Jun 16 '19 10:06 GalMoore

I added support for Korean with some .ttf fonts and a dictionary. The adjusted run.py and data_generator.py files also run and work until they crash. Looking forward to hearing from you.

https://github.com/parksunwoo/TextRecognitionDataGenerator/tree/master/TextRecognitionDataGenerator/fonts/ko/
https://github.com/parksunwoo/TextRecognitionDataGenerator/tree/master/TextRecognitionDataGenerator/dicts/ko.txt

Jul 29 '19 14:07 parksunwoo

Unfortunately, this seems to be a very nasty bug in Pillow. I'm looking into ways to either fix or circumvent the issue. You can follow the progress of the fix here: https://github.com/python-pillow/Pillow/issues/3066

Jul 29 '19 15:07 Belval

This is only a temporary workaround, but it generates the data quickly by rerunning the script in a loop:

while true; do python run.py -w 5 -f 64 -l ko; sleep 1; done;
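For shells without `while true` (for example the default Windows command prompt), the same idea can be sketched in Python. This is an illustrative wrapper rather than part of the repository: `run_repeatedly` is a hypothetical name, and it runs a bounded number of times instead of forever.

```python
import subprocess
import time


def run_repeatedly(cmd, runs, delay=1.0, runner=subprocess.run):
    """Run `cmd` `runs` times, sleeping `delay` seconds between runs.

    Returns how many runs exited with status 0. `runner` defaults to
    subprocess.run and can be swapped out for testing.
    """
    successes = 0
    for i in range(runs):
        result = runner(cmd)
        if result.returncode == 0:
            successes += 1
        # No need to sleep after the final run.
        if i < runs - 1:
            time.sleep(delay)
    return successes
```

To mirror the shell loop, one could call `run_repeatedly([sys.executable, "run.py", "-w", "5", "-f", "64", "-l", "ko"], runs=20)` after `import sys`. Like the shell version, this only works around the crash by retrying; it does not address the underlying cause.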

Jul 31 '19 13:07 parksunwoo

The following command just succeeded for me:

python run.py -w 5 -f 64

However, when I run the command below on Windows,

python run.py -w 5 -f 64 -l ko

I get:

2020-08-07 16:52:00.595597: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-08-07 16:52:00.604293: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
  0%|          | 0/1000 [00:00<?, ?it/s]
2020-08-07 16:52:06.775092: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-08-07 16:52:06.784047: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\owner\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\data_generator.py", line 22, in generate_from_tuple
    cls.generate(*t)
  File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\data_generator.py", line 40, in generate
    image = computer_text_generator.generate(text, font, text_color, size, orientation, space_width, fit)
  File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\computer_text_generator.py", line 7, in generate
    return _generate_horizontal_text(text, font, text_color, font_size, space_width, fit)
  File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\computer_text_generator.py", line 14, in _generate_horizontal_text
    image_font = ImageFont.truetype(font=font, size=font_size)
  File "C:\Users\owner\AppData\Local\Programs\Python\Python36\lib\site-packages\PIL\ImageFont.py", line 642, in truetype
    return freetype(font)
  File "C:\Users\owner\AppData\Local\Programs\Python\Python36\lib\site-packages\PIL\ImageFont.py", line 639, in freetype
    return FreeTypeFont(font, size, index, encoding, layout_engine)
  File "C:\Users\owner\AppData\Local\Programs\Python\Python36\lib\site-packages\PIL\ImageFont.py", line 188, in __init__
    font, size, index, encoding, layout_engine=layout_engine
OSError: unknown file format
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run.py", line 374, in <module>
    main()
  File "run.py", line 362, in main
    ), total=args.count):
  File "C:\Users\owner\AppData\Local\Programs\Python\Python36\lib\site-packages\tqdm\std.py", line 1104, in __iter__
    for obj in iterable:
  File "C:\Users\owner\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 735, in next
    raise value
OSError: unknown file format
  0%|

Is the OSError caused by running this code on Windows? Should I run it on Linux instead?

Aug 07 '20 07:08 NeighborhoodCoding

It's a temporary fix, but it's quick to create the data,
while true; do python run.py -w 5 -f 64 -l ko; sleep 1; done;

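(Translated from Korean) Could you tell me how to fix this? What do I need to do to run that on Windows? Do you mean that, because it usually fails but succeeds with a small probability, you run it in a while loop?

I also put together some code of my own in a Jupyter notebook and ran it as follows: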

import time
while True: 
    !run.py -w 5 -f 64 -l ko
    time.sleep(1)

This does not raise OSError: unknown file format, but an import cv2 error appears instead. Would installing the cv2 library solve this?

Traceback (most recent call last):
  File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\run.py", line 14, in <module>
    from data_generator import FakeTextDataGenerator
  File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\data_generator.py", line 7, in <module>
    import background_generator
  File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\background_generator.py", line 1, in <module>
    import cv2
ModuleNotFoundError: No module named 'cv2'
Traceback (most recent call last):
  File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\run.py", line 14, in <module>
    from data_generator import FakeTextDataGenerator
  File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\data_generator.py", line 7, in <module>
    import background_generator
  File "D:\ocr_kor-master\ocr_kor-master\data\generator\TextRecognitionDataGenerator\background_generator.py", line 1, in <module>

Aug 07 '20 08:08 NeighborhoodCoding

Hi, I ran into the same issue. It may be caused by an invalid .ttf file. Please check the .ttf files in the font folder: test with only one .ttf file at a time, and if it does not work, download a trustworthy .ttf file and replace it.

Ref: https://github.com/parksunwoo/ocr_kor/issues/11 (Korean)
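Checking the fonts one at a time can be automated. Below is a minimal sketch, assuming the fonts are .ttf files in a single folder and that an unreadable font raises OSError from `ImageFont.truetype`, as in the tracebacks above. `find_invalid_fonts` and `_pillow_loader` are hypothetical helper names, not part of the repository:

```python
from pathlib import Path


def _pillow_loader(path):
    # Imported lazily so the helper can be exercised without Pillow installed.
    from PIL import ImageFont

    # The size is arbitrary; we only care whether the font can be opened.
    ImageFont.truetype(font=str(path), size=32)


def find_invalid_fonts(font_dir, loader=_pillow_loader):
    """Return the names of .ttf files in `font_dir` that `loader` rejects.

    A font counts as invalid if loading it raises OSError.
    """
    invalid = []
    for path in sorted(Path(font_dir).glob("*.ttf")):
        try:
            loader(path)
        except OSError:
            invalid.append(path.name)
    return invalid
```

Running `find_invalid_fonts("fonts/ko")` (the folder path is an assumption) would list the files to replace or remove. Note that the `*.ttf` pattern is case-sensitive on some platforms, so files named `.TTF` may need a second pass.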

Aug 07 '20 08:08 NeighborhoodCoding

The included fonts do not seem to be valid. I am overdue for testing all the fonts in the repo and making sure that they all work. I'll try to find some bandwidth to do that.

In your Korean message, you seem to be missing the cv2 dependency. You will want to install it.

Aug 10 '20 13:08 Belval
