Tesseract gets a segmentation fault (core dumped) when creating TIFF images.
I want to create TIFF images for 55 fonts using this code:
rm -rf train/*
tesstrain.sh --fonts_dir font \
--lang fas \
--noextract_font_properties --linedata_only \
--langdata_dir langdata_lstm \
--tessdata_dir tesseract/tessdata \
--save_box_tiff \
--maxpages 500 \
--fontlist \
"IRAban, Regular" \
"IRHoma, Regular" \
"IRNarges, Regular" \
"IRTerafik, Bold" \
"IRAmir, Regular" \
"IRJadid, Regular" \
"IRNaskh, Regular" \
"IRTerafik, Italic" \
"IRArshia, Regular" \
"IRKamran, Regular" \
"IRNazanin, Bold" \
"IRTerafik, Regular" \
"IRBadr, Bold" \
"IRKhorasan, Regular" \
"IRNazanin, Italic" \
"IRTitr, Regular" \
"IRBadr, Italic" \
"IRKoodak, Regular" \
"IRNazanin, Regular" \
"IRYakout, Bold" \
"IRBadr, Regular" \
"IRLotus, Bold" \
"IRNazli, Bold" \
"IRYakout, Italic" \
"IRCompset, Bold" \
"IRLotus, Italic" \
"IRNazli, Regular" \
"IRYakout, Regular" \
"IRCompset, Italic" \
"IRLotus, Regular" \
"IRPooya, Regular" \
"IRYekan, Bold" \
"IRCompset, Regular" \
"IRMaryam, Regular" \
"IRRoya, Bold" \
"IRYekan, Regular" \
"IRDast Nevis, Regular" \
"IRMashhad, Regular" \
"IRRoya, Italic" \
"IRZar, Bold" \
"IRDavat, Regular" \
"IRMehr, Regular" \
"IRRoya, Regular" \
"IRZar, Italic" \
"IRElham, Regular" \
"IRMitra, Bold" \
"IRShiraz, Regular" \
"IRZar, Regular" \
"IREntezar, Regular" \
"IRMitra, Italic" \
"IRSina, Regular" \
"IRZeytoon, Regular" \
"IRFarnaz, Regular" \
"IRMitra, Regular" \
"IRTabassom, Regular" \
--output_dir trainforB
but every time I get a segmentation fault (core dumped) during the process, with random error messages, for example "double linked list fault" or "double free or corruption (!prev)".
This is on an Intel Xeon X5670 @ 2.93 GHz (24 cores) with around 50 GB of RAM. I checked CPU and RAM usage and everything looks OK, but some CPU cores go to 100% and then this happens. I should mention that when I use Google Colab everything is OK, but it takes too much time because of the limited resources.
Environment
- Tesseract Version: 4.1.1
- Platform: Ubuntu 20.04.4 LTS
Current Behavior:
Core dump when creating TIFF images for new fonts
Expected Behavior:
TIFF image creation for new fonts completes
Suggested Fix: None
Please use a recent version of Tesseract (5.2) and a recent version of the training tools. The old version (including "tesstrain.sh") is not supported due to lack of resources.
I installed the latest version and the latest training tools with tesstrain.py, but got the same error.
tesseract --version
tesseract 5.2.0-13-g74e22
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
 Found libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
and
text2image --version
Using CAIRO_FONT_TYPE_FT.
Pango version: 1.44.7
5.2.0-13-g74e22
Please provide all the files necessary for reproducing the problem, plus the whole log (not a screenshot) of the training process.
Since I don't have permission to share the data publicly, I made a private repository and shared it with you, including the files required to reproduce the problem.
(Note that I ran combine_tessdata -e fas.traineddata fas.lstm before starting generation.)
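For context, an extracted .lstm file like this is typically used later as the starting checkpoint for fine-tuning with lstmtraining. A minimal sketch of that step, assuming placeholder paths (tesseract/tessdata/fas.traineddata, trainforB/fas.training_files.txt, output/fas) and an arbitrary iteration count; only the combine_tessdata step above is taken from this thread:
# fine-tune starting from the extracted Persian model (all paths are placeholders)
lstmtraining \
  --continue_from fas.lstm \
  --traineddata tesseract/tessdata/fas.traineddata \
  --train_listfile trainforB/fas.training_files.txt \
  --model_output output/fas \
  --max_iterations 3000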
Could you reproduce the problem?
No. Simply put, I do not have the time. Anyway, Tesseract is open source, so you should be able to replicate the problem with open-source/free-to-use data, so that more testers can join in finding the problem.
I found that when I generate training files for one font it does not give me any error, but with two or more fonts I get the core dumped error. Log of the error: tesstrain.log. Fonts: fonts.zip
I found the solution :D My training_text was too big; I reduced it to around 200 KB. But my question is: does this affect the accuracy?
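One way to take such a subset (a sketch; the file name fas.training_text and the line count are assumptions, and cutting by whole lines avoids splitting a sentence or a multi-byte character):
# keep the first few thousand lines; adjust the count until the result is ~200 KB
head -n 3000 fas.training_text > fas.training_text.small
ls -lh fas.training_text.small   # check the resulting size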
I am not sure if this is a real problem. I see your training_text size is 12 MB; the example training text is 26 MB... So maybe there is a lack of resources. It would be great if you could narrow down your problem, e.g. check the line length or find the block of text that causes the crash (train with one font only to speed up the process).
I'm working on it. As you can see, the sentences in my file are too long; sometimes there is a newline only after 30 words in a sentence. I'm searching for a script to insert a newline after every 5 or 6 words. I know I could do that in Python, but a plain for loop is not practical on a 1-2 GB text file, so I'm looking for a better solution.
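One streaming option that avoids loading the whole file into memory is awk. A sketch, assuming plain UTF-8 text with space-separated words and a placeholder file name; it rewraps the text to at most 10 words per line (blank lines are dropped):
# print a newline after every 10th word, or at the end of an input line
awk '{ for (i = 1; i <= NF; i++) printf "%s%s", $i, (i % 10 == 0 || i == NF) ? "\n" : " " }' \
  fas.training_text > fas.training_text.wrapped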
I think this issue is a duplicate of #2860.
--maxpages 500
Try changing it to a much smaller value.
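For example, replacing only that flag in the command above (the value 50 here is just an illustration, not a recommended setting from this thread):
--maxpages 50 \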
As you can see my sentences in file are too long, sometimes we have new line after 30 words in a sentence.
Yes, the text line should not be too long. Try to limit it to 10-12 words.
1 or 2 GB text
??? Is there any rationale for using such a big file when Google's guidance for training is 26 MB of text?