Tesseract gets a segmentation fault (core dumped) when creating TIFF images.
I want to create TIFF images for 55 fonts using this code:
rm -rf train/*
tesstrain.sh --fonts_dir font \
--lang fas \
--noextract_font_properties --linedata_only \
--langdata_dir langdata_lstm \
--tessdata_dir tesseract/tessdata \
--save_box_tiff \
--maxpages 500 \
--fontlist \
"IRAban, Regular" \
"IRHoma, Regular" \
"IRNarges, Regular" \
"IRTerafik, Bold" \
"IRAmir, Regular" \
"IRJadid, Regular" \
"IRNaskh, Regular" \
"IRTerafik, Italic" \
"IRArshia, Regular" \
"IRKamran, Regular" \
"IRNazanin, Bold" \
"IRTerafik, Regular" \
"IRBadr, Bold" \
"IRKhorasan, Regular" \
"IRNazanin, Italic" \
"IRTitr, Regular" \
"IRBadr, Italic" \
"IRKoodak, Regular" \
"IRNazanin, Regular" \
"IRYakout, Bold" \
"IRBadr, Regular" \
"IRLotus, Bold" \
"IRNazli, Bold" \
"IRYakout, Italic" \
"IRCompset, Bold" \
"IRLotus, Italic" \
"IRNazli, Regular" \
"IRYakout, Regular" \
"IRCompset, Italic" \
"IRLotus, Regular" \
"IRPooya, Regular" \
"IRYekan, Bold" \
"IRCompset, Regular" \
"IRMaryam, Regular" \
"IRRoya, Bold" \
"IRYekan, Regular" \
"IRDast Nevis, Regular" \
"IRMashhad, Regular" \
"IRRoya, Italic" \
"IRZar, Bold" \
"IRDavat, Regular" \
"IRMehr, Regular" \
"IRRoya, Regular" \
"IRZar, Italic" \
"IRElham, Regular" \
"IRMitra, Bold" \
"IRShiraz, Regular" \
"IRZar, Regular" \
"IREntezar, Regular" \
"IRMitra, Italic" \
"IRSina, Regular" \
"IRZeytoon, Regular" \
"IRFarnaz, Regular" \
"IRMitra, Regular" \
"IRTabassom, Regular" \
--output_dir trainforB
but every time I get a segmentation fault (core dumped) during the process, with random error messages, for example "double linked list fault" or "double free or corruption (!prev)".
This is on an Intel Xeon X5670 @ 2.93 GHz (24 cores) with around 50 GB of RAM. I checked CPU and RAM usage and everything looks OK, but some CPU cores go to 100% and then this happens. I should mention that when I use Google Colab everything is OK, but it takes too much time because of the limited resources.
Environment
- Tesseract Version: 4.1.1
- Platform: Ubuntu 20.04.4 LTS
Current Behavior:
Core dump when creating TIFF images for new fonts
Expected Behavior:
TIFF image creation for new fonts completes
Suggested Fix: None
Please use a recent version of Tesseract (5.2) and a recent version of the training tools. The old version (including "tesstrain.sh") is not supported due to lack of resources.
I installed the latest version and the latest training tools with tesstrain.py, but got the same error.
tesseract --version
tesseract 5.2.0-13-g74e22
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
 Found libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
and
text2image --version
Using CAIRO_FONT_TYPE_FT.
Pango version: 1.44.7
5.2.0-13-g74e22
Please provide all the files necessary for reproducing the problem, plus the whole log (not a screenshot) of the training process.
Since I don't have permission to share the data publicly, I made a private repository and shared it with you, including the files required to reproduce the problem.
(Note that I ran combine_tessdata -e fas.traineddata fas.lstm before starting generation.)
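For context, an extracted .lstm file like this is typically used later as the starting checkpoint for fine-tuning with lstmtraining. A minimal sketch of that step, assuming placeholder paths (tesseract/tessdata/fas.traineddata, trainforB/fas.training_files.txt, output/fas) and an arbitrary iteration count; only the combine_tessdata step above is taken from this thread:
# fine-tune starting from the extracted Persian model (all paths are placeholders)
lstmtraining \
  --continue_from fas.lstm \
  --traineddata tesseract/tessdata/fas.traineddata \
  --train_listfile trainforB/fas.training_files.txt \
  --model_output output/fas \
  --max_iterations 3000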
Could you reproduce the problem?
No. Simply put, I do not have the time. Anyway, Tesseract is open source, so you should be able to replicate the problem with open-source/free-to-use data, so that more testers can join in finding the problem.
I found that when I generate training files for one font it does not give me any error, but with two or more fonts I get the core dumped error. Log of the error: tesstrain.log. Fonts: fonts.zip
I found the solution :D My training_text was too big; I reduced it to around 200 KB. But my question is: does this affect the accuracy?
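One way to take such a subset (a sketch; the file name fas.training_text and the line count are assumptions, and cutting by whole lines avoids splitting a sentence or a multi-byte character):
# keep the first few thousand lines; adjust the count until the result is ~200 KB
head -n 3000 fas.training_text > fas.training_text.small
ls -lh fas.training_text.small   # check the resulting size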
I am not sure if this is a real problem. I see your training_text size is 12 MB; the example training text is 26 MB... So maybe there is a lack of resources. It would be great if you could narrow down your problem, e.g. check the line length or find the block of text that causes the crash (train with one font only to speed up the process).
I'm working on it. As you can see, the sentences in my file are too long; sometimes there is a newline only after 30 words in a sentence. I'm searching for a script to insert a newline after every 5 or 6 words. I know I could do that in Python, but a plain for loop is not practical on a 1-2 GB text file, so I'm looking for a better solution.
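One streaming option that avoids loading the whole file into memory is awk. A sketch, assuming plain UTF-8 text with space-separated words and a placeholder file name; it rewraps the text to at most 10 words per line (blank lines are dropped):
# print a newline after every 10th word, or at the end of an input line
awk '{ for (i = 1; i <= NF; i++) printf "%s%s", $i, (i % 10 == 0 || i == NF) ? "\n" : " " }' \
  fas.training_text > fas.training_text.wrapped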
I think this issue is a duplicate of #2860.
--maxpages 500
Try changing it to a much smaller value.
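For example, replacing only that flag in the command above (the value 50 here is just an illustration, not a recommended setting from this thread):
--maxpages 50 \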
As you can see my sentences in file are too long, sometimes we have new line after 30 words in a sentence.
Yes, the text line should not be too long. Try to limit it to 10-12 words.
1 or 2 GB text
??? Is there any rationale for using such a big file when Google's guidance for training is 26 MB of text?