tesstrain
[Python] [Pytesseract] [Urdu] [Segmentation fault] [Deserialize header failed]
Hi All,
I'm having trouble running fine-tuning with this repository. Below is the code I run in my Jupyter notebook:
**Step-1:**
!git clone https://github.com/tesseract-ocr/tesstrain.git
**Step-2:**
%cd tesstrain
!make tesseract-langdata
**Step-3:**
import zipfile
with zipfile.ZipFile('/content/tesstrain/irt-ground-truth.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/tesstrain/data')
**Step-4:**
# Create the directory 'usr/share/tessdata'
!mkdir -p usr/share/tessdata
# Download the trained data file and save it to 'usr/share/tessdata'
!wget -P usr/share/tessdata https://github.com/tesseract-ocr/tessdata_best/raw/main/urd.traineddata
**Step-5:**
# Quote the version specifiers; otherwise the shell treats ">" as an output redirect
!pip install "Pillow>=6.2.1"
!pip install "python-bidi>=0.4"
!pip install matplotlib
!pip install pandas
!pip install pytesseract
!apt-get install tesseract-ocr-urd
!apt-get install tesseract-ocr
!make leptonica tesseract
**Step-6:** I replaced the file /content/tesstrain/data/irt/list.train with my own file, which contains the text below:
/content/tesstrain/data/irt-ground-truth/page_10_line_1.png نقش فریادی ہے کس کی شوخیٔ تحریر کا
/content/tesstrain/data/irt-ground-truth/page_10_line_2.png کاغذی ہے پیرہن ہر پیکر تصویر کا
/content/tesstrain/data/irt-ground-truth/page_10_line_3.png کاو کاو سخت جانی ہائے تنہائی نہ پوچھ
/content/tesstrain/data/irt-ground-truth/page_10_line_4.png صبح کرنا شام کا لانا ہے جوئے شیر کا
/content/tesstrain/data/irt-ground-truth/page_10_line_5.png جذبۂ بے اختیار شوق دیکھا چاہیے
/content/tesstrain/data/irt-ground-truth/page_10_line_6.png سینۂ شمشیر سے باہر ہے دم شمشیر کا
/content/tesstrain/data/irt-ground-truth/page_10_line_7.png آگہی دام شنیدن جس قدر چاہے بچھائے
/content/tesstrain/data/irt-ground-truth/page_10_line_8.png مدعا عنقا ہے اپنے عالم تقریر کا
/content/tesstrain/data/irt-ground-truth/page_10_line_9.png نبسکہ ہوں غالبؔ اسیری میں بھی آتش زیر پا
/content/tesstrain/data/irt-ground-truth/page_10_line_10.png موئے آتش دیدہ ہے حلقہ مری زنجیر کا
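For reference, the tesstrain README describes ground truth as line images with a matching `.gt.txt` transcription of the same base name, placed in `data/<MODEL_NAME>-ground-truth`; `list.train` and `list.eval` are then generated by `make training` rather than written by hand. A minimal Python sketch that writes those transcription files, assuming each entry of the list above is `<image path> <transcription>` and the path contains no spaces (`split_ground_truth` is a hypothetical helper, not part of tesstrain):

```python
from pathlib import Path


def split_ground_truth(lines, out_dir):
    """Write one <image stem>.gt.txt per '<image path> <transcription>' entry.

    Assumption: the image path contains no spaces, so the first space
    separates the path from the transcription.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        image_path, _, text = line.partition(" ")
        if not text:
            continue  # entry has no transcription, nothing to write
        gt_file = out_dir / (Path(image_path).stem + ".gt.txt")
        gt_file.write_text(text.strip() + "\n", encoding="utf-8")
        written.append(gt_file)
    return written


# Possible usage in the notebook (paths taken from the steps above):
# entries = Path("my_list.txt").read_text(encoding="utf-8").splitlines()
# split_ground_truth(entries, "/content/tesstrain/data/irt-ground-truth")
```

Here `my_list.txt` stands for wherever the combined list is saved; the name is only a placeholder.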
**Step-7:**
# Giving Read/Write rights on tesstrain folder
import os
import subprocess
folder_path = '/content/tesstrain'
# Define the chmod command as a list of arguments
chmod_command = ['chmod', '-R', '777', folder_path]
# Execute the chmod command
try:
    subprocess.run(chmod_command, check=True)
    print(f"Permissions changed for {folder_path}")
except subprocess.CalledProcessError as e:
    print(f"Error: {e}")
**Step-8:**
# Run the command below from /content/tesstrain
!make training MODEL_NAME=irt START_MODEL=urd FINETUNE_TYPE=Impact
**Step-8 output:**
You are using make version: 4.3
lstmtraining
--debug_interval 0
--traineddata data/irt/irt.traineddata
--old_traineddata /content/tesstrain/usr/share/tessdata/urd.traineddata
--continue_from data/urd/irt.lstm
--learning_rate 0.0001
--model_output data/irt/checkpoints/irt
--train_listfile data/irt/list.train
--eval_listfile data/irt/list.eval
--max_iterations 10000
--target_error_rate 0.01
Loaded file data/urd/irt.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 129 to 129!
Num (Extended) outputs,weights in Series:
1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys64:64, 20736
Lfx96:96, 61824
Lrx96:96, 74112
Lfx384:384, 738816
Fc129:129, 49665
Total weights = 945313
Previous null char=2 mapped to 128
Continuing from data/urd/irt.lstm
Deserialize header failed: /content/tesstrain/data/irt-ground-truth/page_10_line_1.png نقش فریادی ہے کس کی شوخیٔ تحریر کا
Deserialize header failed: /content/tesstrain/data/irt-ground-truth/page_10_line_2.png کاغذی ہے پیرہن ہر پیکر تصویر کا
Deserialize header failed: /content/tesstrain/data/irt-ground-truth/page_10_line_5.png جذبۂ بے اختیار شوق دیکھا چاہیے
Load of page 0 failed!
Load of images failed!!
make: *** [Makefile:327: data/irt/checkpoints/irt_checkpoint] Segmentation fault (core dumped)
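The `Deserialize header failed` messages echo whole lines of `list.train`, which suggests `lstmtraining` tried to open each line as a training data file. Assuming the list that tesstrain normally generates holds one path to a `.lstmf` file per line, a small Python sketch can flag entries that do not fit (`check_list_file` is a hypothetical helper, not part of tesstrain or Tesseract):

```python
from pathlib import Path


def check_list_file(lines):
    """Return (line number, entry, reason) for entries that do not look like .lstmf paths.

    Assumption: each non-empty line should be a path to an existing .lstmf file.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        entry = raw.strip()
        if not entry:
            continue  # ignore blank lines
        if not entry.endswith(".lstmf"):
            problems.append((number, entry, "not a .lstmf path"))
        elif not Path(entry).is_file():
            problems.append((number, entry, "file not found"))
    return problems


# Possible usage in the notebook:
# lines = Path("data/irt/list.train").read_text(encoding="utf-8").splitlines()
# for number, entry, reason in check_list_file(lines):
#     print(f"line {number}: {reason}: {entry}")
```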
Please help me figure out how to proceed. I'm stuck.
Thank you
How is this related to Python and pytesseract? By the way: GitHub allows formatting code sections as code to improve readability (just use the <> button after marking the corresponding lines).
Also, it seems you are trying to run the training on some hosted platform (Kaggle?). Run it on your local computer instead (Linux/WSL or Mac). Next, do not report problems with your own data: first make sure that training with the example data works (i.e. that you have installed and set up the training environment correctly).
Hi @zdenop,
I'm running it in a Jupyter notebook. I started with a single page that contains only 10 lines.
Hi @stefan6419846,
I'm working in a Jupyter notebook for Python and writing the code in it. I have also made the code more readable, as you suggested.
Thanks
Follow the README instructions: that is the only supported training process, and Jupyter notebooks are not part of it. Otherwise you will not get support and the issue will be closed.