sanskrit-ocr
An OCR for classical Sanskrit document images
Note: This branch contains code for IndicOCR-v2. For IndicOCR-v1, kindly visit this branch.
This repository contains code for various OCR models for classical Sanskrit document images. For a quick overview of how to get IndicOCR and CNN-RNN up and running, kindly continue reading this README. For more detailed instructions, visit our Wiki page.
The IndicOCR and CNN-RNN models are best run on a GPU.
Please cite our paper if you use this work in your research.
@InProceedings{Dwivedi_2020_CVPR_Workshops,
author = {Dwivedi, Agam and Saluja, Rohit and Kiran Sarvadevabhatla, Ravi},
title = {An OCR for Classical Indic Documents Containing Arbitrarily Long Words},
booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2020}
}
Results:
The following table compares the IndicOCR-v2 model with other state-of-the-art models.
| Row | Dataset | Model | Training Config | CER (%) | WER (%) |
|---|---|---|---|---|---|
| 1 | new | IndicOCR-v2 | C3: mix training + real finetune | 3.86 | 13.86 |
| 2 | new | IndicOCR-v2 | C1: mix training | 4.77 | 16.84 |
| 3 | new | CNN-RNN | C3: mix training + real finetune | 3.77 | 14.38 |
| 4 | new | CNN-RNN | C1: mix training | 3.67 | 13.86 |
| 5 | new | Google-OCR | -- | 6.95 | 34.64 |
| 6 | new | Ind.senz | -- | 20.55 | 57.92 |
| 7 | new | Tesseract (Devanagari) | -- | 13.23 | 52.75 |
| 8 | new | Tesseract (Sanskrit) | -- | 21.06 | 62.34 |
IndicOCR-v2:
Details:
The code is written in the TensorFlow framework.
Pre-Trained Models:
To download the best IndicOCR-v2 model, kindly visit this page.
Setup:
In the model/attention-lstm directory, run the following commands:
conda create -n indicOCR python=3.6.10
conda activate indicOCR
conda install pip
pip install -r requirements.txt
Installation:
To install the aocr (attention-ocr) library, from the model/attention-lstm directory, run:
python setup.py install
tfrecords creation:
Make sure to have/create a .txt file in which every line has the following format:
path/to/image<space>annotation
ex: /user/sanskrit-ocr/datasets/train/1.jpg I am the annotated text
aocr dataset /path/to/txt/file/ /path/to/data.tfrecords
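For example, with illustrative paths (adjust these to your own layout):
ex: aocr dataset ./datasets/train/annotations.txt ./datasets/train/data.tfrecords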
Train:
To train on the data.tfrecords file created as described above, run the following command:
CUDA_VISIBLE_DEVICES=0 aocr train /path/to/tfrecords/file --batch-size <batch-size> --max-width <max-width> --max-height <max-height> --max-prediction <max-predicted-label-length> --num-epoch <num-epoch>
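For example, with illustrative hyperparameter values (these are placeholders, not the paper's settings, and should be tuned to your data):
ex: CUDA_VISIBLE_DEVICES=0 aocr train ./datasets/train/data.tfrecords --batch-size 32 --max-width 1600 --max-height 150 --max-prediction 100 --num-epoch 100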
Validate:
To validate multiple checkpoints, run:
python ./model/evaluate/attention_predictions.py <initial_ckpt_no> <final_ckpt_step> <steps_per_checkpoint>
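For instance, to evaluate every checkpoint saved at 500-step intervals between steps 500 and 10000 (illustrative numbers):
ex: python ./model/evaluate/attention_predictions.py 500 10000 500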
This will create a val_preds.txt file in the model/attention-lstm/logs folder.
Test:
To test a single checkpoint, run the following command:
CUDA_VISIBLE_DEVICES=0 aocr test /path/to/test.tfrecords --batch-size <batch-size> --max-width <max-width> --max-height <max-height> --max-prediction <max-predicted-label-length> --model-dir ./modelss
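For example, reusing the illustrative values from the training command above:
ex: CUDA_VISIBLE_DEVICES=0 aocr test ./datasets/test/data.tfrecords --batch-size 32 --max-width 1600 --max-height 150 --max-prediction 100 --model-dir ./modelss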
Note: If you want to test multiple checkpoints that are evenly spaced (by step number), use the method described in the validation section.
Computing Error Rates:
To compute the CER and WER of the predictions, run the following command:
python ./model/evaluate/get_errorrates.py <predicted_file_name>
ex: python model/evaluate/get_errorrates.py val_preds.txt
The error rates will be written to a file output.json in the visualize directory.
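For reference, CER and WER are conventionally computed as edit distance divided by reference length. Below is a minimal, self-contained sketch of that standard computation; it is illustrative only, and the repository's get_errorrates.py remains the authoritative implementation (its exact normalization may differ):

```python
# Illustrative CER/WER computation via Levenshtein distance.
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(pred, truth):
    """Character error rate: char-level edit distance over reference length."""
    return levenshtein(pred, truth) / max(len(truth), 1)

def wer(pred, truth):
    """Word error rate: word-level edit distance over reference word count."""
    p, t = pred.split(), truth.split()
    return levenshtein(p, t) / max(len(t), 1)

# Toy example with transliterated placeholder text:
print(cer("raama", "rama"), wer("raama nama", "rama nama"))
```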
CNN-RNN:
Details:
The code is written in the TensorFlow framework.
Pre-Trained Models:
To download the best CNN-RNN model, kindly visit this page.
Setup:
In the model/CNN-RNN directory, run the following commands:
conda create -n crnn python=3.6.10
conda activate crnn
conda install pip
pip install -r requirements.txt
tfrecords creation:
Make sure to have/create a .txt file in which every line has the following format:
path/to/image<space>annotation
ex: /user/sanskrit-ocr/datasets/train/1.jpg I am the annotated text
python model/CRNN/create_tfrecords.py /path/to/.txt/file ./model/CRNN/data/tfReal/data.tfrecords
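For example, with an illustrative annotation-file path:
ex: python model/CRNN/create_tfrecords.py ./datasets/train/annotations.txt ./model/CRNN/data/tfReal/data.tfrecords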
Train:
To train on the data.tfrecords file created as described above, run the following command:
python model/CRNN/train.py <training tfrecords filename> <train_epochs> <path_to_previous_saved_model> <steps-per_checkpoint>
ex: python ./model/CRNN/train.py train_feature.tfrecords 20 model/CRNN/model/shadownet/shadownet_-40 200
Note: If you are training from scratch, just set the <path_to_previous_saved_model> argument to 0.
ex: python model/CRNN/train.py data.tfrecords 100 0 <steps-per_checkpoint>
Validate:
To validate multiple checkpoints, run:
python ./model/evaluate/crnn_predictions.py <tfrecords_file_name> <initial_step> <final_step> <steps_per_checkpoint> <out_file>
This will create the specified out_file in the model/CRNN/logs folder.
Note: the tfrecords_file_name should be given relative to the model/CRNN/data/tfReal/ directory.
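For instance, to evaluate checkpoints saved every 200 steps between steps 200 and 4000 (illustrative values, with data.tfrecords resolved relative to model/CRNN/data/tfReal/):
ex: python ./model/evaluate/crnn_predictions.py data.tfrecords 200 4000 200 val_preds.txt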
Test:
Same as the validation procedure above.
Computing Error Rates:
To compute the CER and WER of the predictions, run the following command:
Validation:
python model/evaluate/get_errorrates_crnn.py <path_to_predicted_file>
Test:
python model/evaluate/get_errorrates_crnn.py <path_to_predicted_file>
ex: python model/evaluate/get_errorrates_crnn.py model/CRNN/logs/test_preds_final.txt
Creating Synthetic Data, Obtaining Results for Tesseract and Google-OCR, etc.:
Visit our Wiki page.
Other Analysis:
WA-ECR Plot:
To gain better insight into performance, we compute the word-averaged erroneous character rate (WA-ECR) for each word length L, defined as:
WA-ECR(L) = E_L / N_L
Where:
- E_L: number of erroneous characters across all words of length L
- N_L: number of words of length L in the test set
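This quantity is straightforward to reproduce. Here is a minimal sketch, assuming the per-word error count is the character-level edit distance and that predictions and ground truths come as aligned word lists; the helper names are hypothetical, not from the repository:

```python
# Illustrative WA-ECR computation over aligned prediction/ground-truth words.
from collections import defaultdict

def edit_distance(a, b):
    """Character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def wa_ecr(pred_words, gt_words):
    """Return {word length L: E_L / N_L} as defined above."""
    E = defaultdict(int)  # E_L: erroneous characters over all words of length L
    N = defaultdict(int)  # N_L: number of words of length L in the test set
    for pred, gt in zip(pred_words, gt_words):
        E[len(gt)] += edit_distance(pred, gt)
        N[len(gt)] += 1
    return {L: E[L] / N[L] for L in sorted(N)}

# Toy example:
print(wa_ecr(["rama", "siita"], ["raama", "sita"]))  # {4: 1.0, 5: 1.0}
```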
Figure: Distribution of the word-averaged erroneous character rate (WA-ECR) as a function of word length, for different models. The lower the WA-ECR, the better. A histogram of test-word lengths is also shown in the plot (red dots, log scale).
Sample Results:
Figure: Qualitative results for different models. Errors relative to the ground truth are highlighted in red. Blue highlighting indicates text missing from at least one of the OCRs; a larger amount of blue within a line for an OCR indicates better coverage relative to the other OCRs, while less red indicates fewer errors.