Vedic Sanskrit Traineddata for 4.0
- See https://github.com/Shreeshrii/imagessan/tree/master/groundtruthimages for the images used for testing.
san.traineddata in this repo (4.0 alpha version) gives the following accuracy for the above sample of images:
| Character/Word Error Rate | % |
|---|---|
| CER | 11.55 |
| WER | 8.85 |
| WER (order independent) | 7.44 |
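For readers unfamiliar with these metrics, here is a minimal sketch of how CER, WER, and an order-independent WER could be computed. It is not the tool that produced the numbers in this thread. The function names are hypothetical, and the bag-of-words definition of the order-independent variant is an assumption; the evaluation tool may define it differently.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, counted over Unicode code points."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate in percent, over whitespace-separated words."""
    r, h = ref.split(), hyp.split()
    return 100.0 * edit_distance(r, h) / len(r)


def wer_order_independent(ref, hyp):
    """Assumed definition: compare the two texts as bags of words."""
    r, h = Counter(ref.split()), Counter(hyp.split())
    matched = sum((r & h).values())  # size of the multiset intersection
    n_ref, n_hyp = sum(r.values()), sum(h.values())
    return 100.0 * (max(n_ref, n_hyp) - matched) / n_ref
```

Note that this CER counts individual code points, so in Devanagari a wrong vowel sign or accent mark counts as one error even though it is part of a larger visible syllable.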
Accuracy improves after further training; the results below use https://github.com/Shreeshrii/tess4training/blob/a09bbe913b25b0e623f1bc267a60b192d8e0ccc6/san.traineddata
| Character/Word Error Rate | % |
|---|---|
| CER | 3.73 |
| WER | 4.71 |
| WER (order independent) | 3.82 |
I hope the accuracy of the newly trained traineddata will improve further as the training converges.
Eval Report at https://shreeshrii.github.io/tess4eval-san/
Update:
Further training actually led to lower accuracy on this sample set.
| Character/Word Error Rate | % |
|---|---|
| CER | 6.65 |
| WER | 9.43 |
| WER (order independent) | 7.94 |
I stopped training at https://github.com/Shreeshrii/tess4training/blob/8dc4f8488e74d4a934168844932c8a526d70c1d9/bihtune.traineddata
I separately trained for Vedic Sanskrit using text from Rigveda.
The source files are at https://github.com/Shreeshrii/tess4training-vedic
The resulting traineddata files are:
- https://github.com/Shreeshrii/tess4training/blob/master/docs/vedic.traineddata
- https://github.com/Shreeshrii/tess4train/blob/master/docs/rig.traineddata
The accuracy on a sample page (https://github.com/Shreeshrii/tess4training/blob/master/scanned.tulasi.exp0.tif) when I stopped training is shown in:
- https://github.com/Shreeshrii/tess4training/blob/master/docs/BRH-test-vedic_out.html
- https://github.com/Shreeshrii/tess4train/blob/master/docs/BRH-test-rig_out.html
A special thank you to the Travis team for providing the resources for training.
A user who tested the above with a sample of 30 pages reported an accuracy of 90%.
@theraysmith I hope that the new version of Sanskrit traineddata will be able to OCR both Classical Sanskrit and Vedic Sanskrit. When should we expect the new (4.0.0beta) version of traineddata files?
Samples of Samaveda Sanskrit text, which uses a different set of Vedic accents from the Rigveda sample above. I have not tried training for this.
http://sanskrit.safire.com/image/SamaVeda.gif
http://sanskritweb.net/samaveda/sample.gif
http://vedicreserve.mum.edu/sama_veda/sama_veda.pdf
Update:
Unicode Samaveda text is available from http://www.parankusa.org/SamaBrowse.aspx
How do I train for Vedic Sanskrit? Please help me with a step-by-step process.
@ksdmahesh What kind of text do you want to train for? You will need a large text corpus in UTF-8 format and Unicode fonts that render it correctly.
OK, thank you. After collecting the data, how do I start?
Sorry for the delay in replying.
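As a rough sanity check on a corpus before training, something like the sketch below could confirm that the file decodes as UTF-8 and show how many characters fall in the Unicode blocks relevant to Vedic text. The function name and the choice of blocks are my own assumptions for illustration. It inspects only the text; it does not check whether a given font renders those characters.

```python
from collections import Counter

# Unicode blocks that typically appear in accented Vedic Sanskrit text.
BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Vedic Extensions": (0x1CD0, 0x1CFF),
    "Devanagari Extended": (0xA8E0, 0xA8FF),
}


def block_counts(raw: bytes) -> Counter:
    """Count characters per block; raises UnicodeDecodeError on invalid UTF-8."""
    text = raw.decode("utf-8")
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in BLOCKS.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    return counts


if __name__ == "__main__":
    # Hypothetical sample: two Devanagari code points, one Vedic Extensions
    # code point, and one Devanagari Extended code point.
    sample = "\u0905\u0952\u1CDA\uA8E0".encode("utf-8")
    print(block_counts(sample))
```

If the Vedic Extensions or Devanagari Extended counts are non-zero, the training fonts need to cover those blocks as well as the basic Devanagari block.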
See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 for the details about training.
I am doing a test training run using portions of online texts of the Rigveda and Yajurveda. (I had deleted the GitHub repos referenced in earlier messages in this thread.) I will share the traineddata file when done.
@Shreeshrii, good work. How close are you to finishing the Rigveda and Yajurveda models? Can you share the traineddata files? Regards, Bohar
@Shreeshrii please share the traineddata file for Vedic pandulipi Sanskrit