tessdata Vedic Sanskrit Traineddata for 4.0

See https://github.com/Shreeshrii/imagessan/tree/master/groundtruthimages for the images used for testing.

san.traineddata in this repo (4.0 alpha version) gives the following accuracy for the above sample of images:

Character/Word Error Rate	%
CER	11.55
WER	8.85
WER (order independent)	7.44
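
For readers unfamiliar with these metrics, here is a minimal sketch of how CER and WER are commonly computed from edit distance. It is not the tool used to produce the numbers above. The function names are illustrative, and the "order independent" variant is an assumption (a bag-of-words comparison); the evaluation tool's exact definition may differ. CER here counts Unicode code points, so Devanagari combining marks and Vedic accents each count as one character.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent (assumes a non-empty reference)."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate in percent, splitting words on whitespace."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)


def wer_order_independent(ref, hyp):
    """Assumed definition: share of reference words missing from the
    OCR output when both are treated as unordered bags of words."""
    ref_counts, hyp_counts = Counter(ref.split()), Counter(hyp.split())
    missed = sum((ref_counts - hyp_counts).values())
    return 100.0 * missed / sum(ref_counts.values())
```

With this sketch, a page whose words are all recognized but emitted in a different order scores 0 on the order-independent measure while the plain WER stays above 0, which is why the two figures are reported separately.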

Accuracy improved after further training. The following error rates were obtained with https://github.com/Shreeshrii/tess4training/blob/a09bbe913b25b0e623f1bc267a60b192d8e0ccc6/san.traineddata

Character/Word Error Rate	%
CER	3.73
WER	4.71
WER (order independent)	3.82

I hope the accuracy of the newly trained traineddata will improve further as the training converges.

Eval Report at https://shreeshrii.github.io/tess4eval-san/

Update:

Further training actually led to lower accuracy on this sample set.

Character/Word Error Rate	%
CER	6.65
WER	9.43
WER (order independent)	7.94

I stopped training with https://github.com/Shreeshrii/tess4training/blob/8dc4f8488e74d4a934168844932c8a526d70c1d9/bihtune.traineddata

Jul 11 '17 08:07 Shreeshrii

I separately trained for Vedic Sanskrit using text from Rigveda.

The source files are at https://github.com/Shreeshrii/tess4training-vedic

Resulting traineddata files are as follows:

https://github.com/Shreeshrii/tess4training/blob/master/docs/vedic.traineddata
https://github.com/Shreeshrii/tess4train/blob/master/docs/rig.traineddata

The accuracy on a sample page (https://github.com/Shreeshrii/tess4training/blob/master/scanned.tulasi.exp0.tif) when I stopped training is shown in:

https://github.com/Shreeshrii/tess4training/blob/master/docs/BRH-test-vedic_out.html
https://github.com/Shreeshrii/tess4train/blob/master/docs/BRH-test-rig_out.html

A special thank you to the Travis team for providing the resources for training.

Jul 20 '17 15:07 Shreeshrii

A user who tested the above traineddata on a sample of 30 pages reported an accuracy of 90%.

@theraysmith I hope that the new version of Sanskrit traineddata will be able to OCR both Classical Sanskrit and Vedic Sanskrit. When should we expect the new (4.0.0beta) version of traineddata files?

Jul 20 '17 15:07 Shreeshrii

Samples of Samaveda Sanskrit text follow. It uses a different set of Vedic accents from the Rigveda sample above. I have not tried training for this.

http://sanskrit.safire.com/image/SamaVeda.gif

http://sanskritweb.net/samaveda/sample.gif

http://vedicreserve.mum.edu/sama_veda/sama_veda.pdf

Update:

Unicode Samaveda text available from http://www.parankusa.org/SamaBrowse.aspx

Jul 26 '17 06:07 Shreeshrii

How do I train for Vedic Sanskrit? Please help me with a step-by-step process.

Mar 03 '19 09:03 ksdmahesh

@ksdmahesh What kind of text do you want to train for? You will need a large text corpus in UTF-8 format and Unicode fonts that render it correctly.
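
As a rough way to check whether a corpus actually contains Vedic accent marks before training, something like the sketch below may help. The code point ranges are taken from the Unicode Devanagari, Vedic Extensions, and Devanagari Extended blocks; the function names and the sample string are illustrative only, and this does not check whether a given font renders the marks correctly.

```python
import unicodedata
from collections import Counter

# Code point ranges that hold Vedic accents and related signs.
VEDIC_RANGES = [
    (0x0951, 0x0952),  # Devanagari stress signs udatta and anudatta
    (0x1CD0, 0x1CFF),  # Vedic Extensions block
    (0xA8E0, 0xA8FF),  # Devanagari Extended block
]


def is_vedic_mark(ch):
    """True if the character falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VEDIC_RANGES)


def vedic_mark_counts(text):
    """Count each Vedic mark that occurs in the text."""
    return Counter(ch for ch in text if is_vedic_mark(ch))


def report(text):
    """Print one line per mark: code point, Unicode name, count."""
    for ch, n in vedic_mark_counts(text).most_common():
        name = unicodedata.name(ch, "UNNAMED")
        print(f"U+{ord(ch):04X} {name}: {n}")


# Illustrative sample: a short accented Devanagari phrase.
sample = "\u0905\u0952\u0917\u094d\u0928\u093f\u092e\u0940\u0951\u0933\u0947"
report(sample)
```

To run it on a real corpus, read the file with `open(path, encoding="utf-8")` and pass its contents to `report`. A corpus that reports no marks at all is unlikely to teach the model Vedic accents.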

Mar 05 '19 04:03 Shreeshrii

OK, thank you. After collecting the data, how do I start?

Mar 05 '19 04:03 ksdmahesh

Sorry for the delay in replying.

See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 for the details about training.

I am doing a test training run using portions of online texts of the Rigveda and Yajurveda. (I have deleted the GitHub repos referenced in earlier messages in this thread.) I will share the traineddata file when done.

Mar 29 '19 08:03 Shreeshrii

@Shreeshrii, good work. How close are you to finishing training of the Rigveda and Yajurveda models? Can you share the traineddata files? Regards, Bohar

Jun 22 '20 14:06 bohrbrar

@Shreeshrii Please share the traineddata file for Vedic pandulipi Sanskrit.

Aug 27 '20 12:08 sd-dwivedi