Vedic Sanskrit Traineddata for 4.0

Open Shreeshrii opened this issue 8 years ago • 10 comments

  • See https://github.com/Shreeshrii/imagessan/tree/master/groundtruthimages for the images used for testing.

san.traineddata in this repo (4.0 alpha version) gives the following accuracy for the above sample of images:

Character/Word Error Rate %
CER 11.55
WER 8.85
WER (order independent) 7.44
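
For readers unfamiliar with these metrics, here is a minimal sketch of how CER and WER are commonly defined: the edit distance between reference and OCR output, as a percentage of the reference length. The thread does not say which evaluation tool produced the numbers, so the function names and the choice to count characters as Unicode code points (not grapheme clusters) are assumptions for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def cer(ref_text, hyp_text):
    # Assumption: one "character" is one Unicode code point.
    return error_rate(list(ref_text), list(hyp_text))


def wer(ref_text, hyp_text):
    # Assumption: words are whitespace-separated tokens.
    return error_rate(ref_text.split(), hyp_text.split())
```

For Devanagari, counting code points means a missed accent mark or matra counts as one character error, even though it changes only part of a visible syllable.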

Accuracy improved after training, using https://github.com/Shreeshrii/tess4training/blob/a09bbe913b25b0e623f1bc267a60b192d8e0ccc6/san.traineddata

Character/Word Error Rate %
CER 3.73
WER 4.71
WER (order independent) 3.82

I hope the accuracy of the newly trained traineddata will improve further as training converges.

Eval Report at https://shreeshrii.github.io/tess4eval-san/

Update:

Further training actually led to lower accuracy on this sample set.

Character/Word Error Rate %
CER 6.65
WER 9.43
WER (order independent) 7.94
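
Because later checkpoints can score worse on a held-out sample, as the figures above show, one option is to evaluate each checkpoint and keep the one with the lowest error instead of the latest. A minimal sketch using the three CER values reported in this post; the dictionary labels and the `best_checkpoint` helper are hypothetical names chosen for illustration.

```python
# CER (%) on the same sample set, copied from the three tables in this post.
# The labels are illustrative, not file names.
cer_by_checkpoint = {
    "4.0 alpha san.traineddata": 11.55,
    "after training": 3.73,
    "after further training": 6.65,
}


def best_checkpoint(scores):
    """Return the label with the lowest error rate."""
    return min(scores, key=scores.get)


print(best_checkpoint(cer_by_checkpoint))
```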

I stopped training with https://github.com/Shreeshrii/tess4training/blob/8dc4f8488e74d4a934168844932c8a526d70c1d9/bihtune.traineddata

Shreeshrii avatar Jul 11 '17 08:07 Shreeshrii

I separately trained for Vedic Sanskrit using text from Rigveda.

The source files are at https://github.com/Shreeshrii/tess4training-vedic

Resulting traineddata files are as follows:

  • https://github.com/Shreeshrii/tess4training/blob/master/docs/vedic.traineddata
  • https://github.com/Shreeshrii/tess4train/blob/master/docs/rig.traineddata

The accuracy on a sample page (https://github.com/Shreeshrii/tess4training/blob/master/scanned.tulasi.exp0.tif) when I stopped training was:

  • https://github.com/Shreeshrii/tess4training/blob/master/docs/BRH-test-vedic_out.html
  • https://github.com/Shreeshrii/tess4train/blob/master/docs/BRH-test-rig_out.html

A special thank you to the Travis team for providing the resources for training.

Shreeshrii avatar Jul 20 '17 15:07 Shreeshrii

A user who tested a sample of 30 pages with the above traineddata reported an accuracy of 90%.

@theraysmith I hope that the new version of Sanskrit traineddata will be able to OCR both Classical Sanskrit and Vedic Sanskrit. When should we expect the new (4.0.0beta) version of traineddata files?

Shreeshrii avatar Jul 20 '17 15:07 Shreeshrii

Here are samples of Samaveda Sanskrit text. It uses a different set of Vedic accents from the Rigveda sample above. I have not tried training for this.

http://sanskrit.safire.com/image/SamaVeda.gif

http://sanskritweb.net/samaveda/sample.gif

http://vedicreserve.mum.edu/sama_veda/sama_veda.pdf

Update:

Unicode Samaveda text is available from http://www.parankusa.org/SamaBrowse.aspx

Shreeshrii avatar Jul 26 '17 06:07 Shreeshrii

How do I train for Vedic Sanskrit? Please help me with a step-by-step process.

ksdmahesh avatar Mar 03 '19 09:03 ksdmahesh

@ksdmahesh What kind of text do you want to train for? You will need a large text corpus in UTF-8 format and Unicode fonts that render it correctly.
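
As a rough way to check the corpus side of that requirement, the sketch below decodes the bytes strictly as UTF-8 and counts how many characters fall in the Unicode blocks that Vedic text typically uses. The helper names are hypothetical, and this does not test whether a font renders the text correctly; that still has to be checked by rendering sample lines with the chosen fonts.

```python
from collections import Counter

# Unicode block ranges relevant to accented Vedic text in Devanagari.
BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Vedic Extensions": (0x1CD0, 0x1CFF),
    "Devanagari Extended": (0xA8E0, 0xA8FF),
}


def decode_utf8(raw):
    """Decode strictly; raises UnicodeDecodeError if raw is not valid UTF-8."""
    return raw.decode("utf-8", errors="strict")


def block_counts(text):
    """Count non-space characters per block; anything else goes to 'Other'."""
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        cp = ord(ch)
        for name, (lo, hi) in BLOCKS.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
        else:
            counts["Other"] += 1
    return counts


# Usage (hypothetical file name):
# with open("corpus.txt", "rb") as f:
#     print(block_counts(decode_utf8(f.read())))
```

A large "Other" count may point to text in a legacy font encoding or mixed scripts, which would need converting to Unicode before training.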

Shreeshrii avatar Mar 05 '19 04:03 Shreeshrii

OK, thank you. After collecting the data, how do I start?

ksdmahesh avatar Mar 05 '19 04:03 ksdmahesh

Sorry for the delay in replying.

See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 for the details about training.

I am doing a test training run using portions of online texts of the Rigveda and Yajurveda. (I had deleted the GitHub repos referenced in earlier messages in this thread.) I will share the traineddata file when done.

Shreeshrii avatar Mar 29 '19 08:03 Shreeshrii

@Shreeshrii, good work. How close are you to finishing the Rigveda and Yajurveda models? Can you share the traineddata files? Regards, Bohar

bohrbrar avatar Jun 22 '20 14:06 bohrbrar

@Shreeshrii Please share the traineddata file for Vedic pandulipi Sanskrit.

sd-dwivedi avatar Aug 27 '20 12:08 sd-dwivedi