unilm how to train TrOCR for a new Language

Hey there, I need TrOCR to work for the Kannada language. I was able to find a Kannada BERT model on the Hugging Face Hub. How do I train TrOCR for Kannada, and how do I generate a dataset to train the model? Any references would be highly appreciated.

thanks

Feb 01 '22 08:02 StephennFernandes

@StephennFernandes Basically, you need to prepare the training data for Kannada. If you have any documents written in Kannada, you may use those. Otherwise, you can generate the training data from Wikipedia or other born-digital documents.

Feb 03 '22 13:02 wolfshow
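
As a rough illustration of preparing text from Wikipedia or other born-digital sources, here is a minimal sketch that keeps only lines written mostly in Kannada script. The function names and the 0.8 threshold are assumptions made for this example, not part of TrOCR or unilm; the only fixed fact used is that the Unicode Kannada block spans U+0C80 to U+0CFF.

```python
import unicodedata
from typing import Iterable, Iterator

# Unicode Kannada block.
KANNADA_START, KANNADA_END = 0x0C80, 0x0CFF


def kannada_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Kannada block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(KANNADA_START <= ord(c) <= KANNADA_END for c in chars)
    return hits / len(chars)


def filter_corpus(lines: Iterable[str], min_ratio: float = 0.8) -> Iterator[str]:
    """Yield cleaned lines that are mostly Kannada (hypothetical helper).

    min_ratio is an arbitrary threshold chosen for this sketch.
    """
    for raw in lines:
        # NFC gives one canonical form; split/join collapses runs of whitespace.
        line = " ".join(unicodedata.normalize("NFC", raw).split())
        if line and kannada_ratio(line) >= min_ratio:
            yield line


if __name__ == "__main__":
    sample = ["ಕನ್ನಡ  ಭಾಷೆ", "English only", "", "ಕನ್ನಡ abc"]
    for kept in filter_corpus(sample):
        print(kept)
```

Digits and punctuation count against the ratio here, so the threshold may need tuning for real text.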

Hi, where can I download pretrained models for Japanese, Korean, etc.?

Feb 14 '22 14:02 nissansz

Fine-tuning should be fine. TrOCR's tokenizer uses BPE.

Mar 07 '22 07:03 wenyinlong

Hello @wenyinlong, what do you mean by saying that fine-tuning is enough because the TrOCR tokenizer uses BPE? I want to train on handwritten text in Indonesian. I thought that because the encoder-decoder in TrOCR was trained on English text, text in other languages would be difficult to recognize properly. Could you explain?

Apr 28 '22 04:04 dhea1323
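
One way to read the BPE remark, assuming the tokenizer is a byte-level BPE (as RoBERTa-style tokenizers are; this is not confirmed anywhere in this thread): such a tokenizer can represent any UTF-8 string without unknown tokens, but text in a non-Latin script starts from more bytes per character and may split into more tokens. The sketch below only measures UTF-8 bytes per character; it does not call the real TrOCR tokenizer, and the function name is made up for illustration.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per non-whitespace character."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        raise ValueError("text has no non-whitespace characters")
    return sum(len(c.encode("utf-8")) for c in chars) / len(chars)


if __name__ == "__main__":
    # Indonesian uses the Latin alphabet: one byte per character here.
    print(utf8_bytes_per_char("selamat pagi"))
    # Kannada code points (U+0C80 to U+0CFF) take three bytes each in UTF-8.
    print(utf8_bytes_per_char("ಕನ್ನಡ"))
```

Under that assumption, Indonesian text starts from the same byte alphabet as English, while Kannada starts from longer byte sequences, so how well fine-tuning alone works may differ between the two.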

I have a ton of text corpora available for Kannada as well as other Indian languages. But how should the data be preprocessed in order to train the TrOCR model on other languages?

How should the sample pairs be generated? Using the text corpus, how do I produce line-level sample pairs (I believe that's how TrOCR was trained)?

Apr 28 '22 08:04 StephennFernandes
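
A minimal sketch of one way to turn a text corpus into line-level (image file, transcription) pairs, assuming that is the format wanted for training. Everything here is hypothetical: `split_into_lines`, `build_pairs`, the `labels.tsv` file name, and the 40-character limit are choices made for this example, not the layout used by the TrOCR authors. Rendering is left to a caller-supplied `render(text, path)` function.

```python
import csv
import unicodedata
from pathlib import Path
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def split_into_lines(text: str, max_chars: int = 40) -> Iterator[str]:
    """Greedily pack whitespace-separated words into lines of <= max_chars.

    A single word longer than max_chars is emitted on its own line.
    """
    words = unicodedata.normalize("NFC", text).split()
    current: List[str] = []
    length = 0
    for word in words:
        extra = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and length + extra > max_chars:
            yield " ".join(current)
            current, length = [], 0
            extra = len(word)
        current.append(word)
        length += extra
    if current:
        yield " ".join(current)


def build_pairs(
    corpus: Iterable[str],
    out_dir: Path,
    render: Optional[Callable[[str, Path], None]] = None,
    max_chars: int = 40,
) -> List[Tuple[str, str]]:
    """Write labels.tsv (image name, text) and optionally render each line."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pairs: List[Tuple[str, str]] = []
    for paragraph in corpus:
        for line in split_into_lines(paragraph, max_chars):
            name = f"line_{len(pairs):07d}.png"
            if render is not None:
                render(line, out_dir / name)  # caller draws the text image
            pairs.append((name, line))
    with open(out_dir / "labels.tsv", "w", encoding="utf-8", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(pairs)
    return pairs
```

Whatever `render` does, it needs a font and a text-shaping engine that handle Kannada conjuncts correctly; otherwise the images will not match their labels. Varying fonts, sizes, and noise in `render` is a common way to make synthetic lines less uniform.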
