How to train TrOCR for a new language

Open StephennFernandes opened this issue 2 years ago • 5 comments

Hey there, I need TrOCR to work for the Kannada language. I was able to find a Kannada BERT model on the Hugging Face Hub. How do I train TrOCR for Kannada, and how do I generate a dataset to train the model? Any references would be highly appreciated.

thanks

StephennFernandes, Feb 01 '22 08:02

@StephennFernandes Basically, you need to prepare training data for Kannada. If you have any documents written in Kannada, you can use those. Otherwise, you can generate training data from Wikipedia or other born-digital documents.
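As a rough sketch of the first step of generating data from born-digital text, the snippet below cuts a raw corpus into short line-level text samples that could later be rendered into images. The function name `corpus_to_lines` and the `max_chars` limit are made up for illustration; nothing here comes from the TrOCR code base.

```python
import unicodedata


def corpus_to_lines(text, max_chars=40):
    """Split raw corpus text into short line-level samples.

    Length is counted in Unicode code points, not visual glyphs, so for
    Kannada the limit is only an approximation of the rendered width.
    """
    # Normalize so that equivalent character sequences yield the same label.
    text = unicodedata.normalize("NFC", text)
    samples = []
    for paragraph in text.splitlines():
        current = ""
        for word in paragraph.split():
            candidate = word if not current else current + " " + word
            # Always accept the first word, even if it exceeds the limit.
            if len(candidate) <= max_chars or not current:
                current = candidate
            else:
                samples.append(current)
                current = word
        if current:
            samples.append(current)
    return samples


print(corpus_to_lines("one two three four", max_chars=7))
# ['one two', 'three', 'four']
```

Each returned string would become the ground-truth label for one rendered text-line image.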

wolfshow, Feb 03 '22 13:02

Hi, where can I download pretrained models for Japanese, Korean, etc.?

nissansz, Feb 14 '22 14:02

Fine-tuning is OK. TrOCR's tokenizer is BPE-based.
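One possible reading of this remark, assuming the checkpoint ships a RoBERTa-style byte-level BPE tokenizer (an assumption; check the tokenizer of the checkpoint you use): any Unicode string can be broken down to UTF-8 bytes, so text in a new language can always be encoded, just with many more symbols per character. The sketch below shows only that byte-level fallback, not the BPE merge step, and `utf8_symbols` is a made-up helper name.

```python
def utf8_symbols(text):
    """Return the UTF-8 bytes of `text` as a list of ints in 0..255."""
    return list(text.encode("utf-8"))


english = "language"
# The word "Kannada" in Kannada script, written with escapes so the
# code points are unambiguous: KA, NA, VIRAMA, NA, DDA.
kannada = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"

print(len(english), len(utf8_symbols(english)))  # 8 8
print(len(kannada), len(utf8_symbols(kannada)))  # 5 15
```

Each Kannada code point takes three bytes, so sequences get long unless the learned merges cover them. That is probably why fine-tuning on enough target-language data (or swapping in a tokenizer and decoder trained for that language) matters.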

wenyinlong, Mar 07 '22 07:03

Hello @wenyinlong, what do you mean when you say that fine-tuning is OK because the TrOCR tokenizer uses BPE? I want to train on handwritten text in Indonesian. I thought that because the encoder and decoder in TrOCR were trained on English text, text in other languages would be difficult to recognize properly. Could you explain?

dhea1323, Apr 28 '22 04:04

I have a ton of text corpora available for Kannada as well as other Indian languages. But how should the data be prepared and preprocessed in order to train the TrOCR model on other languages?

How should the sample pairs be generated? Using the text corpus, how do I produce line-level sample pairs (I believe that's how TrOCR was trained)?
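For what it's worth, here is a minimal sketch of one way line-level pairs could be produced: give each text line an image file name, write a tab-separated manifest, and render the line with Pillow. The file naming, the manifest layout, and all function names are hypothetical and not the format used by the TrOCR authors; adapt them to whatever your data loader expects. Rendering assumes Pillow 9.1 or later built with libraqm plus a Kannada-capable font file that you supply.

```python
import csv
import io


def build_pairs(lines, prefix="line"):
    """Pair each text line with a (hypothetical) image file name."""
    return [(f"{prefix}_{i:06d}.png", text) for i, text in enumerate(lines)]


def manifest_text(pairs):
    """Serialize (file_name, text) pairs as tab-separated rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(pairs)
    return buffer.getvalue()


def render_line(text, font_path, out_path, font_size=32, margin=8):
    """Render one text line to a grayscale PNG."""
    # Imported lazily so the pairing helpers above work without Pillow.
    from PIL import Image, ImageDraw, ImageFont, features

    # Kannada needs complex text shaping, which Pillow only does via libraqm.
    if not features.check("raqm"):
        raise RuntimeError("Pillow lacks libraqm; conjuncts would render wrongly")
    font = ImageFont.truetype(font_path, font_size,
                              layout_engine=ImageFont.Layout.RAQM)
    left, top, right, bottom = font.getbbox(text)
    size = (int(right - left) + 2 * margin, int(bottom - top) + 2 * margin)
    image = Image.new("L", size, color=255)  # white background
    ImageDraw.Draw(image).text((margin - left, margin - top), text,
                               font=font, fill=0)  # black text
    image.save(out_path)


pairs = build_pairs(["first line", "second line"])
print(manifest_text(pairs), end="")
```

In practice you would call `render_line(text, font_path, file_name)` for every pair, and probably vary fonts, sizes, and noise so the model does not overfit to one clean typeface.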

StephennFernandes, Apr 28 '22 08:04