Pre-training Donut for reading Cyrillic text

Open Invalid-coder opened this issue 3 years ago • 1 comment

@gwkrsrch thank you for a great project!

Could you please help me with pre-training Donut for Cyrillic text? For data generation I am using SynthDog. How much data will be enough for the pre-training stage? I also noticed a problem with the tokenizer: I got some Chinese and other non-Cyrillic characters in predictions. Am I supposed to retrain the tokenizer in the decoder? Or can I use AutoTokenizer.from_pretrained('ukr-models/uk-ner')? If you have other tips, such as configs for pre-training on a new language, please let me know.
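One way to narrow down the tokenizer question is to measure two things: which Cyrillic characters the decoder tokenizer maps to its unknown token, and which non-Cyrillic letters show up in predictions. The sketch below is only an illustration under stated assumptions. `ToyTokenizer` is a hypothetical stand-in so the example runs offline; it mimics the `encode(..., add_special_tokens=False)` method and `unk_token_id` attribute that Hugging Face tokenizers expose. The helper names and the character list are also made up for this example.

```python
import unicodedata


def non_cyrillic_letters(text):
    """Return the letters in `text` whose Unicode name is not Cyrillic."""
    return [
        ch for ch in text
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("CYRILLIC")
    ]


def chars_mapped_to_unk(tokenizer, chars):
    """Return the characters that the tokenizer encodes to its unknown-token id."""
    unk_id = tokenizer.unk_token_id
    return [
        ch for ch in chars
        if unk_id in tokenizer.encode(ch, add_special_tokens=False)
    ]


class ToyTokenizer:
    """Hypothetical character-level tokenizer, used only to run the demo offline."""

    unk_token_id = 0

    def __init__(self, known_chars):
        # Id 0 is reserved for the unknown token.
        self.vocab = {ch: i + 1 for i, ch in enumerate(known_chars)}

    def encode(self, text, add_special_tokens=False):
        return [self.vocab.get(ch, self.unk_token_id) for ch in text]


# Assumption: basic Cyrillic letters U+0410..U+044F plus the extra Ukrainian letters.
CYRILLIC = [chr(c) for c in range(0x0410, 0x0450)] + list("ЄІЇҐєіїґ")

toy = ToyTokenizer(known_chars="абв")
print(chars_mapped_to_unk(toy, "абвг"))      # ['г']
print(non_cyrillic_letters("Привіт 中文"))   # ['中', '文']
```

With a real tokenizer object passed in place of `toy`, an empty result from `chars_mapped_to_unk(tokenizer, CYRILLIC)` would suggest the vocabulary already covers the alphabet, in which case the stray characters are more likely a sign of insufficient pre-training than of a missing vocabulary. Tokenizers with byte-level fallback never emit the unknown token, so this check would not be informative for them.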

Thanks in advance!

Feb 23 '23 19:02 Invalid-coder

@Invalid-coder Hi, I trained Donut with SynthDog on Cyrillic text. You can see the results in this thread: https://github.com/clovaai/donut/issues/161#issuecomment-1483755275

Mar 25 '23 07:03 meugeny