Pre-training Donut for reading Cyrillic text
@gwkrsrch Thank you for a great project!
Could you please help me with pre-training Donut for Cyrillic text? For data generation I am using SynthDog. How much data is enough for the pre-training stage? I also noticed a problem with the tokenizer: I get some Chinese and other non-Cyrillic characters in the predictions. Am I supposed to retrain the decoder's tokenizer, or can I use AutoTokenizer.from_pretrained('ukr-models/uk-ner')? If you have other tips, such as configs for pre-training on a new language, please let me know.
Thanks in advance!
@Invalid-coder Hi, I trained Donut with SynthDog on Cyrillic text. You can see the results in this thread: https://github.com/clovaai/donut/issues/161#issuecomment-1483755275
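
On the tokenizer question, one quick check before deciding whether to retrain is to look at the vocabulary itself: which letters of the target alphabet never appear in any token, and which tokens contain CJK or Hangul characters that could leak into predictions. Below is a minimal sketch of that check using only the standard library. The names `char_script`, `vocab_report`, and `UKRAINIAN` are hypothetical helpers written for this illustration, not part of Donut or SynthDog, and the script detection is a rough heuristic based on Unicode character names.

```python
import unicodedata

# Lowercase Ukrainian alphabet (33 letters); swap in another alphabet as needed.
UKRAINIAN = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"

# Scripts treated as "foreign" for a Cyrillic-only task (an assumption).
FOREIGN_SCRIPTS = {"CJK", "HANGUL", "HIRAGANA", "KATAKANA"}


def char_script(ch):
    """Return a coarse script label derived from the Unicode character name."""
    if not ch.isalpha():
        return "OTHER"
    name = unicodedata.name(ch, "")
    for script in ("CYRILLIC", "LATIN", "CJK", "HANGUL", "HIRAGANA", "KATAKANA"):
        if name.startswith(script):
            return script
    return "OTHER"


def vocab_report(tokens, alphabet):
    """Return (alphabet letters absent from every token, tokens with foreign scripts)."""
    wanted = set(alphabet)
    covered = set()
    foreign = []
    for tok in tokens:
        # Record which target letters this token contributes (case-insensitive).
        covered.update(c for c in tok.lower() if c in wanted)
        # Flag the token if any of its characters belongs to a foreign script.
        if {char_script(c) for c in tok} & FOREIGN_SCRIPTS:
            foreign.append(tok)
    missing = sorted(wanted - covered)
    return missing, foreign


if __name__ == "__main__":
    # Toy vocabulary for illustration only; a real one would come from the tokenizer.
    sample = ["▁при", "віт", "中文", "hello"]
    missing, foreign = vocab_report(sample, UKRAINIAN)
    print("missing letters:", "".join(missing))
    print("foreign tokens:", foreign)
```

With a Hugging Face tokenizer, the token list could be obtained with `list(tokenizer.get_vocab())`. If no letters are missing, the existing vocabulary can at least represent the text, and the stray characters are more likely a training or data issue than a coverage gap. Note that byte-level vocabularies do not store letters as readable characters, so this check would not apply to them directly.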