
Effort of re-training model for another language

Open aszego opened this issue 5 years ago • 1 comment

Using the code in this repo, what effort would it take to train the model for a different Latin-script language such as Hungarian? Being a newbie to ML and Python, I can't even estimate this.

Some questions that came to mind:

  • Is the repo complete in this sense? I.e., is any Python development needed to support a new language?
  • How much work is creating the training dataset? Under the dataset folders there are sets ranging from a few hundred words (e.g. data.zip\raw\breta\cz_raw) to about five thousand (raw\breta\words). Is that a few pages of handwritten text, or hundreds?
  • As I understood it, your src/data/data_creation scripts help with creating these labelled images, right?
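
For context, a labelled dataset here just means word images paired with their transcriptions. Below is a minimal loading sketch, assuming a hypothetical layout in which each image file carries its label in the filename; the actual layout under the data folders may differ.

```python
# Hypothetical loader: assumes files named "<label>_<index>.png",
# e.g. "kutya_0001.png" -- the real layout in this repo may differ.
import os

def load_labelled_words(folder):
    """Return a list of (image_path, label) pairs."""
    samples = []
    for name in sorted(os.listdir(folder)):
        if name.endswith('.png'):
            label = name.rsplit('_', 1)[0]  # strip the "_<index>.png" part
            samples.append((os.path.join(folder, name), label))
    return samples

# Example: pairs = load_labelled_words('data/raw/hu_raw')
```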

aszego · Mar 01 '19 16:03

Hi, the situation is not perfect yet. I already have code for training ML models on the current datasets, and with some code changes you could add your own data and use it for training. A better option would be to take a pretrained model and fine-tune its parameters on the new dataset; this approach should give better results, but it requires some coding. I am not sure what a good amount of data is for decent recognition. I would say one standard page is about 300 words. So far I have been labeling the words manually using the scripts, but this approach is a bit slow.
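
To make the fine-tuning option concrete, here is a minimal sketch in TensorFlow 1.x style (matching the repo's era). The checkpoint path, the tensor names, and the hungarian_batches generator are all assumptions for illustration; inspect the actual saved graph to find the real names.

```python
# Fine-tuning sketch (TF 1.x style). All names below are illustrative;
# the real checkpoint files and tensor names in this repo may differ.
import tensorflow as tf

def hungarian_batches():
    # Placeholder: yield (images, labels) batches of your new dataset.
    yield from []

with tf.Session() as sess:
    # Restore the pretrained graph and its weights from a checkpoint.
    saver = tf.train.import_meta_graph('models/word-model.meta')
    saver.restore(sess, 'models/word-model')

    graph = tf.get_default_graph()
    # Look up the input, label, and training tensors by name.
    inputs = graph.get_tensor_by_name('inputs:0')
    labels = graph.get_tensor_by_name('labels:0')
    train_op = graph.get_operation_by_name('train_op')

    # Continue training on the new-language data (fine-tuning).
    for images, targets in hungarian_batches():
        sess.run(train_op, feed_dict={inputs: images, labels: targets})

    # Save the tuned weights under a new name.
    saver.save(sess, 'models/word-model-hu')
```

When the new dataset is small, a common variant is to freeze the early convolutional layers and retrain only the later layers, which reduces the risk of overfitting.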

Breta01 · Mar 02 '19 19:03