
Can you provide the method to train on our own corpora using your version of fairseq?


I normally use IndicNLP to tokenize and Moses to train the MT system, but your model is giving better accuracy. Can you give an insight into the amount of corpus used to train the model? Thank you.

vishnu3741 · Dec 23 '20
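
For reference, the IndicNLP tokenization step mentioned above looks roughly like this. A minimal sketch; the language code and sample sentence are placeholders for your own data:

```python
# Tokenize one line of an Indic-language corpus with the Indic NLP Library.
from indicnlp.tokenize import indic_tokenize

# 'hi' (Hindi) and the sample sentence are placeholders; substitute your
# own language code and corpus lines.
for token in indic_tokenize.trivial_tokenize("यह एक उदाहरण वाक्य है ।", lang="hi"):
    print(token)
```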

Perhaps the paper linked below will answer the question about the corpus used.

Regarding the data/training:

jerinphilip · Dec 25 '20
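
For anyone looking for a concrete starting point, below is a minimal sketch of the standard fairseq recipe for training on your own parallel corpus. All paths, language codes, and hyperparameters are illustrative placeholders, not the settings used for the released model; ilmulti layers its own tokenization and multilingual handling on top of stock fairseq:

```python
# Sketch of the usual fairseq pipeline: binarize the data, then train.
# Assumes the corpus is already tokenized/segmented on both sides.
import subprocess

# Binarize the parallel data (data/train.src, data/train.tgt, etc. are
# hypothetical paths).
subprocess.run([
    "fairseq-preprocess",
    "--source-lang", "src", "--target-lang", "tgt",
    "--trainpref", "data/train", "--validpref", "data/valid",
    "--destdir", "data-bin",
    "--joined-dictionary",  # one shared vocabulary for both sides
], check=True)

# Train a transformer; roughly a day on 4 GPUs per the comment below.
subprocess.run([
    "fairseq-train", "data-bin",
    "--arch", "transformer",
    "--optimizer", "adam", "--lr", "5e-4",
    "--lr-scheduler", "inverse_sqrt", "--warmup-updates", "4000",
    "--criterion", "label_smoothed_cross_entropy", "--label-smoothing", "0.1",
    "--max-tokens", "4096",
    "--save-dir", "checkpoints",
], check=True)
```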

Hey, is there a way to add vocabulary (I mean words) to the model instead of retraining the entire model? Can we edit the files in mm-all-iter1 to do this?

vishnu3741 · Jan 07 '21

This paper might have some useful information, I think. I'd just retrain with the new vocabulary; the turnaround is approximately a day on 4 GPUs to start getting reasonable numbers. This one used 1080 Tis or 2080 Tis.

jerinphilip · Jan 07 '21
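
For completeness, here is a sketch of what editing the dictionary files would amount to, and why retraining is still the practical route. The file names under mm-all-iter1 are hypothetical; fairseq's Dictionary API is real:

```python
# Appending entries to a fairseq dictionary file.
from fairseq.data import Dictionary

d = Dictionary.load("mm-all-iter1/dict.txt")  # hypothetical dictionary path
d.add_symbol("newword")                        # appended at the end with count 1
d.save("mm-all-iter1/dict.extended.txt")

# Caveat: the trained model's embedding and output matrices are sized to the
# original vocabulary, so a checkpoint will not load against the extended
# dictionary without resizing those matrices, which is why retraining (or at
# least fine-tuning) with the new vocabulary is the advice above.
```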