Training speed: Teacher to Student
Is there any data you can share on how long it took to train the student models with the recommended setup of 4 GPUs with 12GB of memory? (What GPU series and model are we talking about here?)
Months, weeks, days?
I'm interested in potentially contributing in the future, but I'd need to know what to expect before getting hardware to do so.
Cheers :)
It really depends on the size of the data used for distillation, because generating the n-best candidates takes a significant share of the time. If you are distilling from a single transformer-big in a mid-size language pair (5M to 40M sentences), I'd say about one week with a 12GB GPU. I trained some models with a 2080 Ti and it's affordable, unless you're training from a 2x or 4x ensemble of transformer-bigs for English-French.
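As a rough sanity check on the one-week figure, here's a back-of-envelope sketch of the n-best decoding cost alone. The throughput number is an assumption for illustration only, not a measured benchmark; actual speed depends heavily on model size, beam size, batching, and the GPU:

```python
# Back-of-envelope estimate of n-best candidate generation time.
# The sentences/sec figure is a hypothetical assumption, not a benchmark.

def decode_hours(n_sentences: int, sentences_per_sec: float) -> float:
    """Hours needed to decode the corpus once at the given throughput."""
    return n_sentences / sentences_per_sec / 3600

# Assume ~100 sentences/sec for a single transformer-big on a 12GB card.
for corpus in (5_000_000, 40_000_000):
    print(f"{corpus:>10,} sentences -> ~{decode_hours(corpus, 100):.0f} h")
```

At that assumed rate, the 5M-40M range works out to roughly 14-111 hours of pure decoding, which is consistent with the "about one week" estimate once training itself is added on top.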
EDIT: but if you are asking for a GPU model because you want to buy one, I'd suggest going for one of the newest generation with more RAM than that. With a 4090, you could probably do all the work I mentioned in half a week.
Awesome, thank you very much for that info. I don't at the moment have the budget for a 4090 or anything close to it, but hopefully that will change in the next 6 months or so. I'll be back.