TinyLlama
Does it support the Spanish language?
Excellent work, guys! My question is: does this model support the Spanish language, and which languages does it support? Can it be trained in Spanish? How much time and how many resources would that require?
Wishing the whole team an excellent day!
Hi, currently our training datasets mainly contain English corpus, so not much Spanish was seen during pretraining. However, you could collect over 50B tokens of high-quality Spanish corpus, mix it with the SlimPajama corpus, and continually pretrain our model on that data.
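The mixing step above can be sketched in plain Python. This is only an illustration, not TinyLlama's actual data pipeline: the 30% ratio, the `mix_corpora` helper, and the list-of-documents representation are all assumptions for the demo.

```python
import random

def mix_corpora(primary, secondary, primary_ratio=0.3, seed=0):
    """Interleave two document streams at a target sampling ratio.

    `primary` stands in for the new Spanish corpus, `secondary` for the
    original SlimPajama data. The ratio is illustrative; in practice it
    would be tuned to balance new-language learning against forgetting.
    """
    rng = random.Random(seed)
    p, s = iter(primary), iter(secondary)
    out = []
    while True:
        source = p if rng.random() < primary_ratio else s
        try:
            out.append(next(source))
        except StopIteration:
            break  # stop once either stream is exhausted
    return out

spanish = [f"es_doc_{i}" for i in range(100)]
slimpajama = [f"en_doc_{i}" for i in range(1000)]
mixed = mix_corpora(spanish, slimpajama, primary_ratio=0.3)
```

With a real training stack you would more likely use a streaming dataset loader (e.g. something like `interleave_datasets` in the Hugging Face `datasets` library) rather than materializing lists, but the sampling idea is the same.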
How much would it cost to train a TinyLlama in Spanish?
Thanks for your answer
How much would it cost to train a TinyLlama in Spanish?
It depends on your token count. For example, ~250B tokens takes about half a month on 8 A40s.
Considering current prices and your estimated time, that is approximately $3,168.
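The $3,168 figure above is consistent with a simple GPU-hour calculation. The hourly rate is an assumption (roughly $1.10 per A40-hour at cloud rental prices); the GPU count and duration come from the reply above.

```python
# Back-of-envelope cost check for the estimate above.
gpus = 8
hours = 15 * 24            # "half a month" of continuous training
rate_per_gpu_hour = 1.10   # assumed A40 cloud rental price, USD/hour
total = gpus * hours * rate_per_gpu_hour
print(f"${total:,.0f}")    # prints $3,168
```

Swapping in your own rental rate (spot prices for A40s vary widely between providers) changes the total proportionally.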
I am not sure how many tokens are required to achieve a good continually pretrained model for Spanish. Maybe it will be less than 250B. Sorry, I have no experience with that.
Thank you for the explanations and your awesome model. I have a small question about mixing a non-English corpus with SlimPajama: is it mandatory, and in what proportion should it be done? Also, if I have a book corpus, can I split it by sentence and train at a small context size (32-64 tokens?) with max_length padding?
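For clarity, the sentence-level setup described in the question would look roughly like this. Everything here is a toy sketch: the `pad_sentences` helper, the character-level stand-in tokenizer, and `pad_id=0` are all assumptions, not part of TinyLlama's training code.

```python
def pad_sentences(sentences, tokenize, max_length=64, pad_id=0):
    """Sketch of the short-context setup from the question: each sentence
    becomes one training example, truncated then padded to max_length.

    `tokenize` is a stand-in for a real tokenizer's encode method.
    """
    batch = []
    for s in sentences:
        ids = tokenize(s)[:max_length]           # truncate long sentences
        ids += [pad_id] * (max_length - len(ids))  # right-pad short ones
        batch.append(ids)
    return batch

toy_tokenize = lambda s: [ord(c) for c in s]  # toy tokenizer for the demo
batch = pad_sentences(["Hola mundo.", "Un libro."], toy_tokenize, max_length=16)
```

Note that heavy padding at such short lengths spends much of each batch on pad tokens; whether that trains as well as packing full-length documents into the model's native context window is exactly the open question being asked here.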