
Does it support the Spanish language?

Open wilfoderek opened this issue 1 year ago • 7 comments

Excellent work, guys! My question is: does this model support the Spanish language, or which languages does it support? Can it be trained in Spanish? How much time and how many resources would be necessary for this purpose?

Have an excellent day to all the team!

wilfoderek avatar Jan 17 '24 14:01 wilfoderek

Hi, currently our training datasets mainly contain English corpora, so not much Spanish was seen during pretraining. However, you could collect over ~50B tokens of high-quality Spanish corpus, mix it with the SlimPajama corpus, and continually pretrain our model on that data.
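The mixing step described above can be sketched as follows. This is a minimal, hypothetical illustration (the `mix_corpora` helper and the 30% Spanish ratio are assumptions, not part of the TinyLlama codebase); in practice one would interleave tokenized shards with a data-loading library rather than Python lists.

```python
def mix_corpora(spanish_docs, english_docs, spanish_ratio=0.3):
    """Deterministically interleave two document lists so that roughly
    `spanish_ratio` of the output comes from the Spanish corpus.

    The output is truncated when either side runs out, so the ratio
    stays consistent across the whole training stream.
    """
    # Largest mixed length achievable without exhausting either corpus.
    n = int(min(len(spanish_docs) / spanish_ratio,
                len(english_docs) / (1 - spanish_ratio)))
    mixed, es_taken = [], 0
    for i in range(n):
        # How many Spanish docs should have been emitted by step i+1.
        target_es = int((i + 1) * spanish_ratio)
        if es_taken < target_es:
            mixed.append(spanish_docs[es_taken])
            es_taken += 1
        else:
            mixed.append(english_docs[i - es_taken])
    return mixed
```

For a streaming setup, libraries such as Hugging Face `datasets` offer an equivalent `interleave_datasets` with per-source probabilities.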

ChaosCodes avatar Feb 08 '24 14:02 ChaosCodes

How much would it cost to train a TinyLlama in Spanish?

wilfoderek avatar Feb 08 '24 15:02 wilfoderek

Thanks for your answer

wilfoderek avatar Feb 08 '24 15:02 wilfoderek

How much would it cost to train a TinyLlama in Spanish?

It depends on your token count. For example, ~250B tokens takes about half a month on 8 A40s.

ChaosCodes avatar Feb 08 '24 17:02 ChaosCodes

Considering current prices and your estimated time, that's approximately $3,168.
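The figure above can be reproduced with a back-of-the-envelope calculation. The per-GPU-hour price below is an assumption chosen to match the quoted total; actual cloud prices for A40s vary widely by provider.

```python
def training_cost(num_gpus=8, days=15, price_per_gpu_hour=1.10):
    """Estimate total rental cost: GPUs * hours * hourly price.

    Defaults reflect the thread's estimate: 8 A40s for ~half a month
    at an assumed ~$1.10 per GPU-hour.
    """
    gpu_hours = num_gpus * days * 24  # 8 * 360 = 2880 GPU-hours
    return gpu_hours * price_per_gpu_hour  # roughly $3,168
```

Scaling the token budget (e.g. 100B instead of 250B) scales the days, and hence the cost, roughly linearly.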

wilfoderek avatar Feb 08 '24 17:02 wilfoderek

I am not sure how many tokens are required to get a good continually pretrained model for Spanish. It may be less than 250B. Sorry, I have no experience with that.

ChaosCodes avatar Feb 08 '24 17:02 ChaosCodes

Thank you for the explanations and your awesome model. I have a small question about mixing a non-English corpus with SlimPajama. Is it mandatory? In what proportion should it be done? If I have a book corpus, can I split it by sentence and train with a small context size (32-64?) using max_length padding?

demetera avatar Feb 13 '24 12:02 demetera