
Does it support the Spanish language?

Open wilfoderek opened this issue 1 year ago • 7 comments

Excellent work, guys! My question is: does this model support the Spanish language, or which languages does it support? Can it be trained in Spanish? How much time and how many resources would be necessary for this purpose?

Have an excellent day to all the team!

wilfoderek avatar Jan 17 '24 14:01 wilfoderek

Hi, currently our training datasets mainly contain English corpora, so not much Spanish was seen during pretraining. However, you could collect over ~50B tokens of high-quality Spanish corpus, mix it with the SlimPajama corpus, and continually pretrain our model on that data.
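The mixing step described above can be sketched as follows. This is a minimal, hypothetical illustration (the `mix_corpora` helper and the 30% Spanish ratio are assumptions, not part of the TinyLlama codebase); in practice one would interleave tokenized shards with a data-loading library rather than Python lists.

```python
def mix_corpora(spanish_docs, english_docs, spanish_ratio=0.3):
    """Deterministically interleave two document lists so that roughly
    `spanish_ratio` of the output comes from the Spanish corpus.

    The output is truncated when either side runs out, so the ratio
    stays consistent across the whole training stream.
    """
    # Largest mixed length achievable without exhausting either corpus.
    n = int(min(len(spanish_docs) / spanish_ratio,
                len(english_docs) / (1 - spanish_ratio)))
    mixed, es_taken = [], 0
    for i in range(n):
        # How many Spanish docs should have been emitted by step i+1.
        target_es = int((i + 1) * spanish_ratio)
        if es_taken < target_es:
            mixed.append(spanish_docs[es_taken])
            es_taken += 1
        else:
            mixed.append(english_docs[i - es_taken])
    return mixed
```

For a streaming setup, libraries such as Hugging Face `datasets` offer an equivalent `interleave_datasets` with per-source probabilities.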

ChaosCodes avatar Feb 08 '24 14:02 ChaosCodes

How much would it cost to train a TinyLlama in Spanish?

wilfoderek avatar Feb 08 '24 15:02 wilfoderek

Thanks for your answer

wilfoderek avatar Feb 08 '24 15:02 wilfoderek

How much would it cost to train a TinyLlama in Spanish?

It depends on your token count. For example, ~250B tokens takes about half a month on 8 A40s.

ChaosCodes avatar Feb 08 '24 17:02 ChaosCodes

Considering current prices and your estimated time, that's approximately $3,168.
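The figure above can be reproduced with a back-of-the-envelope calculation. The per-GPU-hour price below is an assumption chosen to match the quoted total; actual cloud prices for A40s vary widely by provider.

```python
def training_cost(num_gpus=8, days=15, price_per_gpu_hour=1.10):
    """Estimate total rental cost: GPUs * hours * hourly price.

    Defaults reflect the thread's estimate: 8 A40s for ~half a month
    at an assumed ~$1.10 per GPU-hour.
    """
    gpu_hours = num_gpus * days * 24  # 8 * 360 = 2880 GPU-hours
    return gpu_hours * price_per_gpu_hour  # roughly $3,168
```

Scaling the token budget (e.g. 100B instead of 250B) scales the days, and hence the cost, roughly linearly.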

wilfoderek avatar Feb 08 '24 17:02 wilfoderek

I am not sure how many tokens are required to get a good continually pretrained model for Spanish. It may be less than 250B. Sorry, I have no experience with that.

ChaosCodes avatar Feb 08 '24 17:02 ChaosCodes

Thank you for the explanations and your awesome model. I have a small question about mixing a non-English corpus with SlimPajama. Is it mandatory? In what proportion should it be done? If I have a book corpus, can I split it by sentence and train with a small context size (32-64?) using max_length padding?

demetera avatar Feb 13 '24 12:02 demetera