
Not an issue but a question: How do I pre-train Falcon on a completely new language?

Open thusinh1969 opened this issue 1 year ago • 4 comments

Hi,

I have been pre-training and fine-tuning LLaMA-7B on Vietnamese, and so far so good; we just do not yet have enough compute to go further. We extended the vocab to 50k tokens so that it adequately covers the Vietnamese dictionary. We pre-train on a large Vietnamese corpus of nearly 1B tokens (raw text), then fine-tune with LoRA using a translated Alpaca dataset and a few other downstream tasks.
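The vocab-extension step described above boils down to growing the model's embedding matrix while keeping the pretrained rows intact. Here is a minimal PyTorch sketch of that idea (not Lit-Parrot's API; `extend_embedding` and the mean-initialization heuristic for the new rows are illustrative assumptions):

```python
import torch
import torch.nn as nn

def extend_embedding(old_emb: nn.Embedding, new_vocab_size: int) -> nn.Embedding:
    """Copy pretrained embedding rows into a larger table.

    New rows are initialized to the mean of the old rows (a common
    heuristic); they must still be learned during continued pre-training.
    """
    old_vocab, dim = old_emb.weight.shape
    new_emb = nn.Embedding(new_vocab_size, dim)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = old_emb.weight          # keep pretrained rows
        new_emb.weight[old_vocab:] = old_emb.weight.mean(0)  # init new rows
    return new_emb

# Illustrative sizes: LLaMA-7B uses a 32k vocab with 4096-dim embeddings;
# the post extends the vocab to ~50k for Vietnamese coverage.
old = nn.Embedding(32000, 4096)
new = extend_embedding(old, 50000)
```

The tied output head (lm_head) would need the same treatment, and the extended tokenizer must assign the new Vietnamese tokens IDs starting at the old vocab size so the copied rows stay aligned.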

Can we do the same with Falcon and other models using Lit-Parrot? Is there any script we can refer to for this journey?

Thanks, Steve

thusinh1969 avatar Jun 16 '23 14:06 thusinh1969