litgpt
Not an issue but a question: How do I pre-train Falcon on a completely new language?
Hi,
I have been pre-training and fine-tuning LLaMA-7B on Vietnamese, and so far it has gone well; we simply do not yet have enough computational capacity to go further. We extended the vocabulary to 50k tokens so that it adequately covers the Vietnamese dictionary. We pre-train on a large Vietnamese corpus of nearly 1B tokens (raw text), and then fine-tune with LoRA using a translated Alpaca dataset and a few other downstream tasks.
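For context, one piece of the vocab-extension step above is resizing the model's token embedding table so it matches the new tokenizer size. Here is a minimal, hedged sketch of that in plain PyTorch (the function name `resize_token_embeddings` and the mean-initialization strategy are my own illustration, not a Lit-Parrot API):

```python
import torch
import torch.nn as nn

def resize_token_embeddings(emb: nn.Embedding, new_vocab_size: int) -> nn.Embedding:
    """Grow an embedding table: copy the pretrained rows and
    initialize the new rows from the mean of the existing ones."""
    old_vocab_size, dim = emb.weight.shape
    new_emb = nn.Embedding(new_vocab_size, dim)
    with torch.no_grad():
        # keep the pretrained rows unchanged
        new_emb.weight[:old_vocab_size] = emb.weight
        # initialize the added rows near the existing distribution
        new_emb.weight[old_vocab_size:] = emb.weight.mean(dim=0)
    return new_emb

# usage: extend a 32k LLaMA-style vocab to 50k (dim shrunk for illustration)
emb = resize_token_embeddings(nn.Embedding(32000, 64), 50000)
print(emb.weight.shape)  # torch.Size([50000, 64])
```

The same resize would also apply to the tied or separate output (LM head) projection, and the new rows only become meaningful after continued pre-training on the target-language corpus.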
Can we do the same with Falcon and other models using Lit-Parrot? Is there any script we can refer to for this journey?
Thanks, Steve