
pre-training recipe

Open enpassanty opened this issue 11 months ago • 4 comments

Any plans to add a recipe for further pre-training on custom data, with optional tokenizer vocab extension in the style of chinese-llama? Would love to see that. Thanks.

enpassanty avatar Jul 19 '23 06:07 enpassanty

Highly interested in this as well; for domain-specific pre-training before instruct/chat fine-tuning it would be very useful.

maximegmd avatar Jul 19 '23 15:07 maximegmd

Thanks for the suggestions; we welcome contributions from the community.

We also suggest testing out the training script from chinese-llama on the Llama 2 model. Based on the description, it could be done in two steps: fine-tune the base Llama 2 (pre-trained) model on the alpaca dataset, and then use the scripts from chinese-llama for the custom vocab. If you get a chance to try this out, it would be great if you could update this issue with your findings.
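As a rough sketch of the vocab-extension step (this is not the chinese-llama merge script itself; the token list and checkpoint name below are placeholders), extending the tokenizer and resizing the embeddings with Hugging Face transformers might look like:

```python
# Sketch only: placeholder tokens and model name, not the chinese-llama scripts.
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Domain-specific or new-language pieces you want in the vocabulary (placeholders).
new_tokens = ["<domain_token_1>", "<domain_token_2>"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding (and tied LM head) matrices so the new ids have rows to train.
model.resize_token_embeddings(len(tokenizer))

tokenizer.save_pretrained("llama-2-7b-extended")
model.save_pretrained("llama-2-7b-extended")
```

The new embedding rows are randomly initialized, which is one reason continued pre-training on a large in-domain corpus is usually done before any instruction fine-tuning.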

chauhang avatar Jul 20 '23 04:07 chauhang

Also very interested in this. If I wish to use the same tokenizer and just continue pretraining with in-domain data, is it sufficient to use the fine-tuning script in llama-recipes, e.g. following the samsum_dataset.py example, but feeding in free text instead of prompt/response pairs?

In essence, what is the difference between continued pretraining with in-domain data and fine-tuning on prompt/response data? Thanks.

jmzeng avatar Aug 17 '23 00:08 jmzeng

@jmzeng sorry for the late reply. There are a few things here: (a) fine-tuning a base model to make an assistant model, which requires a prompt/response dataset; (b) continued pretraining, which follows the same objective as the base model but needs a large amount of data, on the scale of billions of tokens; (c) fine-tuning a base or fine-tuned model on a specific task, which requires high-quality data for that task and can use a simple format or a prompt/response format.
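To make the difference between (a)/(c) and (b) concrete, here is a minimal sketch of how the training examples are typically built; the function names are illustrative and not part of llama-recipes, and the tokenizer name is a placeholder:

```python
# Sketch: how example construction usually differs between instruction
# fine-tuning and continued pretraining. Not llama-recipes APIs.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder

def instruction_example(prompt, response, max_len=512):
    """(a)/(c): prompt/response pair; loss is computed only on the response."""
    prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
    response_ids = tokenizer.encode(response, add_special_tokens=False) + [tokenizer.eos_token_id]
    input_ids = (prompt_ids + response_ids)[:max_len]
    # Mask the prompt tokens with -100 so they contribute no loss.
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}

def pretraining_examples(free_text, block_size=512):
    """(b): continued pretraining; pack free text and predict every token."""
    ids = tokenizer.encode(free_text, add_special_tokens=False)
    for start in range(0, len(ids) - block_size + 1, block_size):
        block = ids[start:start + block_size]
        yield {"input_ids": block, "labels": list(block)}  # labels == input_ids
```

In other words, the optimization objective is the same causal LM loss in both cases; what changes is how the data is formatted, which tokens are masked out of the loss, and the sheer volume of data needed for (b).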

HamidShojanazeri avatar Dec 13 '23 21:12 HamidShojanazeri

Closing this, as we will likely not add more datasets, and the use case of continued full-parameter updates is already supported and documented. For massively parallel pre-training, have a look at TorchTitan.

mreso avatar May 02 '24 04:05 mreso