
pre-training recipe

Open enpassanty opened this issue 11 months ago • 4 comments

Any plans to add a recipe for further pre-training on custom data, with optional tokenizer vocab extension in the style of chinese-llama? Would love to see that. Thanks.

enpassanty avatar Jul 19 '23 06:07 enpassanty

Highly interested in this as well; for domain-specific pre-training before instruct/chat fine-tuning it would be very useful.

maximegmd avatar Jul 19 '23 15:07 maximegmd

Thanks for the suggestions; we welcome contributions from the community.

We also suggest testing out the training script from chinese-llama on the Llama 2 model. Based on the description, it could be done in two steps: fine-tune the base Llama 2 (pre-trained) model on the alpaca dataset, and then use the scripts from chinese-llama for the custom vocab. If you get a chance to try this out, it would be great if you could update this issue with your findings.
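As a rough sketch of the vocab-extension step (this is not the chinese-llama merge script itself; the token list and checkpoint name below are placeholders), extending the tokenizer and resizing the embeddings with Hugging Face transformers might look like:

```python
# Sketch only: placeholder tokens and model name, not the chinese-llama scripts.
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Domain-specific or new-language pieces you want in the vocabulary (placeholders).
new_tokens = ["<domain_token_1>", "<domain_token_2>"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding (and tied LM head) matrices so the new ids have rows to train.
model.resize_token_embeddings(len(tokenizer))

tokenizer.save_pretrained("llama-2-7b-extended")
model.save_pretrained("llama-2-7b-extended")
```

The new embedding rows are randomly initialized, which is one reason continued pre-training on a large in-domain corpus is usually done before any instruction fine-tuning.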

chauhang avatar Jul 20 '23 04:07 chauhang

Also very interested in this. If I wish to use the same tokenizer and just continue pretraining with in-domain data, is it sufficient to use the fine-tuning script in llama-recipes, e.g. following the samsum_dataset.py example, but feeding in free text instead of prompt/response pairs?

In essence, what is the difference between continued pretraining with in-domain data and fine-tuning on prompt/response data? Thanks.

jmzeng avatar Aug 17 '23 00:08 jmzeng

@jmzeng sorry for the late reply. There are a few things here: (a) fine-tuning a base model to make an assistant model, which requires a prompt/response dataset; (b) continued pretraining, which follows the same objective as the base model but needs a large amount of data, on the scale of billions of tokens; (c) fine-tuning a base or fine-tuned model on a specific task, which requires high-quality data for that task and can use a simple format or a prompt/response format.
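To make the difference between (a)/(c) and (b) concrete, here is a minimal sketch of how the training examples are typically built; the function names are illustrative and not part of llama-recipes, and the tokenizer name is a placeholder:

```python
# Sketch: how example construction usually differs between instruction
# fine-tuning and continued pretraining. Not llama-recipes APIs.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder

def instruction_example(prompt, response, max_len=512):
    """(a)/(c): prompt/response pair; loss is computed only on the response."""
    prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
    response_ids = tokenizer.encode(response, add_special_tokens=False) + [tokenizer.eos_token_id]
    input_ids = (prompt_ids + response_ids)[:max_len]
    # Mask the prompt tokens with -100 so they contribute no loss.
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}

def pretraining_examples(free_text, block_size=512):
    """(b): continued pretraining; pack free text and predict every token."""
    ids = tokenizer.encode(free_text, add_special_tokens=False)
    for start in range(0, len(ids) - block_size + 1, block_size):
        block = ids[start:start + block_size]
        yield {"input_ids": block, "labels": list(block)}  # labels == input_ids
```

In other words, the optimization objective is the same causal LM loss in both cases; what changes is how the data is formatted, which tokens are masked out of the loss, and the sheer volume of data needed for (b).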

HamidShojanazeri avatar Dec 13 '23 21:12 HamidShojanazeri

Closing this, as we will likely not add more datasets, and the use case of continued full-parameter updates is already supported and documented. For massively parallel pre-training, have a look at TorchTitan.

mreso avatar May 02 '24 04:05 mreso