How do I prepare custom data as a pretraining dataset?

Open win10ogod opened this issue 1 year ago • 1 comments

Jan 14 '24 03:01 win10ogod

For pretraining, we have this guide here: https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/pretrain_tinyllama.md

It explains how to preprocess the datasets for training TinyLlama. The script that does the preprocessing is fairly short, so you can adapt it to your own use case. Take a look here: https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/prepare_slimpajama.py
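To give a rough idea of what adapting a preprocessing script to custom data can involve, here is a minimal, standard-library-only sketch of the general pattern: read text records, tokenize them, and pack the token IDs into fixed-size chunks. It is not the code from `prepare_slimpajama.py` and does not use the lit-gpt API. The function names (`iter_texts`, `pack_tokens`, `toy_encode`), the JSON Lines input with a `"text"` field, the end-of-sequence ID, and the choice to drop a trailing partial chunk are all assumptions made for illustration. In practice you would swap `toy_encode` for your real tokenizer and write the chunks out in whatever format the training script expects.

```python
import json
from typing import Callable, Iterable, Iterator, List


def iter_texts(lines: Iterable[str], key: str = "text") -> Iterator[str]:
    """Yield the text field of each non-empty JSON Lines record.

    Assumes one JSON object per line with a "text" field (hypothetical layout).
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)[key]


def pack_tokens(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    eos_id: int,
    chunk_size: int,
) -> Iterator[List[int]]:
    """Tokenize documents, separate them with eos_id, and yield full chunks."""
    buffer: List[int] = []
    for text in texts:
        buffer.extend(encode(text))
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    # Design choice of this sketch: a trailing partial chunk is dropped.


def toy_encode(text: str) -> List[int]:
    """Placeholder tokenizer (one ID per character); replace with a real one."""
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    sample_lines = ['{"text": "hello"}', '{"text": "custom data"}']
    for chunk in pack_tokens(iter_texts(sample_lines), toy_encode, eos_id=0, chunk_size=8):
        print(chunk)
```

For a real dataset you could pass an open file handle as `lines`, since iterating over a file yields one line at a time.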

Jan 22 '24 14:01 awaelchli

We also have new documentation here: https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain.md which might be helpful. Please don't hesitate to reach out if you have additional questions or need help.

Apr 18 '24 19:04 rasbt