How to use custom data when preparing pretraining datasets?
For pretraining, we have this guide here: https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/pretrain_tinyllama.md
It explains how to preprocess the datasets for training TinyLlama. The preprocessing script is quite short, so you should be able to adapt it to your own use case. Take a look here: https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/prepare_slimpajama.py
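To give a rough idea of what such a preprocessing step involves, here is a minimal, library-free sketch. It is not the code from `prepare_slimpajama.py` and does not use the lit-gpt API. It assumes your data is stored as JSON Lines with a `text` field, and it uses a toy byte-level tokenizer as a stand-in for your real one. All function names, the `EOS_ID` value, and the output format are hypothetical and only for illustration.

```python
import json
from array import array
from pathlib import Path
from typing import Callable, Iterable, Iterator, List

# Hypothetical end-of-document id for the toy tokenizer below (bytes use 0..255).
EOS_ID = 256


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: UTF-8 bytes as token ids. Replace with your real tokenizer."""
    return list(text.encode("utf-8"))


def read_jsonl_texts(path: Path, key: str = "text") -> Iterator[str]:
    """Yield the text field of each non-empty line in a JSON Lines file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)[key]


def pack_into_blocks(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]],
    block_size: int,
    eos_id: int = EOS_ID,
) -> Iterator[List[int]]:
    """Tokenize documents, separate them with eos_id, and emit fixed-size blocks."""
    buffer: List[int] = []
    for text in texts:
        buffer.extend(tokenize(text))
        buffer.append(eos_id)
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Leftover tokens shorter than block_size are dropped in this sketch.


def write_blocks(blocks: Iterable[List[int]], out_path: Path) -> int:
    """Append each block to a flat binary file of unsigned ints; return the block count."""
    count = 0
    with out_path.open("wb") as f:
        for block in blocks:
            array("I", block).tofile(f)
            count += 1
    return count
```

With your own data, you would call something like `write_blocks(pack_into_blocks(read_jsonl_texts(Path("my_data.jsonl")), toy_tokenize, block_size=2048), Path("train.bin"))`, where the file names and block size are placeholders. For actual pretraining with lit-gpt, the output has to match the format the training script expects, so treat the linked script as the reference and this sketch only as an outline of the tokenize-and-pack idea.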
We also have newer documentation here, which might be helpful: https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain.md. Please don't hesitate to reach out if you have additional questions or need help.