Add TinyStories to the pretraining docs
As far as I know, one can pretrain fine using TinyStories:
```bash
litgpt/pretrain.py --data litgpt.data.TinyStories
```
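For a quick end-to-end smoke test, it can be paired with a tiny model. A minimal sketch, assuming the script accepts a `--model_name` override (both the flag and the `pythia-14m` choice are illustrative here, not verified against this branch):

```bash
# Illustrative: tiny model + TinyStories for a fast sanity run
# (--model_name is an assumed flag; adjust to whatever pretrain.py actually exposes)
litgpt/pretrain.py --model_name pythia-14m --data litgpt.data.TinyStories
```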
Should we add this to the documentation?
Right now, we only have a `pretrain_tinyllama.md` doc, which covers pretraining on SlimPajama and StarCoder.
What I propose is:

- create a general `pretraining.md`
- inside this document:
  - explain the general pretraining script
  - add the `pretrain_tinyllama.md` contents as a section there
  - add a TinyStories section
What do you think?
Overall sounds good to me. This dataset is mainly for debugging. We could replace the "debug" config in https://github.com/Lightning-AI/litgpt/tree/wip/config_hub/pretrain with it, but it might be better to address https://github.com/Lightning-AI/litgpt/issues/1085 first.
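If the debug config were swapped over, the run could stay config-driven. A rough sketch, assuming `pretrain.py` accepts a `--config` file plus CLI overrides (typical jsonargparse behavior, but an assumption for this script):

```bash
# Hypothetical: reuse the existing debug config, but train on TinyStories instead
litgpt/pretrain.py --config config_hub/pretrain/debug.yaml --data litgpt.data.TinyStories
```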
Yes, exactly. I just wrote in the other issue:

> Btw I think having something like TinyStories is super valuable for trying things out. The other datasets (1.2T tokens!) are much too large unless you are serious and committed to doing a big pretraining run.
The use case is internal testing, but also, more generally, having a simple template for trying things out, and possibly for users bringing their own custom dataset.
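For the bring-your-own-dataset case, the same entry point could be pointed at a folder of plain-text files. A sketch assuming a data module along the lines of `litgpt.data.TextFiles` (present in later litgpt versions; the module and its `train_data_path` parameter are assumptions for this branch):

```bash
# Hypothetical: pretrain on a user-supplied directory of .txt files
litgpt/pretrain.py --data litgpt.data.TextFiles --data.train_data_path path/to/custom_texts/
```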
Adrian suggests doing this together with a Studio that includes the pretokenized data.