Add TinyStories to the pretraining docs
As far as I know, one can pretrain fine using TinyStories:
```bash
litgpt/pretrain.py --data litgpt.data.TinyStories
```
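For a quick end-to-end smoke test, it can be paired with a tiny model. A minimal sketch, assuming the script accepts a `--model_name` override (both the flag and the `pythia-14m` choice are illustrative here, not verified against this branch):

```bash
# Illustrative: tiny model + TinyStories for a fast sanity run
# (--model_name is an assumed flag; adjust to whatever pretrain.py actually exposes)
litgpt/pretrain.py --model_name pythia-14m --data litgpt.data.TinyStories
```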
Should we add this to the documentation?
Right now, we only have a `pretrain_tinyllama.md` doc, which covers pretraining on SlimPajama and StarCoder.
What I propose is:

- create a general `pretraining.md`
- inside this document:
  - explain the general pretraining script
  - add the `pretrain_tinyllama.md` contents as a section there
  - add a TinyStories section
What do you think?
Overall sounds good to me. This dataset is mainly for debugging. We could replace the "debug" config in https://github.com/Lightning-AI/litgpt/tree/wip/config_hub/pretrain with it, but it might be better to address https://github.com/Lightning-AI/litgpt/issues/1085 first.
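If the debug config were swapped over, the run could stay config-driven. A rough sketch, assuming `pretrain.py` accepts a `--config` file plus CLI overrides (typical jsonargparse behavior, but an assumption for this script):

```bash
# Hypothetical: reuse the existing debug config, but train on TinyStories instead
litgpt/pretrain.py --config config_hub/pretrain/debug.yaml --data litgpt.data.TinyStories
```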
Yes, exactly. I just wrote in the other issue:

> Btw I think having something like TinyStories is super valuable for trying things out. The other datasets (1.2T tokens!) are much too large unless you are serious and committed to doing a big pretraining run.
The use case is internal testing, but also, more generally, having a simple template for trying things out, and possibly for users bringing their own custom dataset.
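For the bring-your-own-dataset case, the same entry point could be pointed at a folder of plain-text files. A sketch assuming a data module along the lines of `litgpt.data.TextFiles` (present in later litgpt versions; the module and its `train_data_path` parameter are assumptions for this branch):

```bash
# Hypothetical: pretrain on a user-supplied directory of .txt files
litgpt/pretrain.py --data litgpt.data.TextFiles --data.train_data_path path/to/custom_texts/
```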
Adrian suggests doing this together with a Studio that includes the pretokenized data.