
Add TinyStories to the pretraining docs

Open rasbt opened this issue 1 year ago • 3 comments

As far as I know, one can already pretrain just fine using TinyStories:

litgpt/pretrain.py --data litgpt.data.TinyStories 

Should we add this to the documentation?

Right now, we only have a pretrain_tinyllama.md doc that pretrains on SlimPajama and StarCoder.

What I propose is:

  • create a general pretraining.md
  • inside this document
    • explain the general pretraining script
    • add the pretrain_tinyllama.md contents as a section there
    • add a TinyStories section

What do you think?

rasbt avatar Mar 11 '24 19:03 rasbt

Overall this sounds good to me. This dataset is mainly for debugging. We could replace the "debug" config in https://github.com/Lightning-AI/litgpt/tree/wip/config_hub/pretrain with it, but it might be better to address https://github.com/Lightning-AI/litgpt/issues/1085 first.

carmocca avatar Mar 11 '24 22:03 carmocca

Yes exactly. I just wrote in the other issue:

Btw I think having something like TinyStories is super valuable for trying things out. The other datasets (1.2T!) are much too large unless you are serious and committed to doing a big pretraining run.

The use case is internal testing, but also, more generally, having a simple template for trying things out, and for users who want to bring their own custom dataset.
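
To make the "bring your own custom dataset" idea concrete, here is an illustrative sketch (not the litgpt API; all names here are made up for demonstration) of the usual pretraining data prep: tokenize a small text corpus, concatenate the token stream, and chop it into fixed-length blocks, dropping the trailing remainder.

```python
# Illustrative sketch only: a stand-in "tokenizer" (whitespace split mapped
# to integer ids) plus block packing, mimicking how pretraining data loaders
# typically prepare fixed-length inputs. None of this is litgpt code.

from typing import Iterable


def toy_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Stand-in tokenizer: split on whitespace, growing the vocab on the fly."""
    ids = []
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab)
        ids.append(vocab[word])
    return ids


def pack_into_blocks(token_ids: Iterable[int], block_size: int) -> list[list[int]]:
    """Concatenate a token stream and chop it into equal-sized blocks,
    dropping the trailing remainder (standard pretraining practice)."""
    ids = list(token_ids)
    n_blocks = len(ids) // block_size
    return [ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


if __name__ == "__main__":
    stories = [
        "once upon a time there was a tiny model",
        "the tiny model trained on tiny stories",
    ]
    vocab: dict[str, int] = {}
    stream = [tid for story in stories for tid in toy_tokenize(story, vocab)]
    blocks = pack_into_blocks(stream, block_size=4)
    print(len(blocks))  # 16 tokens total -> 4 full blocks of 4
```

A real data module would use the model's actual tokenizer and stream from disk, but the packing step is the part a custom-dataset template needs to show.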

rasbt avatar Mar 11 '24 22:03 rasbt

Adrian suggests doing this together with a Studio that includes the pretokenized data.

carmocca avatar Mar 14 '24 16:03 carmocca