min-LLM
High Level Plan for the Journey!
I'd like to document my current thinking on how I'll get to a final set of pre-trained weights for a large(ish) transformer model.
The plan will probably need multiple edits, and just from initial conversations with my epic friend @blefaudeux I've already learnt a lot about the right and wrong approaches.
Exploration
- Understand what I'd like to model, and the data. Do we want to train a pure-English model, or something multi-lingual like BigScience?
  i. How many tokens do we require to train a model?
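For the token-count question, one rough starting point is the ~20 training tokens per model parameter heuristic popularized by the Chinchilla scaling-law work. A quick sketch (the 20:1 ratio is an assumption borrowed from that paper, not something decided by this project):

```python
# Rough token-budget estimate using the ~20 tokens-per-parameter
# heuristic from the Chinchilla scaling laws (an assumption, not a
# constraint of this project).
TOKENS_PER_PARAM = 20

def token_budget(n_params: float) -> float:
    """Approximate number of training tokens for a compute-optimal run."""
    return TOKENS_PER_PARAM * n_params

for n in (125e6, 1.3e9, 6.7e9):
    print(f"{n / 1e9:.3f}B params -> ~{token_budget(n) / 1e9:.0f}B tokens")
```

So even a modest ~1B-parameter model wants tens of billions of tokens, which immediately constrains the dataset choice above.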
- What is the model we're going to train?
  i. Given how popular the decoder-only Transformer is, I think this is a no-brainer.
- Understand how to estimate how long we'll need to train, for a variety of model sizes. The training regime for transformer models seems relatively fixed in terms of hyper-parameters and schedule (and maybe I'm horribly wrong here!)
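To make "how long" concrete, the usual back-of-envelope is total training compute C ≈ 6·N·D FLOPs (N parameters, D tokens), divided by sustained accelerator throughput. A sketch, where the A100 bf16 peak of 312 TFLOP/s and the 40% utilization figure are illustrative assumptions, not measurements:

```python
def training_days(n_params, n_tokens, n_gpus=8,
                  peak_flops=312e12, mfu=0.40):
    """Back-of-envelope wall-clock estimate for a training run.

    Uses the standard C ~= 6 * N * D approximation for total training
    FLOPs. peak_flops defaults to an A100's bf16 peak, and mfu (model
    FLOPs utilization) is a guessed 40% -- both are assumptions.
    """
    total_flops = 6 * n_params * n_tokens
    flops_per_sec = n_gpus * peak_flops * mfu
    return total_flops / flops_per_sec / 86_400  # seconds -> days

# e.g. a 1.3B-parameter model on ~26B tokens, on one 8-GPU machine
print(f"~{training_days(1.3e9, 26e9):.1f} days")
```

The point of the exercise is less the exact number and more seeing how quickly the estimate blows up as N and D grow, which feeds directly into the one-machine constraint below.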
- Understand the optimizations (xformers + DeepSpeed) that will help us reach optimal training times.
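On the DeepSpeed side, much of the win at single-machine scale comes from ZeRO partitioning plus mixed precision. A minimal config sketch (the stage, batch sizes, and flags here are placeholders to show the shape of the file, not tuned values):

```json
{
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 4,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```

xformers would slot in separately at the model level (e.g. its memory-efficient attention kernels), orthogonal to the DeepSpeed config above.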
Resources
- What resources do I have? What should the constraint be? Can we potentially get sponsors?
I believe (after speaking to @blefaudeux) that we should really try to limit ourselves to one machine. One reason is that once the inter-node interconnect is involved, things become complicated; besides, how often do people have access to a multi-node setup that's ready to go? I for one know the pains of setting up multi-node hardware and software, and it's still a work in progress!