High Level Plan for the Journey!

SeanNaren opened this issue on Apr 12, 2022 · 0 comments

I'd like to document my current thinking on how I'll get to a final set of pre-trained weights for a large(ish) transformer model.

The plan will probably need multiple edits, and just from initial conversations with my epic friend @blefaudeux I've learnt a lot about the right/wrong approaches.

Exploration

  1. Understand what I'd like to model and the data. Do we want to train a pure English model? Something multi-lingual like BigScience?
     i. How many tokens do we require to train a model?
  2. What is the model we're going to train?
     i. Given how popular the decoder-only Transformer is, I think this is a no-brainer
  3. Understand how to measure how long we'll need to train models of a variety of sizes (see the back-of-the-envelope sketch after this list). Training transformer models seems to be relatively fixed in terms of training regime, hyper-parameters and whatnot (and maybe I'm horribly wrong here!)
  4. Understand the optimizations (xformers + DeepSpeed) to help us reach optimum training times
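
To put rough numbers on items 1.i and 3, a back-of-the-envelope sketch along these lines is probably enough to start with. It assumes the Chinchilla-style heuristic of roughly 20 training tokens per parameter and the common C ≈ 6·N·D approximation for total training FLOPs; the GPU throughput and utilisation figures below are placeholders, not measurements.

```python
# Rough sizing sketch: tokens, FLOPs and wall-clock time for a dense
# decoder-only transformer. All constants here are assumptions/placeholders.

def required_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla-style heuristic: ~20 training tokens per parameter."""
    return n_params * tokens_per_param


def training_flops(n_params: float, n_tokens: float) -> float:
    """Common approximation for dense transformer training: C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


def wall_clock_days(total_flops: float, peak_flops_per_gpu: float,
                    n_gpus: int, utilisation: float = 0.4) -> float:
    """Days of training at a given hardware peak and assumed utilisation."""
    seconds = total_flops / (peak_flops_per_gpu * n_gpus * utilisation)
    return seconds / 86_400


if __name__ == "__main__":
    n_params = 1.3e9                      # e.g. a ~1.3B parameter model
    n_tokens = required_tokens(n_params)
    flops = training_flops(n_params, n_tokens)
    # Placeholder hardware: 8 GPUs at ~312 TFLOP/s peak (bf16), 40% utilisation.
    days = wall_clock_days(flops, peak_flops_per_gpu=312e12, n_gpus=8)
    print(f"tokens ~ {n_tokens:.2e}, FLOPs ~ {flops:.2e}, ~{days:.1f} days")
```

The point isn't precision, it's a quick sanity check of which model sizes are even feasible on the hardware we end up with.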

Resources

  1. What resources do I have? What should the constraint be? Can we potentially get sponsors?

I believe (especially after speaking to @blefaudeux) that we should really try to limit ourselves to one machine. One reason is that once the inter-node interconnect is involved things become complicated, and how often do people have access to multi-node setups that are ready to go? I for one know the pains of setting up multi-node hardware and software, and it's still a work in progress!
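
On the optimization side (item 4 above), staying on one machine still leaves plenty of room for DeepSpeed to help, since ZeRO shards optimizer state and gradients across the GPUs within the node. Purely as an illustration (the values are placeholders, not tuned for this project), a starting DeepSpeed config might look something like this:

```python
# Illustrative DeepSpeed configuration for single-node training.
# The keys are standard DeepSpeed config options, but the values are
# placeholders; the dict would be handed to deepspeed.initialize(...)
# along with the model and its parameters.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # placeholder micro-batch size
    "gradient_accumulation_steps": 16,     # effective batch = 8 * 16 * n_gpus
    "fp16": {"enabled": True},             # mixed precision
    "zero_optimization": {
        "stage": 2,                        # shard optimizer state + gradients
        "overlap_comm": True,              # overlap communication with backward
        "contiguous_gradients": True,
    },
    "gradient_clipping": 1.0,
}
```

xformers slots in separately, it swaps the attention implementation inside the model rather than touching the training loop, so the two should compose cleanly.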
