DeepSpeedExamples
The problem of model parallelism in training
In most cases, I think we need model parallelism more than data parallelism. It would be very helpful if the model could be trained with model parallelism, because current models are so large that a single 80 GB GPU cannot hold the complete set of actor, reference, critic, and reward models. For example, I used llama 7B as the actor, reference, critic, and reward model, and the four models could not fit across three A100 80G GPUs; training failed with an OOM error.
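A rough back-of-the-envelope sketch illustrates why this OOMs. Assuming fp16 weights and the usual mixed-precision Adam accounting from the ZeRO paper (about 16 bytes per parameter for a trained model: fp16 weights and gradients plus fp32 master weights, momentum, and variance), while the frozen reference and reward models only hold fp16 weights, four 7B models already approach the 240 GB total of three A100 80G cards before counting activations or generation KV caches. The per-model roles below follow the DeepSpeed-Chat RLHF setup; the exact byte counts are an estimate, not a measurement.

```python
# Back-of-envelope memory estimate for RLHF with four 7B models (assumption:
# fp16 weights, mixed-precision Adam ~16 bytes/param for trained models).
GB = 1024 ** 3
params = 7e9

def trainable_gb(n):
    # fp16 weights + fp16 grads (4 B) + fp32 master weights, momentum,
    # variance for Adam (12 B) = ~16 bytes per parameter
    return n * 16 / GB

def frozen_gb(n):
    # inference-only models keep just fp16 weights (2 bytes per parameter)
    return n * 2 / GB

actor = trainable_gb(params)      # trained with PPO
critic = trainable_gb(params)     # trained value model
reference = frozen_gb(params)     # frozen KL reference
reward = frozen_gb(params)        # frozen reward model

total = actor + critic + reference + reward
print(f"trainable model each ~{actor:.0f} GB, frozen model each ~{reference:.0f} GB")
print(f"total ~{total:.0f} GB of model states, before activations and KV caches")
```

With this accounting the model states alone come to roughly 235 GB, so three 80 GB GPUs leave almost no headroom for activations, which is consistent with the OOM unless the states are partitioned (e.g. ZeRO stage 3) or offloaded.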