DeepSpeed not using all available GPUs
Describe the bug
Hello, I'm trying to train a 13B language model on 2 A100 80GB GPUs using ZeRO-2, but DeepSpeed doesn't seem to be using both GPUs for training: CUDA immediately runs out of memory after loading the model and starting the training loop.
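For context, the ZeRO-2 setup is roughly along the lines of the sketch below; this is a minimal illustration with placeholder values and a stand-in model, not the actual contents of finetune_model.py:

```python
import torch
import deepspeed

# Stand-in model (hypothetical); the real 13B model is loaded inside
# finetune_model.py, which isn't shown here.
model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 2,  # ZeRO-2: partition optimizer states and gradients across GPUs
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# deepspeed.initialize builds the distributed engine; under the deepspeed
# launcher, each rank should bind to its own GPU.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```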
Expected behavior
160 GB of total GPU memory should be more than enough to train a 13B model, especially in a ZeRO-2 setting.
ds_report output
Here's the ds_report output:
Screenshots
Here's a screenshot of the initial DeepSpeed logs, which indicate that two CUDA devices are indeed set:
Here's a screenshot of the CUDA memory error, which indicates that not even the first GPU is used to its full capacity:
System info (please complete the following information):
- GPU count and types: one machine with 2 A100s
- Environment: Jupyter notebook
Launcher context
The experiment is launched using the deepspeed launcher (`deepspeed finetune_model.py`); all other arguments, including the config file, are set by default in the training file.
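For reference, a minimal sanity check along these lines (purely illustrative, not code from finetune_model.py) is what I would use to confirm that both ranks actually come up under the launcher:

```python
import os
import torch
import deepspeed

# Purely illustrative: under `deepspeed finetune_model.py` on this machine I
# would expect two processes, each with its own LOCAL_RANK, and two visible
# GPUs per process.
deepspeed.init_distributed()
print(
    f"rank {torch.distributed.get_rank()}/{torch.distributed.get_world_size()}, "
    f"LOCAL_RANK={os.environ.get('LOCAL_RANK')}, "
    f"visible GPUs={torch.cuda.device_count()}"
)
```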
Docker context
Are you using a specific docker image that you can share? No
Your help is most appreciated! x(