
DeepSpeed not using all available GPUs

sarrahbbh opened this issue on May 11, 2023 · 0 comments

Describe the bug
Hello, I'm trying to train a 13B language model on 2 A100 80GB GPUs using ZeRO-2, but DeepSpeed doesn't seem to be using both GPUs for training: CUDA runs out of memory immediately after the model is loaded and the training loop starts.

Expected behavior
160GB of combined GPU memory should be more than enough to train a 13B model, especially in a ZeRO-2 setting.
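
For reference, here is the back-of-envelope accounting behind that expectation. This is only a sketch: the 2/2/12 bytes-per-parameter split (fp16 weights, fp16 gradients, fp32 Adam states) follows the ZeRO paper and is an assumption about the training setup, and activation memory is ignored entirely:

```python
# Rough per-GPU memory for ZeRO stage 2, assuming fp16 Adam training:
# 2 bytes/param fp16 weights (replicated on every GPU)
# + 2 bytes/param fp16 gradients and 12 bytes/param fp32 Adam states,
#   both sharded across GPUs under stage 2.
# Activations and fragmentation are ignored, so this is a floor.
params = 13e9   # 13B parameters
n_gpus = 2

replicated = 2 * params                    # fp16 weights on every GPU
sharded = (2 + 12) * params / n_gpus       # grads + Adam states, split across GPUs

print(f"~{(replicated + sharded) / 2**30:.0f} GiB per GPU before activations")
# -> ~109 GiB per GPU under these assumptions
```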

ds_report output
Here's the ds_report output: [screenshot: ds_report]

Screenshots
Here's a screenshot of the initial DeepSpeed logs, which indicate that two CUDA devices are indeed set: [screenshot: logs]

Here's a screenshot of the CUDA out-of-memory error, which indicates that not even the first GPU is used to its full capacity: [screenshot: cuda]
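
To rule out the launcher silently running a single process, a minimal sanity check could print what each rank actually sees. This is a sketch only; it assumes finetune_model.py calls deepspeed.initialize(), and the placement is an assumption since the script isn't shown here:

```python
# Drop-in check after deepspeed.initialize() in finetune_model.py
# (placement is an assumption; the script isn't shown in this issue).
import torch
import torch.distributed as dist

if dist.is_initialized():
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} "
          f"-> cuda:{torch.cuda.current_device()} "
          f"({torch.cuda.get_device_name()})")
else:
    print("torch.distributed is NOT initialized; running as a single process")
# With both GPUs in use you'd expect two lines: rank 0/2 and rank 1/2.
```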

System info (please complete the following information):

  • GPU count and types: one machine with 2 A100s
  • Environment: Jupyter notebook

Launcher context
The experiment is launched with the deepspeed launcher (deepspeed finetune_model.py; all other arguments, including the config file, are set by default inside the training script).
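
A minimal sketch of what that wiring typically looks like is below; the ds_config values are placeholders (the actual config lives inside the training file) and the tiny Linear layer stands in for the 13B model:

```python
# Minimal sketch only: ds_config values are placeholders, and the real
# script loads a 13B model instead of this stand-in layer.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

model = torch.nn.Linear(1024, 1024)  # stand-in for the 13B model

# Launched as: deepspeed --num_gpus=2 finetune_model.py
# The launcher spawns one process per GPU and sets LOCAL_RANK/WORLD_SIZE.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
print("world size:", torch.distributed.get_world_size())
```

Note that if the training code were executed directly in a notebook cell rather than through the deepspeed launcher, only one process would exist and only one GPU would be used, which would match the symptom above.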

Docker context
Are you using a specific Docker image that you can share? No.

Your help is most appreciated! x(

sarrahbbh · May 11 '23 12:05