[BUG] DeepSpeed tries to allocate memory on GPU 0 even though --include was set to localhost:3,5
DeepSpeed (0.9.0) was started with --include localhost:3,5 to force training to use GPU 3 and GPU 5. But the training still failed because it tried to allocate memory on GPU 0 and GPU 1. Any idea why?
2023-04-15 22:25:47 INFO [__main__] [2023-04-15 22:25:47,140] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
2023-04-15 22:25:47 INFO [__main__] [2023-04-15 22:25:47,206] [INFO] [runner.py:540:main] cmd = /home/tmatup/anaconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMywgNV19 --master_addr=127.0.0.1 --master_port=29500 --module --enable_each_rank_log=None training.trainer --deepspeed /home/tmatup/root/dolly/config/ds_z3_bf16_config.json --epochs 1 --local-output-dir /home/tmatup/models/dolly/training/dolly__1681622742 --per-device-train-batch-size 2 --per-device-eval-batch-size 2 --lr 1e-5
2023-04-15 22:25:49 INFO [__main__] [2023-04-15 22:25:49,992] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [3, 5]}
2023-04-15 22:25:49 INFO [__main__] [2023-04-15 22:25:49,992] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=2, node_rank=0
2023-04-15 22:25:49 INFO [__main__] [2023-04-15 22:25:49,992] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
2023-04-15 22:25:49 INFO [__main__] [2023-04-15 22:25:49,993] [INFO] [launch.py:247:main] dist_world_size=2
2023-04-15 22:25:49 INFO [__main__] [2023-04-15 22:25:49,993] [INFO] [launch.py:249:main] Setting **CUDA_VISIBLE_DEVICES=3,5**
...
2023-04-15 22:29:36 ERROR [__main__] Traceback (most recent call last):
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/root/dolly/training/trainer.py", line 274, in <module>
2023-04-15 22:29:36 ERROR [__main__] main()
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/anaconda3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
2023-04-15 22:29:36 ERROR [__main__] return self.main(*args, **kwargs)
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/anaconda3/lib/python3.9/site-packages/click/core.py", line 1053, in main
2023-04-15 22:29:36 ERROR [__main__] rv = self.invoke(ctx)
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/anaconda3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
2023-04-15 22:29:36 ERROR [__main__] return ctx.invoke(self.callback, **ctx.params)
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/anaconda3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
2023-04-15 22:29:36 ERROR [__main__] return __callback(*args, **kwargs)
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/root/dolly/training/trainer.py", line 266, in main
2023-04-15 22:29:36 ERROR [__main__] train(**kwargs)
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/root/dolly/training/trainer.py", line 231, in train
2023-04-15 22:29:36 ERROR [__main__] trainer.train()
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1527, in train
2023-04-15 22:29:36 ERROR [__main__] return inner_training_loop(
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1596, in _inner_training_loop
2023-04-15 22:29:36 ERROR [__main__] deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/anaconda3/lib/python3.9/site-packages/transformers/deepspeed.py", line 344, in deepspeed_init
2023-04-15 22:29:36 ERROR [__main__] deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/anaconda3/lib/python3.9/site-packages/deepspeed/__init__.py", line 156, in initialize
2023-04-15 22:29:36 ERROR [__main__] engine = DeepSpeedEngine(args=args,
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 328, in __init__
2023-04-15 22:29:36 ERROR [__main__] self._configure_optimizer(optimizer, model_parameters)
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1187, in _configure_optimizer
2023-04-15 22:29:36 ERROR [__main__] self.optimizer = self._configure_zero_optimizer(basic_optimizer)
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1465, in _configure_zero_optimizer
2023-04-15 22:29:36 ERROR [__main__] optimizer = DeepSpeedZeroOptimizer_Stage3(
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 304, in __init__
2023-04-15 22:29:36 ERROR [__main__] self._setup_for_real_optimizer()
2023-04-15 22:29:36 ERROR [__main__] File "/home/tmatup/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 379, in _setup_for_real_optimizer
2023-04-15 22:29:36 ERROR [__main__] grad_partitions_flat_buffer: Tensor = torch.zeros(sum(p.partition_numel() for p in all_params),
2023-04-15 22:29:36 ERROR [__main__] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.64 GiB (**GPU 0**; 47.54 GiB total capacity; 39.59 GiB already allocated; 5.26GiB free; 41.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Hi @tmatup, within DeepSpeed we control which devices are visible by setting the CUDA_VISIBLE_DEVICES environment variable, as you can see in the final line of your log. The practical impact is that the visible devices are re-indexed (linearized) starting from 0.
In the Python session below, I make GPU 1 the only visible device on the system. Because of this linearization, from PyTorch's perspective there is only one visible device, and that device has index 0.
After allocating a tensor and checking nvidia-smi, I can see that even though PyTorch logically thinks it is creating the tensor on GPU 0, the physical GPU that receives the allocation is GPU 1.
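A minimal sketch of such a session, assuming a machine with at least two GPUs; only standard torch.cuda calls are used and the commented output values are illustrative:

```python
import os

# Restrict visibility to physical GPU 1 before CUDA is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

print(torch.cuda.device_count())    # -> 1: only one device is visible
print(torch.cuda.current_device())  # -> 0: the visible device is re-indexed to 0

# Allocate a tensor on what PyTorch calls "cuda:0"...
x = torch.zeros(1024, 1024, device="cuda:0")

# ...but running `nvidia-smi` in another shell shows the memory
# usage on physical GPU 1, not GPU 0.
```

The same re-indexing applies to your run: with CUDA_VISIBLE_DEVICES=3,5, the "GPU 0" and "GPU 1" in the OOM message are physical GPUs 3 and 5, so the error indicates those GPUs genuinely ran out of memory rather than DeepSpeed touching the wrong devices.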

Closing for lack of activity. Please re-open if needed.