DeepSpeed
[BUG] High memory usage on first GPU, despite perfectly-balanced stages in pipeline
Describe the bug
When using pipelining (with or without LayerSpec inside PipelineModule), the first GPU shows considerably higher memory consumption than the others. This is visible even for perfectly balanced models (like the model I attach below).
To Reproduce
- extract the train loop in `train.py`, the ML model and dataset in `benchmark.py`, and the ds config in `ds_config.json`, all zipped inside `code.zip`. This is a model of 2048 very simple, perfectly balanced linear layers (a minimal sketch of such a setup is shown after this list);
- run `deepspeed --num_gpus=8 train.py --deepspeed --deepspeed_config ds_config.json --pipeline_num_stages 8 --pipeline_spec_layers`. This runs the memory-efficient (`LayerSpec`-based) pipeline implementation. Remove `--pipeline_spec_layers` to run the non-`LayerSpec` implementation; the issue is still visible;
- launch with `--pipeline_num_stages X` to set the number of stages $X \in \{2, 4, 8\}$; the issue is still visible;
- confirm that stages are memory-balanced:
[2023-10-09 10:57:02,180] [INFO] [engine.py:151:__init__] RANK=3 STAGE=3 LAYERS=512 [1536, 2048) STAGE_PARAMS=16842752 (16.843M)
[2023-10-09 10:57:02,180] [INFO] [engine.py:151:__init__] RANK=6 STAGE=6 LAYERS=512 [3072, 3584) STAGE_PARAMS=16842752 (16.843M)
[2023-10-09 10:57:02,180] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=512 [0, 512) STAGE_PARAMS=16842752 (16.843M)
[2023-10-09 10:57:02,180] [INFO] [engine.py:151:__init__] RANK=1 STAGE=1 LAYERS=512 [512, 1024) STAGE_PARAMS=16842752 (16.843M)
[2023-10-09 10:57:02,180] [INFO] [engine.py:151:__init__] RANK=2 STAGE=2 LAYERS=512 [1024, 1536) STAGE_PARAMS=16842752 (16.843M)
[2023-10-09 10:57:02,180] [INFO] [engine.py:151:__init__] RANK=5 STAGE=5 LAYERS=512 [2560, 3072) STAGE_PARAMS=16842752 (16.843M)
[2023-10-09 10:57:02,180] [INFO] [engine.py:151:__init__] RANK=4 STAGE=4 LAYERS=512 [2048, 2560) STAGE_PARAMS=16842752 (16.843M)
[2023-10-09 10:57:02,180] [INFO] [engine.py:151:__init__] RANK=7 STAGE=7 LAYERS=512 [3584, 4096) STAGE_PARAMS=16842752 (16.843M)
- on a different terminal, run `watch nvidia-smi` and check the memory usage across GPUs when training starts.
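For context, since the actual `train.py`/`benchmark.py`/`ds_config.json` are only available inside `code.zip`, here is a minimal sketch of the kind of setup described above. The hidden size, dataset, loss and argument handling are assumptions for illustration only, not the attached code:

```python
# Hypothetical sketch of the reported setup (NOT the attached code.zip):
# 2048 identical Linear layers expressed as LayerSpecs, so each pipeline
# stage only instantiates the layers it owns.
import argparse
import torch
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec

HIDDEN = 128        # assumed layer width (the real width is in benchmark.py)
NUM_LAYERS = 2048   # as in the report: 2048 identical layers


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    parser.add_argument("--pipeline_num_stages", type=int, default=8)
    parser = deepspeed.add_config_arguments(parser)  # adds --deepspeed, --deepspeed_config
    args = parser.parse_args()

    deepspeed.init_distributed()

    # One LayerSpec per layer: construction is deferred until partitioning,
    # so with 8 stages each rank should only build 512 of the 2048 layers.
    specs = [LayerSpec(nn.Linear, HIDDEN, HIDDEN) for _ in range(NUM_LAYERS)]
    model = PipelineModule(
        layers=specs,
        num_stages=args.pipeline_num_stages,
        partition_method="parameters",  # balance stages by parameter count
        loss_fn=nn.MSELoss(),
    )

    # Toy dataset: (input, target) pairs of the assumed width.
    samples = [(torch.randn(HIDDEN), torch.randn(HIDDEN)) for _ in range(1024)]

    engine, _, _, _ = deepspeed.initialize(
        args=args,
        model=model,
        model_parameters=[p for p in model.parameters() if p.requires_grad],
        training_data=samples,  # PipelineEngine builds its own data loader
    )

    for _ in range(20):
        engine.train_batch()


if __name__ == "__main__":
    main()
```

In the reported runs, even though the `STAGE_PARAMS` log above confirms every rank holds exactly 16.843M parameters, `nvidia-smi` shows noticeably more memory in use on GPU 0.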
Expected behaviour
Running `nvidia-smi` outputs the per-GPU memory usage values; a large difference between GPU 0 and the others is visible.
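As a complementary check (not part of the original report), per-rank allocator statistics can be printed from inside the training loop; if the gap also shows up in `torch.cuda.memory_allocated()` / `memory_reserved()` on rank 0, the extra memory is being allocated by the rank-0 process itself rather than by some other process that happens to land on GPU 0. `log_gpu_memory` below is a hypothetical helper, not part of the attached code:

```python
# Hypothetical diagnostic helper: print per-rank CUDA allocator statistics
# so the imbalance can be quantified in-process.
import torch
import torch.distributed as dist


def log_gpu_memory(tag: str) -> None:
    rank = dist.get_rank() if dist.is_initialized() else 0
    mib = 2 ** 20
    print(
        f"[{tag}] rank {rank}: "
        f"allocated={torch.cuda.memory_allocated() // mib} MiB, "
        f"reserved={torch.cuda.memory_reserved() // mib} MiB, "
        f"peak={torch.cuda.max_memory_allocated() // mib} MiB",
        flush=True,
    )


# e.g. call log_gpu_memory("after train_batch") once per step in train.py
```

Comparing these numbers with what `nvidia-smi` reports also exposes memory held outside the PyTorch caching allocator (e.g. the CUDA context and communication buffers).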
ds_report output
[2023-10-09 10:53:47,040] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
transformer_inference .. [NO] ....... [NO]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['~/.local/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['~/.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.3, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.2
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
shared memory (/dev/shm) size .... 125.89 GB
System info:
- Ubuntu 20.04.6 LTS
- `deepspeed==0.10.3`, `torch==2.0.1` and `torch.version.cuda==11.7`
- 1 single compute node, with 8x NVIDIA GeForce GTX TITAN X.
I have the same issue here; is there any solution for it?
Same issue here, any solutions now? :)