DeepSpeed [BUG] Issues with Running DeepSpeed ZeRO-2 & ZeRO-3 Not Taking Effect

Describe the bug

GPU memory usage does not decrease when training with DeepSpeed ZeRO-2 or ZeRO-3.

Situation:

The model structure is a combination of CNN and Transformer.
The code utilizes PyTorch's official AMP (Automatic Mixed Precision) and checkpoint.

code:

deepspeed.init_distributed()

......

model, optimizer, dataloader, _ = deepspeed.initialize(
    args=args,
    model=model.module,
    training_data=dataset,
    collate_fn=collate_fn,
    optimizer=optimizer,
    lr_scheduler=scheduler,
    dist_init_required=True,
)

.......

loss, pred = model(input_dict)
model.backward(loss)
model.step()

zero2 config:

{
    "train_batch_size": 4,
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "ignore_unused_parameters": true,
        "round_robin_gradients": true
    },
    "gradient_checkpointing": {
        "enabled": true
    }
}

zero3 config:

{
    "train_batch_size": 4,
    "zero_optimization": {
        "stage": 3,
        "allgather_partitions": true,
        "reduce_scatter": true,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 5e8,
        "stage3_max_reuse_distance": 5e8,
        "stage3_prefetch_bucket_size": 2e8,
        "stage3_param_persistence_threshold": 1e6,
        "sub_group_size": 5e8,
        "overlap_comm": true,
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        }
    },
    "gradient_checkpointing": {
        "enabled": true
    },
}
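For reference when reading these configs: DeepSpeed expects `train_batch_size` to equal the per-GPU micro-batch size times the gradient accumulation steps times the number of GPUs, and, as far as I know, it keeps the model states in fp32 unless its own `fp16` or `bf16` section is enabled. The sketch below applies both rules to the ZeRO-2 config as posted. It assumes 2 GPUs (from the system info) and no gradient accumulation, which the post does not state; `summarize` is a hypothetical helper, not a DeepSpeed API.

```python
# ZeRO-2 config as posted above, written as a Python dict.
ZERO2_CONFIG = {
    "train_batch_size": 4,
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": True,
        "allgather_bucket_size": 2e8,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": True,
        "ignore_unused_parameters": True,
        "round_robin_gradients": True,
    },
    "gradient_checkpointing": {"enabled": True},
}


def summarize(config, world_size, grad_accum_steps=1):
    """Hypothetical helper: derive the micro-batch size and precision mode.

    Solves train_batch_size = micro_batch * grad_accum_steps * world_size
    for micro_batch and reports which precision section is enabled.
    """
    denom = world_size * grad_accum_steps
    if config["train_batch_size"] % denom != 0:
        raise ValueError(
            "train_batch_size is not divisible by world_size * grad_accum_steps"
        )
    if config.get("bf16", {}).get("enabled"):
        precision = "bf16"
    elif config.get("fp16", {}).get("enabled"):
        precision = "fp16"
    else:
        precision = "fp32"  # neither section enabled in the posted config
    return {
        "zero_stage": config.get("zero_optimization", {}).get("stage", 0),
        "micro_batch_per_gpu": config["train_batch_size"] // denom,
        "precision": precision,
    }


# world_size=2 is an assumption taken from the system info in the post.
print(summarize(ZERO2_CONFIG, world_size=2))
```

Under those assumptions each GPU processes 2 samples per step, and the engine itself is not configured for reduced precision even though the training code uses PyTorch AMP.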

System info (please complete the following information):

OS: Ubuntu 20.04
One machine with 2x 4090D GPUs
Python version: 3.10.12
CUDA version: cuda_12.3.r12.3/compiler.33567101_0
requirements: absl-py 2.0.0 addict 2.4.0 aiohttp 3.9.1 aiosignal 1.3.1 annotated-types 0.6.0 apex 0.1 appdirs 1.4.4 argon2-cffi 23.1.0 argon2-cffi-bindings 21.2.0 asttokens 2.4.1 astunparse 1.6.3 async-timeout 4.0.3 attrs 23.1.0 audioread 3.0.1 bcrypt 4.2.1 beautifulsoup4 4.12.2 black 24.10.0 bleach 6.1.0 blinker 1.9.0 blis 0.7.11 cachetools 5.3.2 catalogue 2.0.10 ccimport 0.4.4 certifi 2023.11.17 cffi 1.16.0 charset-normalizer 3.3.2 click 8.1.7 cloudpathlib 0.16.0 cloudpickle 3.0.0 cmake 3.27.9 colored 2.2.4 coloredlogs 15.0.1 comm 0.2.0 confection 0.1.4 ConfigArgParse 1.7 contourpy 1.2.0 cryptography 44.0.0 cubinlinker 0.3.0+2.gbde7348 cuda-python 12.3.0rc4+8.gcb4e395 cudf 23.10.0 cugraph 23.10.0 cugraph-dgl 23.10.0 cugraph-service-client 23.10.0 cugraph-service-server 23.10.0 cuml 23.10.0 cumm-cu120 0.4.11 cupy-cuda12x 12.2.0 cycler 0.12.1 cymem 2.0.8 Cython 3.0.6 dash 2.18.2 dash-core-components 2.0.0 dash-html-components 2.0.0 dash-table 5.0.0 dask 2023.9.2 dask-cuda 23.10.0 dask-cudf 23.10.0 debugpy 1.8.0 decorator 5.1.1 deepspeed 0.16.3 defusedxml 0.7.1 descartes 1.1.0 dill 0.3.9 distributed 2023.9.2 dm-tree 0.1.8 easydict 1.13 einops 0.7.0 exceptiongroup 1.2.0 execnet 2.0.2 executing 2.0.1 expecttest 0.1.3 fastjsonschema 2.19.0 fastrlock 0.8.2 filelock 3.13.1 filterpy 1.4.5 fire 0.7.0 flake8 7.1.1 flash-attn 2.0.4 Flask 2.3.2 flatbuffers 24.12.23 fonttools 4.46.0 frozenlist 1.4.0 fsspec 2023.12.0 gast 0.5.4 google-auth 2.25.0 google-auth-oauthlib 0.4.6 graphsurgeon 0.4.6 grpcio 1.59.3 hjson 3.1.0 humanfriendly 10.0 hypothesis 5.35.1 ibasis 0.0.2 idna 3.6 importlib-metadata 7.0.0 iniconfig 2.0.0 intel-openmp 2021.4.0 iopath 0.1.10 ipykernel 6.27.1 ipython 8.18.1 ipython-genutils 0.2.0 ipywidgets 8.1.5 itsdangerous 2.2.0 jedi 0.19.1 Jinja2 3.1.2 joblib 1.3.2 json5 0.9.14 jsonschema 4.20.0 jsonschema-specifications 2023.11.2 jupyter 1.1.1 jupyter_client 8.6.0 jupyter-console 6.6.3 jupyter_core 5.5.0 jupyter-tensorboard 0.2.0 jupyterlab 2.3.2 
jupyterlab_pygments 0.3.0 jupyterlab-server 1.2.0 jupyterlab_widgets 3.0.13 jupytext 1.16.0 kiwisolver 1.4.5 kornia 0.8.0 kornia_rs 0.1.8 langcodes 3.3.0 lark 1.2.2 lazy_loader 0.3 librosa 0.10.1 llvmlite 0.40.1 locket 1.0.0 loguru 0.7.3 lyft-dataset-sdk 0.0.8 Mako 1.3.8 Markdown 3.5.1 markdown-it-py 3.0.0 MarkupSafe 2.1.3 matplotlib 3.8.2 matplotlib-inline 0.1.6 mccabe 0.7.0 mdit-py-plugins 0.4.0 mdurl 0.1.2 mistune 3.0.2 mkl 2021.1.1 mkl-devel 2021.1.1 mkl-include 2021.1.1 mock 5.1.0 mpmath 1.3.0 msgpack 1.0.7 multidict 6.0.4 multiprocess 0.70.17 murmurhash 1.0.10 mypy-extensions 1.0.0 nbclient 0.9.0 nbconvert 7.12.0 nbformat 5.7.0 nest-asyncio 1.5.8 networkx 2.6.3 ninja 1.11.1.1 notebook 6.4.10 numba 0.57.1+1.g4157f3379 numpy 1.24.4 nuscenes-devkit 1.1.9 nvfuser 0.1.1+gitunknown nvidia-dali-cuda120 1.32.0 nvidia-ml-py 12.560.30 nvidia-pyindex 1.0.9 nvtx 0.2.5 oauthlib 3.2.2 onnx 1.15.0rc2 onnx-graphsurgeon 0.5.2 onnxruntime 1.20.1 open3d 0.17.0 opencv 4.7.0 opencv-python 4.5.5.62 optree 0.10.0 packaging 23.2 pandas 1.5.3 pandocfilters 1.5.0 paramiko 3.5.0 parso 0.8.3 partd 1.4.1 pathspec 0.12.1 pccm 0.4.16 pexpect 4.9.0 Pillow 9.5.0 pip 23.3.1 platformdirs 4.1.0 plotly 5.24.1 pluggy 1.3.0 ply 3.11 polygraphy 0.49.1 pooch 1.8.0 portalocker 3.1.1 preshed 3.0.9 prettytable 3.9.0 prometheus-client 0.19.0 prompt-toolkit 3.0.41 protobuf 4.24.4 psutil 5.9.4 ptxcompiler 0.8.1+2.g5ad1474 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyarrow 12.0.1 pyarrow-hotfix 0.6 pyasn1 0.5.1 pyasn1-modules 0.3.0 pybind11 2.11.1 pybind11-global 2.11.1 pycocotools 2.0.8 pycodestyle 2.12.1 pycparser 2.21 pycuda 2024.1 pydantic 2.5.2 pydantic_core 2.14.5 pyflakes 3.2.0 Pygments 2.17.2 pylibcugraph 23.10.0 pylibcugraphops 23.10.0 pylibraft 23.10.0 PyNaCl 1.5.0 pynvml 11.4.1 pyparsing 3.1.1 pyquaternion 0.9.9 pytest 7.4.3 pytest-flakefinder 1.1.0 pytest-rerunfailures 13.0 pytest-shard 0.1.2 pytest-xdist 3.5.0 python-dateutil 2.8.2 python-hostlist 1.23.0 python-lzf 0.2.6 pytools 
2024.1.21 pytorch-quantization 2.1.2 pytorch3d 0.7.8 pytz 2023.3.post1 PyYAML 6.0.1 pyzmq 25.1.2 raft-dask 23.10.0 referencing 0.31.1 regex 2023.10.3 requests 2.31.0 requests-oauthlib 1.3.1 retrying 1.3.4 rich 13.7.0 rmm 23.10.0 rpds-py 0.13.2 rsa 4.9 scikit-learn 1.2.0 scipy 1.11.4 Send2Trash 1.8.2 setuptools 68.2.2 shapely 2.0.3 six 1.16.0 smart-open 6.4.0 sortedcontainers 2.4.0 soundfile 0.12.1 soupsieve 2.5 soxr 0.3.7 spacy 3.7.2 spacy-legacy 3.0.12 spacy-loggers 1.0.5 spconv-cu120 2.3.6 sphinx-glpi-theme 0.4.1 srsly 2.4.8 stack-data 0.6.3 sympy 1.12 tabulate 0.9.0 tbb 2021.11.0 tblib 3.0.0 tenacity 9.0.0 tensorboard 2.9.0 tensorboard-data-server 0.6.1 tensorboard-plugin-wit 1.8.1 tensorboardX 2.6.2.2 tensorrt 8.6.1 termcolor 2.5.0 terminado 0.18.0 terminaltables 3.1.10 thinc 8.2.1 threadpoolctl 3.2.0 thriftpy2 0.4.17 tinycss2 1.2.1 toml 0.10.2 tomli 2.0.1 toolz 0.12.0 torch 2.2.0a0+81ea7a4 torch-scatter 2.1.2 torch-tensorrt 2.2.0a0 torchdata 0.7.0a0 torchtext 0.17.0a0 torchvision 0.17.0a0 tornado 6.4 tqdm 4.66.1 traitlets 5.9.0 transformer-engine 1.1.0+cf6fc89 treelite 3.9.1 treelite-runtime 3.9.1 triton 2.1.0+6e4932c typer 0.9.0 types-dataclasses 0.6.6 typing_extensions 4.8.0 ucx-py 0.34.0 uff 0.6.9 urllib3 1.26.18 wasabi 1.1.2 wcwidth 0.2.12 weasel 0.3.4 webencodings 0.5.1 Werkzeug 2.3.6 wheel 0.42.0 widgetsnbextension 4.0.13 xdoctest 1.0.2 xgboost 1.7.6 yarl 1.9.3 zict 3.0.0 zipp 3.17.0

Feb 12 '25 12:02 fengdian8564

@fengdian8564 - can you share the GPUs you are using (and the GPU memory) and the rough size of your model?

Feb 14 '25 21:02 loadams

> @fengdian8564 - can you share the GPUs you are using (and the GPU memory) and the rough size of your model?

Thanks for your reply. My model size is 371 MB. The GPU memory usage is shown below:

1. Without DeepSpeed [screenshot not preserved]

2. With DeepSpeed (ZeRO-2) [screenshot not preserved]

Could this be caused by an environment issue? Here is the output of ds_report:

Feb 17 '25 15:02 fengdian8564
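A rough estimate of how much ZeRO can save for a model of this size may help when reading these numbers. The sketch below uses the per-parameter accounting from the ZeRO paper for mixed-precision Adam: 2 bytes for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for fp32 optimizer state. It assumes that the 371 MB figure refers to fp32 weights, that the optimizer is Adam, and that training runs on 2 GPUs; none of this is confirmed in the thread. Activations, buffers, and allocator fragmentation are ignored, and `model_state_bytes_per_gpu` is a hypothetical helper.

```python
def model_state_bytes_per_gpu(num_params, stage, world_size):
    """Estimate bytes of model state held on each GPU under a given ZeRO stage.

    Assumes mixed-precision Adam: fp16 weights (2 B/param), fp16 gradients
    (2 B/param), fp32 optimizer state (12 B/param).
    """
    weights = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params
    if stage >= 1:  # stage 1 partitions optimizer state
        optim /= world_size
    if stage >= 2:  # stage 2 also partitions gradients
        grads /= world_size
    if stage >= 3:  # stage 3 also partitions the weights
        weights /= world_size
    return weights + grads + optim


# Assumption: 371 MB of fp32 weights, i.e. 4 bytes per parameter.
num_params = 371e6 / 4

for stage in (0, 2, 3):
    gb = model_state_bytes_per_gpu(num_params, stage, world_size=2) / 1e9
    print(f"ZeRO stage {stage}: about {gb:.2f} GB of model state per GPU")
```

Under these assumptions the model states take roughly 1.5 GB per GPU without ZeRO and roughly 0.8 GB with stage 2 or 3, so the expected saving is well under 1 GB per GPU. If activations dominate the footprint, a change of that size can be hard to see in the overall GPU memory reading.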

> @fengdian8564 - can you share the GPUs you are using (and the GPU memory) and the rough size of your model?

Hi there, I'm curious to know how things are progressing with the issue. If it's not too much trouble, would you mind updating me on the latest status? Thank you!

Feb 24 '25 08:02 fengdian8564

Same issue here. I tried offloading the optimizer (50M in total) and saw no significant reduction in GPU memory. I then tried offloading both the optimizer and the parameters (2B in total), and GPU memory usage was still the same.

My ds_report:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
dc ..................... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
 [WARNING]  FP Quantizer is using an untested triton version (2.1.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
xxx/anaconda3/envs/xx/compiler_compat/ld: cannot find -lcufile: No such file or directory
collect2: error: ld returned 1 exit status
gds .................... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['xxx/lib/python3.8/site-packages/torch']
torch version .................... 2.1.0+cu118
deepspeed install path ........... ['xxx/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.16.7, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.4, cuda 12.1
shared memory (/dev/shm) size .... 251.79 GB

May 08 '25 15:05 ziyannchen
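To compare reports like the one above across machines, the op table can be pulled out with a few lines of Python. This is a sketch that assumes the plain-text layout shown in the report (`name .... [installed] .... [compatible]`); `parse_op_table` and `incompatible_ops` are hypothetical helpers, not part of DeepSpeed.

```python
import re

# Matches rows such as "cpu_adam ............... [NO] ....... [OKAY]".
OP_LINE = re.compile(r"^(\w+)\s*\.+\s*\[(\w+)\]\s*\.+\s*\[(\w+)\]\s*$")


def parse_op_table(report):
    """Return {op_name: {"installed": str, "compatible": str}} from ds_report text."""
    ops = {}
    for line in report.splitlines():
        match = OP_LINE.match(line.strip())
        if match:
            name, installed, compatible = match.groups()
            ops[name] = {"installed": installed, "compatible": compatible}
    return ops


def incompatible_ops(ops):
    """List ops whose 'compatible' column is anything other than OKAY."""
    return [name for name, status in ops.items() if status["compatible"] != "OKAY"]


# A few rows copied from the report above.
SAMPLE = """\
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
gds .................... [NO] ....... [NO]
stochastic_transformer . [NO] ....... [OKAY]
"""

ops = parse_op_table(SAMPLE)
print(incompatible_ops(ops))
```

Warning lines, the `ninja` row, and the header row do not match the pattern, so only the op rows end up in the result.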