DeepSpeed seems to OOM more easily after 0.18.0.
System Info
- `Accelerate` version: 0.18.0
- Platform: Linux-5.4.0-128-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.24.2
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- `Accelerate` default config:
Not found
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
I use accelerate + deepspeed to train a LLaMA 30B model on 8×A100 80GB. My batch size is 2, tokenlen is 2048, and zero_stage is 3. The detailed configuration follows, and a minimal sketch of my training loop is included after it:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
dynamo_backend: 'no'
fsdp_config: {}
machine_rank: 0
megatron_lm_config: {}
mixed_precision: 'fp16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
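For context, here is a minimal sketch of how the training script is structured, launched with `accelerate launch --config_file <the config above> train.py`. The model id, optimizer, and dataloader construction are illustrative placeholders, not my exact code:

```python
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# Illustrative only: the model id and dataloader are placeholders for my real setup.
accelerator = Accelerator()

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-30b")  # placeholder 30B checkpoint
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
train_dataloader = ...  # yields dicts of input_ids / attention_mask / labels, batch_size=2, 2048 tokens

model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for step, batch in enumerate(train_dataloader):
    with accelerator.accumulate(model):
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```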
At the first training step, the following error is raised:
File "/mnt/data/venv_acc/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward
output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
This looks like an OOM (cublasCreate typically fails with CUBLAS_STATUS_NOT_INITIALIZED when the GPU has run out of memory). The problem occurs with both accelerate 0.18.0 and 0.19.0, and I cannot train a single step unless I reduce tokenlen to 1024.
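To double-check that this really is memory pressure rather than a genuine cuBLAS initialization problem, a per-rank memory printout can be dropped into the step loop; the helper below is only an illustration (its name and placement are not from my actual script):

```python
import torch

# Rough per-rank memory check; all numbers come from PyTorch's caching
# allocator, so they differ from what nvidia-smi reports.
def log_cuda_memory(step, rank):
    gib = 2**30
    print(f"[rank {rank}] step {step}: "
          f"allocated={torch.cuda.memory_allocated() / gib:.1f} GiB, "
          f"reserved={torch.cuda.memory_reserved() / gib:.1f} GiB, "
          f"peak={torch.cuda.max_memory_allocated() / gib:.1f} GiB")
```

Calling this right before the forward pass on every rank shows how close each GPU is to the 80 GB limit when the failure happens.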
When I switch back to 0.17.1, I can train normally with tokenlen=2048. However, during training the following warning is printed:
1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
This warning does not appear with the newer accelerate versions.
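For completeness, the change that the 0.17.1 warning suggests would look roughly like this, continuing the loop sketched in the Reproduction section (whether this is actually the right mitigation here is part of what I am asking):

```python
from deepspeed.accelerator import get_accelerator

for step, batch in enumerate(train_dataloader):
    with accelerator.accumulate(model):
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

    # What the warning recommends: flush the caching allocator at the same
    # point on every rank so cache flushes stay synchronized across GPUs.
    get_accelerator().empty_cache()
```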
Expected behavior
I tried the latest versions of deepspeed and reached the same conclusion. What configuration change in the newer accelerate releases could be causing this?
I'm having the same issue. It occurs even with smaller models such as MT0-large.
cc @pacman100
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.