DeepSpeed seems to OOM more easily after 0.18.0.
System Info
- `Accelerate` version: 0.18.0
- Platform: Linux-5.4.0-128-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.24.2
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- `Accelerate` default config:
Not found
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
I use accelerate + deepspeed to train a LLaMA 30B model on 8×A100 80GB. My batch size is 2, tokenlen is 2048, and zero_stage is 3. The detailed configuration follows, and a minimal sketch of my training loop is included after it:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
dynamo_backend: 'no'
fsdp_config: {}
machine_rank: 0
megatron_lm_config: {}
mixed_precision: 'fp16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
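For context, here is a minimal sketch of how the training script is structured, launched with `accelerate launch --config_file <the config above> train.py`. The model id, optimizer, and dataloader construction are illustrative placeholders, not my exact code:

```python
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# Illustrative only: the model id and dataloader are placeholders for my real setup.
accelerator = Accelerator()

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-30b")  # placeholder 30B checkpoint
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
train_dataloader = ...  # yields dicts of input_ids / attention_mask / labels, batch_size=2, 2048 tokens

model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for step, batch in enumerate(train_dataloader):
    with accelerator.accumulate(model):
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```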
At the first training step, the following error is raised:
File "/mnt/data/venv_acc/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 57, in forward
output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
This looks like an OOM (cublasCreate typically fails with CUBLAS_STATUS_NOT_INITIALIZED when the GPU has run out of memory). The problem occurs with both accelerate 0.18.0 and 0.19.0, and I cannot train a single step unless I reduce tokenlen to 1024.
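To double-check that this really is memory pressure rather than a genuine cuBLAS initialization problem, a per-rank memory printout can be dropped into the step loop; the helper below is only an illustration (its name and placement are not from my actual script):

```python
import torch

# Rough per-rank memory check; all numbers come from PyTorch's caching
# allocator, so they differ from what nvidia-smi reports.
def log_cuda_memory(step, rank):
    gib = 2**30
    print(f"[rank {rank}] step {step}: "
          f"allocated={torch.cuda.memory_allocated() / gib:.1f} GiB, "
          f"reserved={torch.cuda.memory_reserved() / gib:.1f} GiB, "
          f"peak={torch.cuda.max_memory_allocated() / gib:.1f} GiB")
```

Calling this right before the forward pass on every rank shows how close each GPU is to the 80 GB limit when the failure happens.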
When I switch back to 0.17.1, I can train normally with tokenlen=2048. However, during training the following warning is printed:
1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
This warning does not appear with the newer accelerate versions.
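For completeness, the change that the 0.17.1 warning suggests would look roughly like this, continuing the loop sketched in the Reproduction section (whether this is actually the right mitigation here is part of what I am asking):

```python
from deepspeed.accelerator import get_accelerator

for step, batch in enumerate(train_dataloader):
    with accelerator.accumulate(model):
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

    # What the warning recommends: flush the caching allocator at the same
    # point on every rank so cache flushes stay synchronized across GPUs.
    get_accelerator().empty_cache()
```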
Expected behavior
I tried the latest versions of deepspeed and reached the same conclusion. What configuration change in the newer accelerate releases could be causing this?
I'm having the same issue. It occurs even with smaller models such as MT0-large.
cc @pacman100
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.