[BUG] Error "exits with return code -7" when finetuning FLAN-T5-xxl on 8x A100
1. Bug Description
I am finetuning Flan-T5-xxl on my own corpus with DeepSpeed, following the tutorial. When I run `deepspeed --num_gpus=8 scripts/run_seq2seq_deepspeed.py`, the process is terminated immediately after all checkpoint shards have been loaded onto the GPUs, exiting with return code -7 and no further error traceback.
Here is the full console output:
$ deepspeed --num_gpus=8 run_seq2seq_deepspeed.py
[2023-02-24 18:10:18,983] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-24 18:10:19,049] [INFO] [runner.py:548:main] cmd = /home/aiops/xxxx/.miniconda3/envs/torch-py39/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_seq2seq_deepspeed.py
[2023-02-24 18:10:22,043] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-02-24 18:10:22,043] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-02-24 18:10:22,043] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-02-24 18:10:22,043] [INFO] [launch.py:162:main] dist_world_size=8
[2023-02-24 18:10:22,043] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:34<00:00, 6.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:37<00:00, 7.49s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:40<00:00, 8.11s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:41<00:00, 8.37s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:43<00:00, 8.60s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:43<00:00, 8.74s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:44<00:00, 8.95s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:45<00:00, 9.10s/it]
[2023-02-24 18:12:41,354] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using cuda_amp half precision backend
[2023-02-24 18:12:42,235] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-02-24 18:13:10,284] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17786
[2023-02-24 18:13:10,284] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17787
[2023-02-24 18:13:10,286] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17788
[2023-02-24 18:13:10,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17789
[2023-02-24 18:13:10,620] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17790
[2023-02-24 18:13:10,953] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17791
[2023-02-24 18:13:10,954] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17792
[2023-02-24 18:13:11,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17793
[2023-02-24 18:13:11,901] [ERROR] [launch.py:324:sigkill_handler] ['/home/aiops/xxxx/.miniconda3/envs/torch-py39/bin/python', '-u', 'run_seq2seq_deepspeed.py', '--local_rank=7'] exits with return code = -7
2. Screenshots
- Here is the last line of console output before the processes were killed:
[2023-02-24 18:12:42,235] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
- Here are screenshots of the GPU and CPU state before the processes were killed:
3. ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/aiops/xxxx/.miniconda3/envs/trm-pt-py39/lib/python3.9/site-packages/torch']
torch version .................... 1.13.1+cu116
deepspeed install path ........... ['/home/aiops/xxxx/.miniconda3/envs/trm-pt-py39/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.8.0, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6
4. System Info
- OS: Ubuntu 18.04
- GPU: 8x A100 (40 GB each)
- CPU: 60 cores, Memory: 500 GB
5. Solution? Help! I understand that return code -7 does not mean OOM (an OOM kill shows up as -9). I have searched all over the Internet but could not find any clue about what return code -7 means in DeepSpeed. So please help: what does return code -7 mean, and how can I make training run smoothly?
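A hedged reading of the return code: the launcher reports the negative of the signal number when a worker process is killed by a signal, so -7 would correspond to signal 7, which is SIGBUS on Linux and is commonly triggered when shared memory is exhausted. A minimal way to check the mapping:

```bash
# Launcher return codes are the negated signal number; -7 therefore means
# the worker was killed by signal 7. Look up its name:
kill -l 7        # prints "BUS" (SIGBUS) on Linux
```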
@scofield7419, did you pass a ds_config on the command line? Can you share the contents of your ds_config? Thanks!

Hi @tjruwase, yes. In my trials I used ds_flan_t5_z3_config_bf16.json and ds_flan_t5_z3_offload_bf16.json, respectively, as-is without any modification. Both result in the same issue, return code -7.

Any update on this issue @scofield7419? :)
Hi @alexcoca, actually no... Are you facing the same issue?
@scofield7419 Could you check your NCCL version and make sure it is greater than v2.10? BF16 support requires an NCCL version above 2.10.
Hi @jomayeri, thanks for the advice, but my cluster has NCCL version 2.14.3. Doesn't that satisfy the version requirement?
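In case it helps anyone checking this requirement, here is a minimal way to print the NCCL version bundled with the installed PyTorch build (assuming `torch` is importable in the training environment):

```bash
# Print the NCCL version that the installed PyTorch build ships with,
# e.g. (2, 14, 3).
python -c "import torch; print(torch.cuda.nccl.version())"
```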
> @scofield7419 Could you check your NCCL version and make sure it is greater than v2.10? BF16 support requires an NCCL version above 2.10.

I have the same problem.
@scofield7419, we have received quite a few similar-sounding issues recently, and most were due to activation memory. Can you please look at the following issues: #2797, #2946, #2996?
Hi @tjruwase, I have 500 GB of memory and a batch size of only 1-2, so I doubt this is an OOM problem. I also tested with the smaller FLAN-T5-xl and ended up with the same error.
I have the same issue when working with Pythia: training works with fp16, but when I switch to bf16 it exits with -7 and no other meaningful error. The model is only 1.5B parameters and I have 8x 80 GB A100s, so it is definitely not OOM. The NCCL version is also correct. So I believe something is wrong in DeepSpeed.
Here is my ds_config for fp16 that works:
```json
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 0.0001
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```
Here is the ds_config for bf16 that doesn't work:
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
Did you solve it? I have the same problem.
Same error! Save me!
Same error!
Setting the shm-size to a large value instead of the default 64 MB when creating the Docker container solved the problem in my case. It appears that multi-GPU training relies on shared memory.
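A minimal sketch of the fix, assuming a Docker setup (the image name and size below are placeholders, not from this thread):

```bash
# Inside the container: check the shared-memory mount; the Docker default is 64 MB.
df -h /dev/shm

# Relaunch the container with a larger shared-memory segment.
docker run --gpus all --shm-size=16g my-training-image
```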
I am having the same problem. For me, training Pythia-2.8B runs on an Azure VM with 4 T4 GPUs, but when I containerize it and run it in a container, I get this exit code.
@Shrishml Try increasing the shm-size of the container as suggested in the comment above.
I increased the shm size on my AKS cluster via the YAML approach described here:
https://stackoverflow.com/questions/43373463/how-to-increase-shm-size-of-a-kubernetes-container-shm-size-equivalent-of-doc
or, with plain Docker, via docker run --rm --runtime=nvidia --gpus all --shm-size 3gb imagename. It worked.
Increasing the shared memory size of the docker container seems to resolve the issue.
Hi @scofield7419, did you end up finding a solution for this? I'm running into the same issue, and I am not using a Docker container.
@jomayeri is there a solution for this if I am not using a docker container?
@chaitanyamalaviya Please open a new bug with the requested information and assign it to me and I'll take a look.
Done, created an issue here.