[BUG] Error "exits with return code -7" when finetuning FLAN-T5-xxl on 8x A100
1. Bug Description
I am finetuning Flan-T5-xxl on my own corpus with DeepSpeed, following the tutorial. When I run `deepspeed --num_gpus=8 scripts/run_seq2seq_deepspeed.py`, the process is terminated immediately after all checkpoint shards have been loaded onto the GPUs, exiting with return code -7 and no further error traceback.
Here is the full console output:
$ deepspeed --num_gpus=8 run_seq2seq_deepspeed.py
[2023-02-24 18:10:18,983] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-24 18:10:19,049] [INFO] [runner.py:548:main] cmd = /home/aiops/xxxx/.miniconda3/envs/torch-py39/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_seq2seq_deepspeed.py
[2023-02-24 18:10:22,043] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-02-24 18:10:22,043] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-02-24 18:10:22,043] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-02-24 18:10:22,043] [INFO] [launch.py:162:main] dist_world_size=8
[2023-02-24 18:10:22,043] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:34<00:00, 6.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:37<00:00, 7.49s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:40<00:00, 8.11s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:41<00:00, 8.37s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:43<00:00, 8.60s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:43<00:00, 8.74s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:44<00:00, 8.95s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:45<00:00, 9.10s/it]
[2023-02-24 18:12:41,354] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using cuda_amp half precision backend
[2023-02-24 18:12:42,235] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-02-24 18:13:10,284] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17786
[2023-02-24 18:13:10,284] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17787
[2023-02-24 18:13:10,286] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17788
[2023-02-24 18:13:10,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17789
[2023-02-24 18:13:10,620] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17790
[2023-02-24 18:13:10,953] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17791
[2023-02-24 18:13:10,954] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17792
[2023-02-24 18:13:11,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17793
[2023-02-24 18:13:11,901] [ERROR] [launch.py:324:sigkill_handler] ['/home/aiops/xxxx/.miniconda3/envs/torch-py39/bin/python', '-u', 'run_seq2seq_deepspeed.py', '--local_rank=7'] exits with return code = -7
2. Screenshots
- Here is the last line of console output before the processes were killed:
[2023-02-24 18:12:42,235] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
- Here are screenshots of the GPU and CPU state before the processes were killed:
3. ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/aiops/xxxx/.miniconda3/envs/trm-pt-py39/lib/python3.9/site-packages/torch']
torch version .................... 1.13.1+cu116
deepspeed install path ........... ['/home/aiops/xxxx/.miniconda3/envs/trm-pt-py39/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.8.0, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6
4. System Info
- OS: Ubuntu 18.04
- GPU: 8x A100 (40 GB each)
- CPU: 60 cores, Memory: 500 GB
5. Solution? Help! I understand that return code -7 does not mean OOM (an OOM kill shows up as -9). I have searched all over the Internet but could not find any clue about what return code -7 means in DeepSpeed. So please help: what does return code -7 mean, and how can I make training run smoothly?
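A hedged reading of the return code: the launcher reports the negative of the signal number when a worker process is killed by a signal, so -7 would correspond to signal 7, which is SIGBUS on Linux and is commonly triggered when shared memory is exhausted. A minimal way to check the mapping:

```bash
# Launcher return codes are the negated signal number; -7 therefore means
# the worker was killed by signal 7. Look up its name:
kill -l 7        # prints "BUS" (SIGBUS) on Linux
```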
@scofield7419, did you pass a ds_config on the command line? Can you share the contents of your ds_config? Thanks!

Hi @tjruwase, yes. In my trials I used ds_flan_t5_z3_config_bf16.json and ds_flan_t5_z3_offload_bf16.json, respectively, as-is without any modification. Both result in the same issue, return code -7.

Any update on this issue @scofield7419? :)
Hi @alexcoca, actually no... Are you facing the same issue?
@scofield7419 Could you check your NCCL version and make sure it is greater than v2.10? BF16 support requires an NCCL version above 2.10.
Hi @jomayeri, thanks for the advice, but my cluster has NCCL version 2.14.3. Doesn't that satisfy the version requirement?
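In case it helps anyone checking this requirement, here is a minimal way to print the NCCL version bundled with the installed PyTorch build (assuming `torch` is importable in the training environment):

```bash
# Print the NCCL version that the installed PyTorch build ships with,
# e.g. (2, 14, 3).
python -c "import torch; print(torch.cuda.nccl.version())"
```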
> @scofield7419 Could you check your NCCL version and make sure it is greater than v2.10? BF16 support requires an NCCL version above 2.10.

I have the same problem.
@scofield7419, we have received quite a few similar-sounding issues recently, and most were due to activation memory. Can you please look at the following issues: #2797, #2946, #2996?
Hi @tjruwase, I have 500 GB of memory and a batch size of only 1-2, so I doubt this is an OOM problem. I also tested with the smaller FLAN-T5-xl and ended up with the same error.
I have the same issue when working with Pythia: training works with fp16, but when I switch to bf16 it exits with -7 and no other meaningful error. The model is only 1.5B parameters and I have 8x 80 GB A100s, so it is definitely not OOM. The NCCL version is also correct. So I believe something is wrong in DeepSpeed.
Here is my ds_config for fp16 that works:
```json
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 0.0001
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```
Here is the ds_config for bf16 that doesn't work:
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
Did you solve it? I have the same problem.
Same error! Save me!
Same error!
Setting the shm-size to a large value instead of the default 64 MB when creating the Docker container solved the problem in my case. It appears that multi-GPU training relies on shared memory.
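A minimal sketch of the fix, assuming a Docker setup (the image name and size below are placeholders, not from this thread):

```bash
# Inside the container: check the shared-memory mount; the Docker default is 64 MB.
df -h /dev/shm

# Relaunch the container with a larger shared-memory segment.
docker run --gpus all --shm-size=16g my-training-image
```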
I am having the same problem. For me, training Pythia-2.8B runs on an Azure VM with 4 T4 GPUs, but when I containerize it and run it in a container, I get this exit code.
@Shrishml Try increasing the shm-size of the container as suggested in the comment above.
I increased the shm size on my AKS cluster via the YAML approach described here:
https://stackoverflow.com/questions/43373463/how-to-increase-shm-size-of-a-kubernetes-container-shm-size-equivalent-of-doc
or, with plain Docker, via docker run --rm --runtime=nvidia --gpus all --shm-size 3gb imagename. It worked.
Increasing the shared memory size of the docker container seems to resolve the issue.
Hi @scofield7419, did you end up finding a solution for this? I'm running into the same issue, and I am not using a Docker container.
@jomayeri is there a solution for this if I am not using a docker container?
@chaitanyamalaviya Please open a new bug with the requested information and assign it to me and I'll take a look.
Done, created an issue here.