
[BUG] Error "exits with return code -7" when finetuning FLANT5-xxl on 8x A100

Open scofield7419 opened this issue 2 years ago • 19 comments

1. Bug Description

I am fine-tuning Flan-T5-xxl on my own corpus with DeepSpeed, following the tutorial. When I execute 'deepspeed --num_gpus=8 scripts/run_seq2seq_deepspeed.py', the processes are terminated immediately after all checkpoint shards have been loaded onto the GPUs, exiting with return code -7 and without any further error traceback.

Here are the full console outputs:

$ deepspeed --num_gpus=8 run_seq2seq_deepspeed.py
[2023-02-24 18:10:18,983] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-24 18:10:19,049] [INFO] [runner.py:548:main] cmd = /home/aiops/xxxx/.miniconda3/envs/torch-py39/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_seq2seq_deepspeed.py
[2023-02-24 18:10:22,043] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-02-24 18:10:22,043] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-02-24 18:10:22,043] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-02-24 18:10:22,043] [INFO] [launch.py:162:main] dist_world_size=8
[2023-02-24 18:10:22,043] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:34<00:00,  6.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:37<00:00,  7.49s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:40<00:00,  8.11s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:41<00:00,  8.37s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:43<00:00,  8.60s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:43<00:00,  8.74s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:44<00:00,  8.95s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:45<00:00,  9.10s/it]
[2023-02-24 18:12:41,354] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using cuda_amp half precision backend
[2023-02-24 18:12:42,235] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-02-24 18:13:10,284] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17786
[2023-02-24 18:13:10,284] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17787
[2023-02-24 18:13:10,286] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17788
[2023-02-24 18:13:10,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17789
[2023-02-24 18:13:10,620] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17790
[2023-02-24 18:13:10,953] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17791
[2023-02-24 18:13:10,954] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17792
[2023-02-24 18:13:11,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17793
[2023-02-24 18:13:11,901] [ERROR] [launch.py:324:sigkill_handler] ['/home/aiops/xxxx/.miniconda3/envs/torch-py39/bin/python', '-u', 'run_seq2seq_deepspeed.py', '--local_rank=7'] exits with return code = -7

2. Screenshots

  • Here is the last console line before the processes were killed:
[2023-02-24 18:12:42,235] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
  • GPU and CPU utilization just before the kill: (screenshots omitted)

3. ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/aiops/xxxx/.miniconda3/envs/trm-pt-py39/lib/python3.9/site-packages/torch']
torch version .................... 1.13.1+cu116
deepspeed install path ........... ['/home/aiops/xxxx/.miniconda3/envs/trm-pt-py39/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.8.0, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6

4. System Info

  • OS: Ubuntu 18.04
  • GPU: 8x A100 (40 GB each)
  • CPU: 60 cores, Mem: 500 GB

5. Solution? Help! I understand that return code -7 does not mean OOM (OOM would be -9). I've searched all over the Internet but could not find any clue about what return code -7 means in DeepSpeed. So please help: what does return code -7 mean, and how can I fix it so training runs smoothly?
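If the launcher is simply reporting the raw subprocess return code, a negative value should be the number of the signal that killed the worker, so -7 would be signal 7. A quick way to map that to a name (assuming a Linux shell; this is my own check, not launcher output):

$ kill -l 7
BUS

That would point to SIGBUS rather than the OOM killer's SIGKILL (-9), but I still don't know what would raise SIGBUS here.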

scofield7419 avatar Feb 25 '23 03:02 scofield7419

@scofield7419, did you pass a ds_config on the command line? Can you share the contents of your ds_config? Thanks!

tjruwase avatar Feb 25 '23 04:02 tjruwase

Hi @tjruwase, yes. In my trials I used ds_flan_t5_z3_config_bf16.json and ds_flan_t5_z3_offload_bf16.json, each as-is without any modification. Both result in the same issue, return code -7.

scofield7419 avatar Feb 25 '23 05:02 scofield7419

Any update on this issue @scofield7419? :)

alexcoca avatar Mar 02 '23 13:03 alexcoca

Hi @alexcoca, actually no... Are you facing the same issue?

scofield7419 avatar Mar 02 '23 13:03 scofield7419

@scofield7419 Could you check your NCCL version and make sure it is greater than v2.10? BF16 support requires a NCCL version above 2.10.
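A quick way to check the NCCL version your PyTorch build ships with (a sketch, assuming torch is importable in the training environment):

$ python -c "import torch; print(torch.cuda.nccl.version())"

On recent PyTorch builds this prints a tuple such as (2, 14, 3); anything above (2, 10, x) satisfies the requirement above.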

jomayeri avatar Mar 02 '23 18:03 jomayeri

Hi @jomayeri, thanks for the advice, but my cluster already has NCCL 2.14.3. Doesn't that satisfy the version requirement?

scofield7419 avatar Mar 05 '23 01:03 scofield7419

@scofield7419 Could you check your NCCL version and make sure it is greater than v2.10? BF16 support requires a NCCL version above 2.10.

(screenshot of the NCCL version check omitted)

scofield7419 avatar Mar 06 '23 13:03 scofield7419

I have the same problem.

luxuantao avatar Mar 15 '23 09:03 luxuantao

@scofield7419, we have received quite a few similar-sounding issues recently, and most were due to activation memory. Can you please look at the following issues: #2797, #2946, #2996.

tjruwase avatar Mar 17 '23 18:03 tjruwase

Hi @tjruwase, I have 500 GB of memory and use a batch size of only 1-2, so I don't believe this is an OOM problem. I also tested with the smaller FLAN-T5-xl and ended up with the same error.

scofield7419 avatar Mar 19 '23 04:03 scofield7419

I have the same issue when working on Pythia: training works with fp16, but when I switch to bf16 it errors out with -7 without any other meaningful error. The model is only 1.5B parameters and I have 8x 80 GB A100s, so it's definitely not OOM. The NCCL version is also correct. So I believe something is wrong in DeepSpeed.

Here is my ds_config for fp16 that works:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 0.0001
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

kyleliang919 avatar Mar 25 '23 16:03 kyleliang919

Here is the ds config that doesn't work:

    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

kyleliang919 avatar Mar 25 '23 16:03 kyleliang919

Did you solve it? I'm hitting the same problem.

hahchenchen avatar Apr 08 '23 09:04 hahchenchen

same error! save me!

codender avatar Apr 27 '23 05:04 codender

same error!

shunjiu avatar Apr 28 '23 04:04 shunjiu

Setting shm-size to a large value instead of the default 64 MB when creating the Docker container solved the problem in my case. It appears that multi-GPU training relies on shared memory.
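To see how much shared memory a container actually has (a quick check, assuming a standard Linux image with df available):

$ df -h /dev/shm

If this shows Docker's default of 64M, recreate the container with a larger --shm-size (or the equivalent setting in your orchestrator).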

luchenyu avatar May 05 '23 06:05 luchenyu

I am having the same problem. For me, training Pythia 2.8B runs fine on an Azure VM with 4 T4 GPUs, but when I containerize it and run it in a container I get this exit code.

Shrishml avatar May 08 '23 11:05 Shrishml

@Shrishml Try increasing the shm-size of the container as suggested in the comment above.

jomayeri avatar May 08 '23 17:05 jomayeri

I ran it on an AKS cluster with the shm size increased following this yaml approach:

https://stackoverflow.com/questions/43373463/how-to-increase-shm-size-of-a-kubernetes-container-shm-size-equivalent-of-doc

or, for plain Docker, with the command: docker run --rm --runtime=nvidia --gpus all --shm-size 3gb imagename

It worked.

Shrishml avatar May 09 '23 07:05 Shrishml

Increasing the shared memory size of the docker container seems to resolve the issue.
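(Outside a container, /dev/shm is normally a tmpfs sized to about half of system RAM, so it is less likely to be the culprit there; if it has been mounted smaller, it can be resized in place, assuming root access and treating the size below as just an example:)

$ sudo mount -o remount,size=64G /dev/shm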

jomayeri avatar May 09 '23 20:05 jomayeri

Hi @scofield7419, did you end up finding a solution for this? I'm running into the same issue, and I am not using a Docker container.

chaitanyamalaviya avatar May 22 '23 20:05 chaitanyamalaviya

@jomayeri is there a solution for this if I am not using a docker container?

chaitanyamalaviya avatar May 23 '23 22:05 chaitanyamalaviya

@chaitanyamalaviya Please open a new bug with the requested information and assign it to me and I'll take a look.

jomayeri avatar May 24 '23 16:05 jomayeri

Done, created an issue here.

chaitanyamalaviya avatar May 24 '23 17:05 chaitanyamalaviya