
[BUG] Process exit error on finetuning Flan T5 XXL on GCP A100 GPUs

Open · chagri opened this issue · 4 comments

Describe the bug
I am following this blog to finetune FLAN-T5 XXL on GCP (a2-highgpu-8: 8x A100 40 GB) and every launcher process reports "Process exits successfully" with no training taking place. I even tried a smaller FLAN-T5 model; same error.


To Reproduce
Steps to reproduce the behavior:

  1. On GCP, create an a2-highgpu-8 VM (8x A100 40 GB).
  2. Follow the instructions (including the installations) from this blog.
  3. Run the command from the blog; you can substitute "google/flan-t5-small" to replicate the issue quickly.
  4. See the error.

Expected behavior
STDOUT:

```
deepspeed --num_gpus=8 scripts/run_seq2seq_deepspeed.py --model_id google/flan-t5-xxl --dataset_path data --epochs 3 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --generation_max_length 100 --lr 1e-4 --deepspeed configs/ds_flan_t5_z3_config_bf16.json
[2023-03-14 13:39:52,067] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-14 13:40:08,533] [INFO] [runner.py:548:main] cmd = /opt/conda/bin/python3.7 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None scripts/run_seq2seq_deepspeed.py --model_id google/flan-t5-small --dataset_path ../data/qna_training_data/snippet_level_processed --epochs 3 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --generation_max_length 100 --lr 1e-4 --deepspeed configs/ds_flan_t5_z3_config_bf16.json
[2023-03-14 13:40:10,661] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-03-14 13:40:10,661] [INFO] [launch.py:149:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-03-14 13:40:10,661] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-03-14 13:40:10,661] [INFO] [launch.py:162:main] dist_world_size=8
[2023-03-14 13:40:10,661] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84965 exits successfully.
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84967 exits successfully.
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84968 exits successfully.
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84964 exits successfully.
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84966 exits successfully.
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84963 exits successfully.
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84962 exits successfully.
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84969 exits successfully.
```
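The log shows all eight worker processes exiting about one second after launch, with no traceback reaching the console. One way to capture whatever the workers are hitting is the runner's --enable_each_rank_log option, visible as --enable_each_rank_log=None in the cmd line above. A minimal sketch, assuming the installed DeepSpeed version accepts a directory for that flag (the rank_logs directory name is just an example):

```bash
# Rerun the small-model repro with each rank's stdout/stderr written to ./rank_logs,
# then inspect the files for any traceback that is missing from the console output.
# (Assumes the installed DeepSpeed runner supports --enable_each_rank_log.)
deepspeed --num_gpus=8 --enable_each_rank_log=rank_logs \
  scripts/run_seq2seq_deepspeed.py \
  --model_id google/flan-t5-small \
  --dataset_path data \
  --epochs 3 \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --generation_max_length 100 \
  --lr 1e-4 \
  --deepspeed configs/ds_flan_t5_z3_config_bf16.json

# Look for the per-rank logs the launcher wrote.
cat rank_logs/*
```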

ds_report output Please run ds_report to give us details about your setup.
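For reference, the report this section asks for can be generated with ds_report, the diagnostic command that ships with DeepSpeed; the extra one-liner just prints the installed DeepSpeed and PyTorch versions for quick comparison:

```bash
# Print DeepSpeed's environment/compatibility report requested above.
ds_report

# Quick check of the installed DeepSpeed and PyTorch versions.
python -c "import deepspeed, torch; print(deepspeed.__version__, torch.__version__)"
```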

Screenshots If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

  • OS: Debian 10 based Deep Learning VM for PyTorch CPU/GPU with CUDA 11.3 M103
  • GPU count and types: 8x A100 40 GB
  • Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
  • Python version: 3.7.12
  • Any other relevant info about your setup

Launcher context Are you launching your experiment with the deepspeed launcher, MPI, or something else?

Docker context Are you using a specific docker image that you can share?

Additional context Add any other context about the problem here.

chagri · Mar 14 '23

@chagri, please see the following issues for resolution: #2797, #2946, #2996.

tjruwase · Mar 17 '23

@tjruwase Thanks for the response. I looked at these issues, but I don't think they are related to the problem I am facing. Maybe I am missing something?

I do not seem to be hitting a memory issue. I even tried FLAN-T5-large, which is only about 800M parameters. I think the processes simply exit without offloading to CPU or raising a memory error.

chagri · Mar 23 '23

I am having a similar issue in a very similar setting: GCP with A100s, training FLAN-T5 XXL. Any updates on a fix? The suggestions in the other issues don't seem to resolve it.

lnevesg · Mar 29 '23

@chagri, @lnevesg, unfortunately we are unable to repro this issue. Out of curiosity,

  1. Are you able to train/finetune other models in this environment?
  2. Also, can you repro the problem on a single GPU using google/flan-t5-small? (A sketch of such a run is below.)
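
For anyone following along, a single-GPU run along the lines of question 2 could look like the sketch below, reusing the script, arguments, and config from the command in the original report (the --dataset_path value "data" is the placeholder from that command, so point it at your own data):

```bash
# Single-GPU repro with the small checkpoint, to separate the failure from
# multi-GPU launch issues; script, arguments, and config paths are taken
# from the original command in this issue.
deepspeed --num_gpus=1 scripts/run_seq2seq_deepspeed.py \
  --model_id google/flan-t5-small \
  --dataset_path data \
  --epochs 3 \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --generation_max_length 100 \
  --lr 1e-4 \
  --deepspeed configs/ds_flan_t5_z3_config_bf16.json
```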

tjruwase · May 13 '23

Closing for lack of response. Please check if #4015 is related. Also, re-open if needed.

tjruwase · Aug 10 '23