DeepSpeed
[BUG] Process exit error on finetuning Flan T5 XXL on GCP A100 GPUs
Describe the bug
I am following this blog to finetune Flan-T5-XXL on GCP (a2-highgpu-8: 8× A100 40GB), and every launched process exits immediately with "Process exits successfully", so training never starts. I even tried with a smaller Flan-T5 model; same error.
To Reproduce
Steps to reproduce the behavior:
- On GCP, create an a2-highgpu-8 VM (8× A100 40GB).
- Follow the instructions (including installations) as mentioned in this blog.
- Run the command mentioned in the blog; you can replace the model with "google/flan-t5-small" to replicate it quickly (a sketch of such a run follows this list).
- See error.
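For quick replication, the blog's command with the small model swapped in looks roughly like the sketch below (paths are assumed to follow the blog's repository layout; the command actually used is in the STDOUT section):

```bash
# Sketch of the repro command from the blog, with google/flan-t5-small
# substituted so the failure reproduces quickly on the same 8-GPU VM.
deepspeed --num_gpus=8 scripts/run_seq2seq_deepspeed.py \
  --model_id google/flan-t5-small \
  --dataset_path data \
  --epochs 3 \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --generation_max_length 100 \
  --lr 1e-4 \
  --deepspeed configs/ds_flan_t5_z3_config_bf16.json
```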
Expected behavior
Training should start and run; instead, all eight launcher processes exit within seconds. STDOUT:
```
deepspeed --num_gpus=8 scripts/run_seq2seq_deepspeed.py --model_id google/flan-t5-xxl --dataset_path data --epochs 3 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --generation_max_length 100 --lr 1e-4 --deepspeed configs/ds_flan_t5_z3_config_bf16.json
[2023-03-14 13:39:52,067] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-14 13:40:08,533] [INFO] [runner.py:548:main] cmd = /opt/conda/bin/python3.7 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None scripts/run_seq2seq_deepspeed.py --model_id google/flan-t5-small --dataset_path ../data/qna_training_data/snippet_level_processed --epochs 3 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --generation_max_length 100 --lr 1e-4 --deepspeed configs/ds_flan_t5_z3_config_bf16.json
[2023-03-14 13:40:10,661] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-03-14 13:40:10,661] [INFO] [launch.py:149:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-03-14 13:40:10,661] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-03-14 13:40:10,661] [INFO] [launch.py:162:main] dist_world_size=8
[2023-03-14 13:40:10,661] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84965 exits successfully.
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84967 exits successfully.
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84968 exits successfully.
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84964 exits successfully.
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84966 exits successfully.
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84963 exits successfully.
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84962 exits successfully.
[2023-03-14 13:40:11,693] [INFO] [launch.py:350:main] Process 84969 exits successfully.
```
ds_report output
Please run `ds_report` to give us details about your setup.
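For reference, the report can be generated with the `ds_report` console script that ships with the DeepSpeed install (a minimal example, not output captured from the affected machine):

```bash
# Prints DeepSpeed version, installed/compatible ops, and torch/CUDA details.
ds_report
```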
Screenshots If applicable, add screenshots to help explain your problem.
System info (please complete the following information):
- OS: Debian 10-based Deep Learning VM for PyTorch CPU/GPU with CUDA 11.3 (M103)
- GPU count and types: 8× A100 40 GB
- Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
- Python version: 3.7.12
- Any other relevant info about your setup
Launcher context
Are you launching your experiment with the `deepspeed` launcher, MPI, or something else?
Docker context Are you using a specific docker image that you can share?
Additional context Add any other context about the problem here.
@chagri, please see the following issues for resolution: #2797, #2946, #2996.
@tjruwase Thanks for the response. I looked at these issues, but I don't think they are related to the one I am facing. Maybe I am missing something?
I do not seem to be hitting a memory issue. I even tried FLAN-T5-large, which is only about 800M parameters. The processes simply exit without offloading to CPU or raising an out-of-memory error.
I am having a similar issue here, in a very similar setting on GCP using A100s to train XXL. Any updates on a fix? The suggestions in the other issues don't seem to resolve it.
@chagri, @lnevesg, unfortunately we are unable to repro this issue. Out of curiosity,
- Are you able to train/finetune other models in this environment?
- Also, can you repro the problem on a single GPU using google/flan-t5-small? (A sketch of such a run is below.)
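For reference, a single-GPU run might look like the following sketch, reusing the script, arguments, and config from the original command (paths are assumed to match the blog's repository layout):

```bash
# Single-GPU repro sketch: same script and ZeRO-3 bf16 config as the original
# command, restricted to one device via --num_gpus=1.
deepspeed --num_gpus=1 scripts/run_seq2seq_deepspeed.py \
  --model_id google/flan-t5-small \
  --dataset_path data \
  --epochs 3 \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --generation_max_length 100 \
  --lr 1e-4 \
  --deepspeed configs/ds_flan_t5_z3_config_bf16.json
```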
Closing for lack of response. Please check if #4015 is related. Also, re-open if needed.