
[BUG] Finetune T5 11B: the process is killed and exits with return code = -9

Open zhilizju opened this issue 2 years ago • 19 comments

Describe the bug
Hi, I want to finetune the T5 model (11B), but the process is killed and exits with return code = -9.

[2023-03-05 06:40:25,173] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698707
[2023-03-05 06:40:27,249] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698708
[2023-03-05 06:40:27,250] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698709
[2023-03-05 06:40:28,626] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698710
[2023-03-05 06:40:30,045] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698711
[2023-03-05 06:40:31,498] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698712
[2023-03-05 06:40:32,912] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698713
[2023-03-05 06:40:34,370] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698714
[2023-03-05 06:40:35,705] [ERROR] [launch.py:324:sigkill_handler] ['/home/lizhi/anaconda3/envs/tk-instruct/bin/python', '-u', 'src/run_s2s.py', '--local_rank=7', '--do_train', '--do_predict', '--predict_with_generate', '--model_name_or_path', '/home/lizhi/Tk-Instruct-main/google/t5-xxl-lm-adapt', '--max_source_length', '1024', '--max_target_length', '128', '--generation_max_length', '128', '--max_num_instances_per_task', '1', '--max_num_instances_per_eval_task', '1', '--add_task_name', 'False', '--add_task_definition', 'True', '--num_pos_examples', '2', '--num_neg_examples', '0', '--add_explanation', 'False', '--tk_instruct', 'False', '--data_dir', 'data/splits/default', '--task_dir', 'data/tasks', '--output_dir', 'output/', '--overwrite_output_dir', '--cache_dir', './cache/', '--overwrite_cache', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--learning_rate', '5e-05', '--num_train_epochs', '1', '--lr_scheduler_type', 'constant', '--warmup_steps', '0', '--logging_strategy', 'steps', '--logging_steps', '500', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '2500', '--deepspeed', 'ds_configs/11b_stage3_offload.config', '--bf16', '--run_name', 't5-experiment'] exits with return code = -9
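Return code -9 means the worker processes received SIGKILL, which on a single-node setup like this is typically the Linux OOM killer reacting to exhausted host RAM; dmesg usually shows a matching "Out of memory: Killed process ..." entry. A minimal sketch for checking available host memory before launching, assuming a Linux host:

def available_mem_gib():
    with open("/proc/meminfo") as f:
        meminfo = dict(line.split(":", 1) for line in f)
    return int(meminfo["MemAvailable"].split()[0]) / 2**20  # value is in kB -> GiB

print(f"MemAvailable before launch: {available_mem_gib():.1f} GiB")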

To Reproduce
Steps to reproduce the behavior:

The code is based on the project https://github.com/yizhongw/Tk-Instruct. Just add a new config (name it 11b_stage3_offload.config) under the folder ds_configs. The content of the new config is:

{ "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": false }, "offload_param": { "device": "cpu", "pin_memory": false }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": false }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }

Then modify two parameters in the script https://github.com/yizhongw/Tk-Instruct/blob/main/scripts/train_tk_instruct.sh: replace
--model_name_or_path google/t5-xl-lm-adapt
with
--model_name_or_path google/t5-xxl-lm-adapt
point --deepspeed at the new config:
--deepspeed ds_configs/11b_stage3_offload.config

and add --bf16

Expected behavior
I hope I can finetune this model.

ds_report output: (screenshot attached)

Screenshots: (screenshot attached)

System info (please complete the following information):

  • OS: [e.g. Ubuntu 20.04]
  • GPU count and types: one machine with 8x RTX 6000, 48 GB per GPU
  • Python version: 3.8.16

Launcher context

#!/bin/bash
set -x

export CUDA_DEVICE_ORDER="PCI_BUS_ID"
export TRANSFORMERS_CACHE=/home/lizhi/.cache/huggingface

port=$(shuf -i25000-30000 -n1)

deepspeed --master_port $port src/run_s2s.py \
    --do_train \
    --do_predict \
    --predict_with_generate \
    --model_name_or_path google/t5-xxl-lm-adapt \
    --max_source_length 1024 \
    --max_target_length 128 \
    --generation_max_length 128 \
    --max_num_instances_per_task 1 \
    --max_num_instances_per_eval_task 1 \
    --add_task_name False \
    --add_task_definition True \
    --num_pos_examples 2 \
    --num_neg_examples 0 \
    --add_explanation False \
    --tk_instruct False \
    --data_dir data/splits/default \
    --task_dir data/tasks \
    --output_dir output/ \
    --overwrite_output_dir \
    --cache_dir ./cache/ \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 5e-05 \
    --num_train_epochs 1 \
    --lr_scheduler_type constant \
    --warmup_steps 0 \
    --logging_strategy steps \
    --logging_steps 500 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 2500 \
    --deepspeed ds_configs/11b_stage3_offload.config \
    --bf16 \
    --run_name t5-experiment

zhilizju avatar Mar 05 '23 07:03 zhilizju

I installed two packages; here is the new ds_report output: (screenshot attached)

I find that when I start to run this script, CPU memory gradually fills up (available memory drops from 332706 MB to 0). (screenshot attached)

Does this mean we can't train the model even with offload? But I find that this project, https://github.com/philschmid/deep-learning-pytorch-huggingface, uses a similar 8-GPU setup and successfully trains an 11B T5 model.

zhilizju avatar Mar 05 '23 08:03 zhilizju
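For a rough sense of scale, here is a back-of-envelope sketch (an estimate, not a measurement) using the usual ZeRO-Offload accounting of roughly 12 bytes of host RAM per parameter for the offloaded fp32 master weights plus the two Adam moments; the second number is a hypothesis about per-rank checkpoint loading that would line up with the ~330 GB initialization spike discussed later in this thread, not a confirmed diagnosis:

params = 11e9  # T5-XXL

# Offloaded optimizer state that lives in host RAM for the whole run:
fp32_master, adam_m, adam_v = 4, 4, 4  # bytes per parameter
print(f"optimizer state on CPU ~ {params * (fp32_master + adam_m + adam_v) / 2**30:.0f} GiB")  # ~123 GiB

# If each of the 8 ranks also materializes a full fp32 copy of the model while
# loading the checkpoint, before ZeRO-3 partitions it, that alone would be:
print(f"8 ranks x fp32 copy ~ {8 * params * 4 / 2**30:.0f} GiB")  # ~328 GiB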

Any help would be appreciated @tjruwase @stas00

zhilizju avatar Mar 05 '23 08:03 zhilizju

Any help would be appreciated @tjruwase @stas00

It seems this is an OOM. Your memory usage is 513497.

lambda7xx avatar Mar 05 '23 11:03 lambda7xx

Any help would be appreciated @tjruwase @stas00

It seems this is an OOM. Your memory usage is 513497.

Yes, I know. What I want to know is: with 332706 MB of CPU memory and 8x48 GB GPUs, can't we finetune an 11B T5 model? If we can, what's wrong with my config, or with something else? Thank you @lambda7xx

zhilizju avatar Mar 05 '23 11:03 zhilizju

Any help would be appreciated @tjruwase @stas00

It seems this is an OOM. Your memory usage is 513497.

Yes, I know. What I want to know is: with 332706 MB of CPU memory and 8x48 GB GPUs, can't we finetune an 11B T5 model? If we can, what's wrong with my config, or with something else? Thank you @lambda7xx

You mean training an 11B model with 8x48 GB of GPU memory? I think that's enough even if you don't use ZeRO. I have tried training a 15B model on 8x32 GB GPUs, and it was OK.

lambda7xx avatar Mar 05 '23 11:03 lambda7xx

Yes, it should be enough, but I don't know why it doesn't work; that is why I opened this issue. I also tried it without ZeRO and it also raised the -9 error. See https://github.com/yizhongw/Tk-Instruct/issues/22#issue-1610040124.

zhilizju avatar Mar 05 '23 12:03 zhilizju

Any help would be appreciated @tjruwase @stas00

It seems this is an OOM. Your memory usage is 513497.

Yes, I know. What I want to know is: with 332706 MB of CPU memory and 8x48 GB GPUs, can't we finetune an 11B T5 model? If we can, what's wrong with my config, or with something else? Thank you @lambda7xx

You mean training an 11B model with 8x48 GB of GPU memory? I think that's enough even if you don't use ZeRO. I have tried training a 15B model on 8x32 GB GPUs, and it was OK.

Would you like to share your 15B DeepSpeed config? I have sent you an email.

zhilizju avatar Mar 05 '23 12:03 zhilizju

I do not use DeepSpeed to run the 15B model. I use Alpa to run the 15B model on 32G GPUs.

lambda7xx avatar Mar 05 '23 12:03 lambda7xx

I do not use DeepSpeed to run the 15B model. I use Alpa to run the 15B model on 32G GPUs.

Anyway, thanks!

zhilizju avatar Mar 05 '23 12:03 zhilizju

Still need help. @tjruwase @stas00

zhilizju avatar Mar 06 '23 04:03 zhilizju

@zhilizju, can you try disabling offloading by removing offload_params and offload_optimizer from your ds_config? It seems that with zero stage 3 you should have enough GPU memory (across the 8 GPUs) to at least initialize the model. Please share a stack trace of this run. Thanks!

tjruwase avatar Mar 06 '23 11:03 tjruwase
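One way to sanity-check the available GPU/CPU headroom before a full launch is DeepSpeed's built-in ZeRO-3 estimator, which prints per-GPU and per-host requirements for the different offload options. A sketch, assuming there is enough host RAM to load the checkpoint once just for the estimate:

from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-xxl-lm-adapt")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)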

@zhilizju, can you try disabling offloading by removing offload_params and offload_optimizer from your ds_config? It seems that with zero stage 3 you should have enough GPU memory (across the 8 GPUs) to at least initialize the model. Please share a stack trace of this run. Thanks!

Amazing! It works! Thanks a lot! But it still confuses me: during initialization of this model, CPU memory usage increased from 0 to 330 GB, taking almost all of the available CPU memory. Is that reasonable? The 11B model is only 40+ GB. If I use offload_params and offload_optimizer, then the CPU memory is exhausted.

zhilizju avatar Mar 06 '23 17:03 zhilizju

I am glad that worked for you. I hope you are able to fit a reasonable batch size for your finetuning.

In terms of the high CPU usage, I have suspicions of the cause which I am trying to address with #2953. Unfortunately, I am not able to actually reproduce the problem yet on my side. If you have interest or bandwidth to help, I can share some instrumented branch with you for more detailed profiling. But the important thing is to unblock you.

tjruwase avatar Mar 06 '23 18:03 tjruwase
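If it helps with the profiling offer above, DeepSpeed also ships a small reporting helper that prints both CUDA and host memory at a given point; wrapping model construction in the training script with it is a low-effort way to see where the spike happens. A sketch (the from_pretrained call is only a placeholder for wherever the model is actually built):

from deepspeed.runtime.utils import see_memory_usage
from transformers import AutoModelForSeq2SeqLM

see_memory_usage("before from_pretrained", force=True)
model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-xxl-lm-adapt")
see_memory_usage("after from_pretrained", force=True)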

I am glad that worked for you. I hope you are able to fit a reasonable batch size for your finetuning.

In terms of the high CPU usage, I have suspicions of the cause which I am trying to address with #2953. Unfortunately, I am not able to actually reproduce the problem yet on my side. If you have interest or bandwidth to help, I can share some instrumented branch with you for more detailed profiling. But the important thing is to unblock you.

Thanks for your kind reply! I'm interested in helping address this issue; I think it is important. But I am a new DeepSpeed user and may not be able to help much.

zhilizju avatar Mar 07 '23 02:03 zhilizju

@zhilizju, can you try disabling offloading by removing offload_params and offload_optimizer from your ds_config? It seems that with zero stage 3 you should have enough GPU memory (across the 8 GPUs) to at least initialize the model. Please share a stack trace of this run. Thanks!

Amazing! It works! Thanks a lot! But it still confuses me: during initialization of this model, CPU memory usage increased from 0 to 330 GB, taking almost all of the available CPU memory. Is that reasonable? The 11B model is only 40+ GB. If I use offload_params and offload_optimizer, then the CPU memory is exhausted.

Hi, did you manage to run the whole process? For me, I was able to initialize the model, but as soon as it reaches training the OOM error appears. I'm training a 13B model on an 8x48 GB machine, without any offloading.

xinj7 avatar Mar 11 '23 02:03 xinj7

enable gradient checkpointing to liberate a ton of gpu memory, see: https://github.com/microsoft/DeepSpeed/issues/2797#issuecomment-1423466674

in some cases this allows you to double or quadruple the batch size if you were already able to do a small batch size w/o OOM.

stas00 avatar Mar 11 '23 02:03 stas00
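For completeness, a sketch of the two usual ways to turn gradient checkpointing on in an HF model/Trainer setup like this one (the Tk-Instruct script may already expose a flag for it):

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-xxl-lm-adapt")
model.gradient_checkpointing_enable()  # recompute activations in the backward pass to save GPU memory

# Or, when driving everything through the HF Trainer, set
# gradient_checkpointing=True in TrainingArguments (i.e. add
# --gradient_checkpointing to the launch command).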

enable gradient checkpointing to liberate a ton of gpu memory, see: #2797 (comment)

in some cases this allows you to double or quadruple the batch size if you were already able to do a small batch size w/o OOM.

Thanks for the suggestion, but I'm already using gradient checkpointing in the trainer. I wonder if it's just impossible to train a 13B model in fp32 (the GPU doesn't support bf16) on an 8x48 GB machine without offloading (I only have 150 GB of available CPU RAM, so no offloading).

Setting: (screenshot attached)

OOM error: (screenshot attached)

xinj7 avatar Mar 11 '23 03:03 xinj7

Your question is different from this issue and should be dealt with separately.

Could you please open a new issue where you specify all the details of your setup? For example, you're missing your ds config file, and you are not showing your command line or the values of your args, so it's very difficult for me to see the full picture.

There are other solutions for saving GPU memory, e.g. using BNB's 8-bit optimizer (https://github.com/huggingface/transformers/pull/15622), though I haven't tried it with DeepSpeed. But let's discuss it there.

Please tag me on it.

stas00 avatar Mar 11 '23 03:03 stas00
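For reference, a minimal sketch of the BNB 8-bit optimizer in plain PyTorch, assuming bitsandbytes is installed; as noted above, combining it with DeepSpeed is untested here, and the tiny nn.Linear model is only a stand-in for a real one:

import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(4096, 4096).cuda()                          # stand-in for the real model
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=5e-05)  # 8-bit optimizer states instead of fp32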

@zhilizju, can you try disabling offloading by removing offload_params and offload_optimizer from your ds_config? It seems that with zero stage 3 you should have enough GPU memory (across the 8 GPUs) to at least initialize the model. Please share a stack trace of this run. Thanks!

Amazing! It works! Thanks a lot! But it still confuses me: during initialization of this model, CPU memory usage increased from 0 to 330 GB, taking almost all of the available CPU memory. Is that reasonable? The 11B model is only 40+ GB. If I use offload_params and offload_optimizer, then the CPU memory is exhausted.

Hi, did you manage to run the whole process? For me, I was able to initialize the model, but as soon as it reaches training the OOM error appears. I'm training a 13B model on an 8x48 GB machine, without any offloading.

I use bf16 rather than fp32. My batch size on each GPU is 1, and increasing it leads to OOM. (I also suspect that a model in fp32 may be difficult to train.) But I can increase --gradient_accumulation_steps to 8, so the total effective batch size is 64. I hope this information helps you; stas00 also gave some good suggestions above that are worth trying. Best wishes!

zhilizju avatar Mar 11 '23 16:03 zhilizju

@zhilizju, is it okay to close this issue since the original problem is resolved?

tjruwase avatar Mar 17 '23 17:03 tjruwase