Issue with loading a pretrained model using DeepSpeed ZeRO Stage 3
System Info
- `transformers` version: 4.19.0.dev0
- Platform: Linux-5.4.0-90-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.12.0.dev20220505+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes (deepspeed zero stage-3)
Who can help?
@stas00 @sgugger
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Steps to reproduce the behaviour:
- Official run_glue.py script
- Below ZeRO Stage-3 config
zero3_config.json:
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
- bash script to run the finetuning of bert-base-uncased on the MRPC dataset using ZeRO Stage-3.
#!/bin/bash
time torchrun --nproc_per_node=2 run_glue.py \
--task_name "mrpc" \
--max_seq_len 128 \
--model_name_or_path "bert-base-uncased" \
--output_dir "./glue/mrpc_deepspeed_stage3_trainer" \
--overwrite_output_dir \
--do_train \
--evaluation_strategy "epoch" \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-5 \
--weight_decay 0.0 \
--max_grad_norm 1.0 \
--num_train_epochs 3 \
--lr_scheduler_type "linear" \
--warmup_steps 50 \
--logging_steps 100 \
--fp16 \
--fp16_full_eval \
--optim "adamw_torch" \
--report_to "wandb" \
--deepspeed "zero3_config.json"
- Relevant output snippets. The first one shows the weird behaviour wherein the model isn't being properly initialized with the pretrained weights. The second shows the eval metrics at random-chance performance.

Expected behavior
The model should be properly initialized with the pretrained weights when using DeepSpeed ZeRO Stage-3. This should resolve the poor model performance being observed.
Sounds like a potential problem with pt-nightly?
It works just fine on pt-1.11 - this is adapted to use the files from the repo directly:
torchrun --nproc_per_node=2 examples/pytorch/text-classification/run_glue.py \
--task_name mrpc --max_seq_len 128 --model_name_or_path bert-base-uncased \
--output_dir xxx --overwrite_output_dir --do_train --evaluation_strategy epoch \
--per_device_train_batch_size 1 --per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 --learning_rate 2e-5 --weight_decay 0.0 \
--max_grad_norm 1.0 --num_train_epochs 3 --lr_scheduler_type linear \
--warmup_steps 50 --logging_steps 100 --fp16 --fp16_full_eval --optim \
adamw_torch --deepspeed tests/deepspeed/ds_config_zero3.json
But I need to look more closely, since you're reporting quality issues rather than an outright failure. Will retest with 1.12 and then check the log closely.
pt-nightly works just fine
I get a very nice learning curve:
[INFO|trainer.py:1428] 2022-05-18 17:56:02,223 >> ***** Running training *****
[INFO|trainer.py:1429] 2022-05-18 17:56:02,224 >> Num examples = 3668
[INFO|trainer.py:1430] 2022-05-18 17:56:02,224 >> Num Epochs = 3
[INFO|trainer.py:1431] 2022-05-18 17:56:02,224 >> Instantaneous batch size per device = 32
[INFO|trainer.py:1432] 2022-05-18 17:56:02,224 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1433] 2022-05-18 17:56:02,224 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1434] 2022-05-18 17:56:02,224 >> Total optimization steps = 345
0%| | 0/345 [00:00<?, ?it/s][2022-05-18 17:56:02,941] [INFO] [stage3.py:2240:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 65536
0%|▎ | 1/345 [00:00<04:04, 1.41it/s][2022-05-18 17:56:03,946] [INFO] [stage3.py:2240:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768.0
{'loss': 1.1734, 'learning_rate': 1.0631029208133474e-05, 'epoch': 0.09}
{'loss': 0.8276, 'learning_rate': 1.4776864828686414e-05, 'epoch': 0.17}
{'loss': 0.6035, 'learning_rate': 1.7035710196752873e-05, 'epoch': 0.26}
{'loss': 0.5612, 'learning_rate': 1.859695689252868e-05, 'epoch': 0.35}
{'loss': 0.5857, 'learning_rate': 1.9791299823832263e-05, 'epoch': 0.43}
{'loss': 0.5462, 'learning_rate': 2e-05, 'epoch': 0.52}
{'loss': 0.5273, 'learning_rate': 2e-05, 'epoch': 0.61}
{'loss': 0.5543, 'learning_rate': 2e-05, 'epoch': 0.7}
{'loss': 0.5658, 'learning_rate': 2e-05, 'epoch': 0.78}
{'loss': 0.5612, 'learning_rate': 2e-05, 'epoch': 0.87}
{'loss': 0.5069, 'learning_rate': 2e-05, 'epoch': 0.96}
33%|█████████████████████████████████ | 115/345 [01:08<02:15, 1.69it/s][INFO|trainer.py:625] 2022-05-18 17:57:10,457 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence1, sentence2, idx. If sentence1, sentence2, idx are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2625] 2022-05-18 17:57:10,458 >> ***** Running Evaluation *****
[INFO|trainer.py:2627] 2022-05-18 17:57:10,458 >> Num examples = 408
[INFO|trainer.py:2630] 2022-05-18 17:57:10,458 >> Batch size = 32
05/18/2022 17:57:12 - INFO - datasets.metric - Removing /home/stas/.cache/huggingface/metrics/glue/mrpc/default_experiment-1-0.arrow3it/s]
{'eval_loss': 0.460205078125, 'eval_accuracy': 0.8112745098039216, 'eval_f1': 0.8701517706576728, 'eval_combined_score': 0.8407131402307972, 'eval_runtime': 1.5702, 'eval_samples_per_second': 259.84, 'eval_steps_per_second': 8.279, 'epoch': 1.0}
{'loss': 0.4829, 'learning_rate': 2e-05, 'epoch': 1.04}
{'loss': 0.4404, 'learning_rate': 2e-05, 'epoch': 1.13}
{'loss': 0.4361, 'learning_rate': 2e-05, 'epoch': 1.22}
{'loss': 0.3961, 'learning_rate': 2e-05, 'epoch': 1.3}
{'loss': 0.3944, 'learning_rate': 2e-05, 'epoch': 1.39}
{'loss': 0.4435, 'learning_rate': 2e-05, 'epoch': 1.48}
{'loss': 0.3121, 'learning_rate': 2e-05, 'epoch': 1.57}
52%|███████████████████████████████████████████████████▋ | 180/345 [01:47<01:38, 1.68it/s][2022-05-18 17:57:50,495] [INFO] [stage3.py:2240:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
{'loss': 0.3598, 'learning_rate': 2e-05, 'epoch': 1.65}
{'loss': 0.3626, 'learning_rate': 2e-05, 'epoch': 1.74}
{'loss': 0.3431, 'learning_rate': 2e-05, 'epoch': 1.83}
{'loss': 0.4219, 'learning_rate': 2e-05, 'epoch': 1.91}
{'loss': 0.3931, 'learning_rate': 2e-05, 'epoch': 2.0}
67%|██████████████████████████████████████████████████████████████████ | 230/345 [02:16<01:06, 1.72it/s][INFO|trainer.py:625] 2022-05-18 17:58:18,996 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence1, sentence2, idx. If sentence1, sentence2, idx are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2625] 2022-05-18 17:58:18,997 >> ***** Running Evaluation *****
[INFO|trainer.py:2627] 2022-05-18 17:58:18,997 >> Num examples = 408
[INFO|trainer.py:2630] 2022-05-18 17:58:18,997 >> Batch size = 32
05/18/2022 17:58:20 - INFO - datasets.metric - Removing /home/stas/.cache/huggingface/metrics/glue/mrpc/default_experiment-1-0.arrow2it/s]
{'eval_loss': 0.385986328125, 'eval_accuracy': 0.8284313725490197, 'eval_f1': 0.8776223776223777, 'eval_combined_score': 0.8530268750856986, 'eval_runtime': 1.3856, 'eval_samples_per_second': 294.452, 'eval_steps_per_second': 9.382, 'epoch': 2.0}
{'loss': 0.2824, 'learning_rate': 2e-05, 'epoch': 2.09}
{'loss': 0.2692, 'learning_rate': 2e-05, 'epoch': 2.17}
{'loss': 0.2422, 'learning_rate': 2e-05, 'epoch': 2.26}
{'loss': 0.2489, 'learning_rate': 2e-05, 'epoch': 2.35}
{'loss': 0.201, 'learning_rate': 2e-05, 'epoch': 2.43}
{'loss': 0.203, 'learning_rate': 2e-05, 'epoch': 2.52}
{'loss': 0.2521, 'learning_rate': 2e-05, 'epoch': 2.61}
{'loss': 0.2343, 'learning_rate': 2e-05, 'epoch': 2.7}
{'loss': 0.1918, 'learning_rate': 2e-05, 'epoch': 2.78}
{'loss': 0.2203, 'learning_rate': 2e-05, 'epoch': 2.87}
96%|██████████████████████████████████████████████████████████████████████████████████████████████▋ | 330/345 [03:16<00:08, 1.72it/s][2022-05-18 17:59:19,226] [INFO] [stage3.py:2240:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
{'loss': 0.2284, 'learning_rate': 2e-05, 'epoch': 2.96}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 345/345 [03:25<00:00, 1.73it/s][INFO|trainer.py:625] 2022-05-18 17:59:27,488 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence1, sentence2, idx. If sentence1, sentence2, idx are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2625] 2022-05-18 17:59:27,489 >> ***** Running Evaluation *****
[INFO|trainer.py:2627] 2022-05-18 17:59:27,489 >> Num examples = 408
[INFO|trainer.py:2630] 2022-05-18 17:59:27,489 >> Batch size = 32
05/18/2022 17:59:28 - INFO - datasets.metric - Removing /home/stas/.cache/huggingface/metrics/glue/mrpc/default_experiment-1-0.arrow4it/s]
{'eval_loss': 0.57470703125, 'eval_accuracy': 0.8063725490196079, 'eval_f1': 0.8715447154471545, 'eval_combined_score': 0.8389586322333812, 'eval_runtime': 1.3657, 'eval_samples_per_second': 298.75, 'eval_steps_per_second': 9.519, 'epoch': 3.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 345/345 [03:26<00:00, 1.73it/s][INFO|trainer.py:1671] 2022-05-18 17:59:28,855 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 206.6319, 'train_samples_per_second': 53.254, 'train_steps_per_second': 1.67, 'train_loss': 0.41815963966259057, 'epoch': 3.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 345/345 [03:29<00:00, 1.64it/s]
[INFO|trainer.py:2375] 2022-05-18 17:59:32,227 >> Saving model checkpoint to xxx
[INFO|configuration_utils.py:446] 2022-05-18 17:59:32,227 >> Configuration saved in xxx/config.json
[INFO|modeling_utils.py:1546] 2022-05-18 17:59:32,236 >> Model weights saved in xxx/pytorch_model.bin
[INFO|tokenization_utils_base.py:2108] 2022-05-18 17:59:32,236 >> tokenizer config file saved in xxx/tokenizer_config.json
[INFO|tokenization_utils_base.py:2114] 2022-05-18 17:59:32,236 >> Special tokens file saved in xxx/special_tokens_map.json
[2022-05-18 17:59:32,461] [INFO] [engine.py:3177:save_16bit_model] Saving model weights to xxx/pytorch_model.bin
***** train metrics *****
epoch = 3.0
train_loss = 0.4182
train_runtime = 0:03:26.63
train_samples = 3668
train_samples_per_second = 53.254
train_steps_per_second = 1.67
05/18/2022 17:59:32 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:625] 2022-05-18 17:59:32,618 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence1, sentence2, idx. If sentence1, sentence2, idx are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2625] 2022-05-18 17:59:32,620 >> ***** Running Evaluation *****
[INFO|trainer.py:2627] 2022-05-18 17:59:32,621 >> Num examples = 408
[INFO|trainer.py:2630] 2022-05-18 17:59:32,621 >> Batch size = 32
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:01<00:00, 9.54it/s]05/18/2022 17:59:34 - INFO - datasets.metric - Removing /home/stas/.cache/huggingface/metrics/glue/mrpc/default_experiment-1-0.arrow
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:01<00:00, 10.07it/s]
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.8064
eval_combined_score = 0.839
eval_f1 = 0.8715
eval_loss = 0.5747
eval_runtime = 0:00:01.39
eval_samples = 408
eval_samples_per_second = 292.087
eval_steps_per_second = 9.307
So perhaps start with my command line. I think the only differences are that I use tests/deepspeed/ds_config_zero3.json (which looks pretty similar), a larger batch size, and no wandb; everything else is the same as yours, I think.
torchrun --nproc_per_node=1 examples/pytorch/text-classification/run_glue.py \
--task_name mrpc --max_seq_len 128 --model_name_or_path bert-base-uncased \
--output_dir xxx --overwrite_output_dir --do_train --evaluation_strategy epoch \
--per_device_train_batch_size 32 --per_device_eval_batch_size 32 \
--gradient_accumulation_steps 1 --learning_rate 2e-5 --weight_decay 0.0 \
--max_grad_norm 1.0 --num_train_epochs 3 --lr_scheduler_type linear \
--warmup_steps 50 --logging_steps 10 --fp16 --fp16_full_eval --optim \
adamw_torch --deepspeed tests/deepspeed/ds_config_zero3.json
Clearly the shape-mismatch warning is the key clue, as you have correctly spotted. It basically means that the weights aren't being loaded correctly and the model probably starts from scratch because of that.
The main deepspeed config difference is:
- "type": "WarmupDecayLR",
+ "type": "WarmupLR",
but it shouldn't cause an issue with the pre-trained weights. I wonder why you see a different behavior.
Tried with your config file and it trains nicely as well (didn't run it to the end).
Hello Stas, thank you for the deep dive and the prompt reply. I just found a minor change that I had made in run_glue.py: I pass ignore_mismatched_sizes=True to the from_pretrained method, so that I can load a pretrained model whose number of output classes differs from that of the classification problem at hand.
model = AutoModelForSequenceClassification.from_pretrained(
model_args.model_name_or_path,
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
- use_auth_token=True if model_args.use_auth_token else None
+ use_auth_token=True if model_args.use_auth_token else None,
+ ignore_mismatched_sizes=True,
)
I can confirm that this is causing the issue. It is resulting in the shape mismatch warning and then poor performance. Below are the plots with and without this change.

Great to hear you found the cause.
In general, when you use DeepSpeed ZeRO Stage-3 and you see a shape of size 0, it's because the weights are sharded. The internals reconsolidate the weights for you in all the right places, but if you go off on your own you sometimes have to do it yourself. Just grep for deepspeed.zero.GatheredParameters for examples.
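For illustration, here is a minimal sketch of gathering a sharded parameter by hand; `model` is assumed to already be partitioned under ZeRO Stage-3, and the classifier attribute name is just an example, not the actual run_glue.py code:

import torch
import deepspeed

# Outside the context the parameter is a size-0 placeholder; inside it is
# gathered back to its full shape on every rank.
with deepspeed.zero.GatheredParameters(model.classifier.weight, modifier_rank=0):
    if torch.distributed.get_rank() == 0:
        print(model.classifier.weight.shape)              # full, un-sharded shape
        model.classifier.weight.data.normal_(0.0, 0.02)   # e.g. re-init the head
# on exit, changes made on modifier_rank are broadcast back to all ranks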
If you don't need any additional help you can close the Issue at any time.
If you have further questions please don't hesitate to ask.
I think fixing this is important, as many users will fine-tune pretrained models on tasks that likely have a different number of output classes than the pretrained model. Maybe an option/boolean flag to skip deepspeed.zero.Init, or logic in from_pretrained to load and then partition layers across the GPUs, would resolve this for small to medium models.
Please give me a full setup that I can reproduce your issue with and I will try to come up with a solution.
Also, if you write your own training loop you definitely aren't forced to go through deepspeed.zero.Init - it doesn't happen by default; you have to call it. See: https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models
Also, deepspeed.zero.Init(enabled=False) will not pre-shard the model at load time. I wonder if we could ask the DeepSpeed developers to add a new ds_config variable that controls this via the config file - that way the user could easily turn it off at will. What do you think?
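For reference, a rough sketch of the pattern from that docs page; TinyModel is a hypothetical stand-in for a real network, and the flag is imaginary - it only shows what a config-controlled switch could toggle:

import deepspeed
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(768, 2)

use_presharding = False  # imaginary switch; with False, zero.Init is a no-op
with deepspeed.zero.Init(enabled=use_presharding):
    # with enabled=True the parameters would be sharded across GPUs as the
    # modules are constructed; with enabled=False the model is built normally
    model = TinyModel()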
Exact setup to reproduce the above behaviour:
- Official run_glue.py script with the following change.
model = AutoModelForSequenceClassification.from_pretrained(
model_args.model_name_or_path,
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
- use_auth_token=True if model_args.use_auth_token else None
+ use_auth_token=True if model_args.use_auth_token else None,
+ ignore_mismatched_sizes=True,
)
- Below ZeRO Stage-3 config
zero3_config.json:
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
- bash script to run the finetuning of bert-base-uncased on the MRPC dataset using ZeRO Stage-3.
#!/bin/bash
time torchrun --nproc_per_node=2 run_glue.py \
--task_name "mrpc" \
--max_seq_len 128 \
--model_name_or_path "bert-base-uncased" \
--output_dir "./glue/mrpc_deepspeed_stage3_trainer" \
--overwrite_output_dir \
--do_train \
--evaluation_strategy "epoch" \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-5 \
--weight_decay 0.0 \
--max_grad_norm 1.0 \
--num_train_epochs 3 \
--lr_scheduler_type "linear" \
--warmup_steps 50 \
--logging_steps 100 \
--fp16 \
--fp16_full_eval \
--optim "adamw_torch" \
--report_to "wandb" \
--deepspeed "zero3_config.json"
The issue is caused by the logic at modeling_utils.py#L2182. There, the ZeRO-3 partitioned (size-0) model parameters are checked against the pretrained state_dict, so every key appears mismatched and is deleted from the pretrained state_dict.
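To illustrate what that check sees (a hand-wavy sketch, assuming `model` was constructed under ZeRO-3 / zero.Init; the attribute names follow DeepSpeed's partitioning internals):

# Each partitioned parameter's .data is a size-0 placeholder, while DeepSpeed
# keeps the real shape on the .ds_shape attribute. A naive comparison of
# param.shape against the checkpoint tensor therefore never matches, which is
# what trips the ignore_mismatched_sizes logic.
for name, param in model.named_parameters():
    print(name, tuple(param.shape), getattr(param, "ds_shape", None))
    # e.g. "classifier.weight (0,) torch.Size([2, 768])"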
Thank you, @pacman100
Please try this PR https://github.com/huggingface/transformers/pull/17373
Hello @stas00, yes, the above PR solves this issue. Thank you 😄. Below are the plots from finetuning microsoft/deberta-v2-xlarge-mnli (the pretrained model has 3 labels) on the MRPC dataset (this task has 2 labels).

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.