Issue with loading a pretrained model using DeepSpeed ZeRO Stage 3
System Info
- `transformers` version: 4.19.0.dev0
- Platform: Linux-5.4.0-90-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.12.0.dev20220505+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes (deepspeed zero stage-3)
Who can help?
@stas00 @sgugger
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Steps to reproduce the behaviour:
- Official run_glue.py script
- Below ZeRO Stage-3 config
zero3_config.json:
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
- bash script to run the finetuning of bert-base-uncased on the MRPC dataset using ZeRO Stage-3.
#!/bin/bash
time torchrun --nproc_per_node=2 run_glue.py \
--task_name "mrpc" \
--max_seq_len 128 \
--model_name_or_path "bert-base-uncased" \
--output_dir "./glue/mrpc_deepspeed_stage3_trainer" \
--overwrite_output_dir \
--do_train \
--evaluation_strategy "epoch" \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-5 \
--weight_decay 0.0 \
--max_grad_norm 1.0 \
--num_train_epochs 3 \
--lr_scheduler_type "linear" \
--warmup_steps 50 \
--logging_steps 100 \
--fp16 \
--fp16_full_eval \
--optim "adamw_torch" \
--report_to "wandb" \
--deepspeed "zero3_config.json"
- Relevant output snippets. The first one shows the weird behaviour wherein the model isn't being properly initialized with the pretrained weights. The second shows the eval metrics at random-chance performance.

Expected behavior
The model should be properly initialized with the pretrained weights when using DeepSpeed ZeRO Stage-3. This should resolve the poor model performance being observed.
Sounds like a potential problem with pt-nightly?
It works just fine on pt-1.11 - this is adapted to use the files from the repo directly:
torchrun --nproc_per_node=2 examples/pytorch/text-classification/run_glue.py \
--task_name mrpc --max_seq_len 128 --model_name_or_path bert-base-uncased \
--output_dir xxx --overwrite_output_dir --do_train --evaluation_strategy epoch \
--per_device_train_batch_size 1 --per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 --learning_rate 2e-5 --weight_decay 0.0 \
--max_grad_norm 1.0 --num_train_epochs 3 --lr_scheduler_type linear \
--warmup_steps 50 --logging_steps 100 --fp16 --fp16_full_eval --optim \
adamw_torch --deepspeed tests/deepspeed/ds_config_zero3.json
But I need to look more closely, since you're reporting quality issues rather than an outright failure. Will retest with 1.12 and then check the log closely.
pt-nightly works just fine
I get a very nice learning curve:
[INFO|trainer.py:1428] 2022-05-18 17:56:02,223 >> ***** Running training *****
[INFO|trainer.py:1429] 2022-05-18 17:56:02,224 >> Num examples = 3668
[INFO|trainer.py:1430] 2022-05-18 17:56:02,224 >> Num Epochs = 3
[INFO|trainer.py:1431] 2022-05-18 17:56:02,224 >> Instantaneous batch size per device = 32
[INFO|trainer.py:1432] 2022-05-18 17:56:02,224 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1433] 2022-05-18 17:56:02,224 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1434] 2022-05-18 17:56:02,224 >> Total optimization steps = 345
0%| | 0/345 [00:00<?, ?it/s][2022-05-18 17:56:02,941] [INFO] [stage3.py:2240:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 65536
0%|▎ | 1/345 [00:00<04:04, 1.41it/s][2022-05-18 17:56:03,946] [INFO] [stage3.py:2240:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768.0
{'loss': 1.1734, 'learning_rate': 1.0631029208133474e-05, 'epoch': 0.09}
{'loss': 0.8276, 'learning_rate': 1.4776864828686414e-05, 'epoch': 0.17}
{'loss': 0.6035, 'learning_rate': 1.7035710196752873e-05, 'epoch': 0.26}
{'loss': 0.5612, 'learning_rate': 1.859695689252868e-05, 'epoch': 0.35}
{'loss': 0.5857, 'learning_rate': 1.9791299823832263e-05, 'epoch': 0.43}
{'loss': 0.5462, 'learning_rate': 2e-05, 'epoch': 0.52}
{'loss': 0.5273, 'learning_rate': 2e-05, 'epoch': 0.61}
{'loss': 0.5543, 'learning_rate': 2e-05, 'epoch': 0.7}
{'loss': 0.5658, 'learning_rate': 2e-05, 'epoch': 0.78}
{'loss': 0.5612, 'learning_rate': 2e-05, 'epoch': 0.87}
{'loss': 0.5069, 'learning_rate': 2e-05, 'epoch': 0.96}
33%|█████████████████████████████████ | 115/345 [01:08<02:15, 1.69it/s][INFO|trainer.py:625] 2022-05-18 17:57:10,457 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence1, sentence2, idx. If sentence1, sentence2, idx are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2625] 2022-05-18 17:57:10,458 >> ***** Running Evaluation *****
[INFO|trainer.py:2627] 2022-05-18 17:57:10,458 >> Num examples = 408
[INFO|trainer.py:2630] 2022-05-18 17:57:10,458 >> Batch size = 32
05/18/2022 17:57:12 - INFO - datasets.metric - Removing /home/stas/.cache/huggingface/metrics/glue/mrpc/default_experiment-1-0.arrow3it/s]
{'eval_loss': 0.460205078125, 'eval_accuracy': 0.8112745098039216, 'eval_f1': 0.8701517706576728, 'eval_combined_score': 0.8407131402307972, 'eval_runtime': 1.5702, 'eval_samples_per_second': 259.84, 'eval_steps_per_second': 8.279, 'epoch': 1.0}
{'loss': 0.4829, 'learning_rate': 2e-05, 'epoch': 1.04}
{'loss': 0.4404, 'learning_rate': 2e-05, 'epoch': 1.13}
{'loss': 0.4361, 'learning_rate': 2e-05, 'epoch': 1.22}
{'loss': 0.3961, 'learning_rate': 2e-05, 'epoch': 1.3}
{'loss': 0.3944, 'learning_rate': 2e-05, 'epoch': 1.39}
{'loss': 0.4435, 'learning_rate': 2e-05, 'epoch': 1.48}
{'loss': 0.3121, 'learning_rate': 2e-05, 'epoch': 1.57}
52%|███████████████████████████████████████████████████▋ | 180/345 [01:47<01:38, 1.68it/s][2022-05-18 17:57:50,495] [INFO] [stage3.py:2240:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
{'loss': 0.3598, 'learning_rate': 2e-05, 'epoch': 1.65}
{'loss': 0.3626, 'learning_rate': 2e-05, 'epoch': 1.74}
{'loss': 0.3431, 'learning_rate': 2e-05, 'epoch': 1.83}
{'loss': 0.4219, 'learning_rate': 2e-05, 'epoch': 1.91}
{'loss': 0.3931, 'learning_rate': 2e-05, 'epoch': 2.0}
67%|██████████████████████████████████████████████████████████████████ | 230/345 [02:16<01:06, 1.72it/s][INFO|trainer.py:625] 2022-05-18 17:58:18,996 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence1, sentence2, idx. If sentence1, sentence2, idx are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2625] 2022-05-18 17:58:18,997 >> ***** Running Evaluation *****
[INFO|trainer.py:2627] 2022-05-18 17:58:18,997 >> Num examples = 408
[INFO|trainer.py:2630] 2022-05-18 17:58:18,997 >> Batch size = 32
05/18/2022 17:58:20 - INFO - datasets.metric - Removing /home/stas/.cache/huggingface/metrics/glue/mrpc/default_experiment-1-0.arrow2it/s]
{'eval_loss': 0.385986328125, 'eval_accuracy': 0.8284313725490197, 'eval_f1': 0.8776223776223777, 'eval_combined_score': 0.8530268750856986, 'eval_runtime': 1.3856, 'eval_samples_per_second': 294.452, 'eval_steps_per_second': 9.382, 'epoch': 2.0}
{'loss': 0.2824, 'learning_rate': 2e-05, 'epoch': 2.09}
{'loss': 0.2692, 'learning_rate': 2e-05, 'epoch': 2.17}
{'loss': 0.2422, 'learning_rate': 2e-05, 'epoch': 2.26}
{'loss': 0.2489, 'learning_rate': 2e-05, 'epoch': 2.35}
{'loss': 0.201, 'learning_rate': 2e-05, 'epoch': 2.43}
{'loss': 0.203, 'learning_rate': 2e-05, 'epoch': 2.52}
{'loss': 0.2521, 'learning_rate': 2e-05, 'epoch': 2.61}
{'loss': 0.2343, 'learning_rate': 2e-05, 'epoch': 2.7}
{'loss': 0.1918, 'learning_rate': 2e-05, 'epoch': 2.78}
{'loss': 0.2203, 'learning_rate': 2e-05, 'epoch': 2.87}
96%|██████████████████████████████████████████████████████████████████████████████████████████████▋ | 330/345 [03:16<00:08, 1.72it/s][2022-05-18 17:59:19,226] [INFO] [stage3.py:2240:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
{'loss': 0.2284, 'learning_rate': 2e-05, 'epoch': 2.96}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 345/345 [03:25<00:00, 1.73it/s][INFO|trainer.py:625] 2022-05-18 17:59:27,488 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence1, sentence2, idx. If sentence1, sentence2, idx are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2625] 2022-05-18 17:59:27,489 >> ***** Running Evaluation *****
[INFO|trainer.py:2627] 2022-05-18 17:59:27,489 >> Num examples = 408
[INFO|trainer.py:2630] 2022-05-18 17:59:27,489 >> Batch size = 32
05/18/2022 17:59:28 - INFO - datasets.metric - Removing /home/stas/.cache/huggingface/metrics/glue/mrpc/default_experiment-1-0.arrow4it/s]
{'eval_loss': 0.57470703125, 'eval_accuracy': 0.8063725490196079, 'eval_f1': 0.8715447154471545, 'eval_combined_score': 0.8389586322333812, 'eval_runtime': 1.3657, 'eval_samples_per_second': 298.75, 'eval_steps_per_second': 9.519, 'epoch': 3.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 345/345 [03:26<00:00, 1.73it/s][INFO|trainer.py:1671] 2022-05-18 17:59:28,855 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 206.6319, 'train_samples_per_second': 53.254, 'train_steps_per_second': 1.67, 'train_loss': 0.41815963966259057, 'epoch': 3.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 345/345 [03:29<00:00, 1.64it/s]
[INFO|trainer.py:2375] 2022-05-18 17:59:32,227 >> Saving model checkpoint to xxx
[INFO|configuration_utils.py:446] 2022-05-18 17:59:32,227 >> Configuration saved in xxx/config.json
[INFO|modeling_utils.py:1546] 2022-05-18 17:59:32,236 >> Model weights saved in xxx/pytorch_model.bin
[INFO|tokenization_utils_base.py:2108] 2022-05-18 17:59:32,236 >> tokenizer config file saved in xxx/tokenizer_config.json
[INFO|tokenization_utils_base.py:2114] 2022-05-18 17:59:32,236 >> Special tokens file saved in xxx/special_tokens_map.json
[2022-05-18 17:59:32,461] [INFO] [engine.py:3177:save_16bit_model] Saving model weights to xxx/pytorch_model.bin
***** train metrics *****
epoch = 3.0
train_loss = 0.4182
train_runtime = 0:03:26.63
train_samples = 3668
train_samples_per_second = 53.254
train_steps_per_second = 1.67
05/18/2022 17:59:32 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:625] 2022-05-18 17:59:32,618 >> The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence1, sentence2, idx. If sentence1, sentence2, idx are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
[INFO|trainer.py:2625] 2022-05-18 17:59:32,620 >> ***** Running Evaluation *****
[INFO|trainer.py:2627] 2022-05-18 17:59:32,621 >> Num examples = 408
[INFO|trainer.py:2630] 2022-05-18 17:59:32,621 >> Batch size = 32
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:01<00:00, 9.54it/s]05/18/2022 17:59:34 - INFO - datasets.metric - Removing /home/stas/.cache/huggingface/metrics/glue/mrpc/default_experiment-1-0.arrow
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:01<00:00, 10.07it/s]
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.8064
eval_combined_score = 0.839
eval_f1 = 0.8715
eval_loss = 0.5747
eval_runtime = 0:00:01.39
eval_samples = 408
eval_samples_per_second = 292.087
eval_steps_per_second = 9.307
So perhaps start with my command line. I think the only differences are that I use tests/deepspeed/ds_config_zero3.json (which looks pretty similar), a larger batch size, and no wandb; everything else is the same as yours, I think.
torchrun --nproc_per_node=1 examples/pytorch/text-classification/run_glue.py \
--task_name mrpc --max_seq_len 128 --model_name_or_path bert-base-uncased \
--output_dir xxx --overwrite_output_dir --do_train --evaluation_strategy epoch \
--per_device_train_batch_size 32 --per_device_eval_batch_size 32 \
--gradient_accumulation_steps 1 --learning_rate 2e-5 --weight_decay 0.0 \
--max_grad_norm 1.0 --num_train_epochs 3 --lr_scheduler_type linear \
--warmup_steps 50 --logging_steps 10 --fp16 --fp16_full_eval --optim \
adamw_torch --deepspeed tests/deepspeed/ds_config_zero3.json
Clearly the shape-mismatch warning is the key clue, as you have correctly spotted. It basically means that the weights aren't being loaded correctly and the model probably starts from scratch because of that.
The main deepspeed config difference is:
- "type": "WarmupDecayLR",
+ "type": "WarmupLR",
but it shouldn't cause an issue with the pre-trained weights. I wonder why you see a different behavior.
Tried with your config file and it trains nicely as well (didn't run it to the end).
Hello Stas, thank you for the deep dive and the prompt reply. I just found a minor change that I had made in run_glue.py: I pass ignore_mismatched_sizes=True to the from_pretrained method, so that I can load a pretrained model whose number of output classes differs from that of the classification problem at hand.
model = AutoModelForSequenceClassification.from_pretrained(
model_args.model_name_or_path,
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
- use_auth_token=True if model_args.use_auth_token else None
+ use_auth_token=True if model_args.use_auth_token else None,
+ ignore_mismatched_sizes=True,
)
I can confirm that this is causing the issue. It is resulting in the shape mismatch warning and then poor performance. Below are the plots with and without this change.

Great to hear you found the cause.
In general, when you use DeepSpeed ZeRO Stage-3 and you see a shape of size 0, it's because the weights are sharded. The internals reconsolidate the weights for you in all the right places, but if you go off on your own you sometimes have to do it yourself. Just grep for deepspeed.zero.GatheredParameters for examples.
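For illustration, here is a minimal sketch of gathering a sharded parameter by hand; `model` is assumed to already be partitioned under ZeRO Stage-3, and the classifier attribute name is just an example, not the actual run_glue.py code:

import torch
import deepspeed

# Outside the context the parameter is a size-0 placeholder; inside it is
# gathered back to its full shape on every rank.
with deepspeed.zero.GatheredParameters(model.classifier.weight, modifier_rank=0):
    if torch.distributed.get_rank() == 0:
        print(model.classifier.weight.shape)              # full, un-sharded shape
        model.classifier.weight.data.normal_(0.0, 0.02)   # e.g. re-init the head
# on exit, changes made on modifier_rank are broadcast back to all ranks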
If you don't need any additional help you can close the Issue at any time.
If you have further questions please don't hesitate to ask.
I think fixing this is important, as many users will fine-tune pretrained models on tasks that likely have a different number of output classes than the pretrained model. Maybe an option/boolean flag to skip deepspeed.zero.Init, or logic in from_pretrained to load and then partition layers across the GPUs, would resolve this for small to medium models.
Please give me a full setup that I can reproduce your issue with and I will try to come up with a solution.
Also, if you write your own training loop you definitely aren't forced to go through deepspeed.zero.Init - it doesn't happen by default; you have to call it. See: https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models
Also, deepspeed.zero.Init(enabled=False) will not pre-shard the model at load time. I wonder if we could ask the DeepSpeed developers to add a new ds_config variable that controls this via the config file - that way the user could easily turn it off at will. What do you think?
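For reference, a rough sketch of the pattern from that docs page; TinyModel is a hypothetical stand-in for a real network, and the flag is imaginary - it only shows what a config-controlled switch could toggle:

import deepspeed
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(768, 2)

use_presharding = False  # imaginary switch; with False, zero.Init is a no-op
with deepspeed.zero.Init(enabled=use_presharding):
    # with enabled=True the parameters would be sharded across GPUs as the
    # modules are constructed; with enabled=False the model is built normally
    model = TinyModel()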
Exact setup to reproduce the above behaviour:
- Official run_glue.py script with the following change.
model = AutoModelForSequenceClassification.from_pretrained(
model_args.model_name_or_path,
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
- use_auth_token=True if model_args.use_auth_token else None
+ use_auth_token=True if model_args.use_auth_token else None,
+ ignore_mismatched_sizes=True,
)
- Below ZeRO Stage-3 config
zero3_config.json:
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
- bash script to run the finetuning of bert-base-uncased on the MRPC dataset using ZeRO Stage-3.
#!/bin/bash
time torchrun --nproc_per_node=2 run_glue.py \
--task_name "mrpc" \
--max_seq_len 128 \
--model_name_or_path "bert-base-uncased" \
--output_dir "./glue/mrpc_deepspeed_stage3_trainer" \
--overwrite_output_dir \
--do_train \
--evaluation_strategy "epoch" \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-5 \
--weight_decay 0.0 \
--max_grad_norm 1.0 \
--num_train_epochs 3 \
--lr_scheduler_type "linear" \
--warmup_steps 50 \
--logging_steps 100 \
--fp16 \
--fp16_full_eval \
--optim "adamw_torch" \
--report_to "wandb" \
--deepspeed "zero3_config.json"
The issue is caused by the logic at modeling_utils.py#L2182. There, the ZeRO-3 partitioned (size-0) model parameters are checked against the pretrained state_dict, so every key appears mismatched and is deleted from the pretrained state_dict.
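To illustrate what that check sees (a hand-wavy sketch, assuming `model` was constructed under ZeRO-3 / zero.Init; the attribute names follow DeepSpeed's partitioning internals):

# Each partitioned parameter's .data is a size-0 placeholder, while DeepSpeed
# keeps the real shape on the .ds_shape attribute. A naive comparison of
# param.shape against the checkpoint tensor therefore never matches, which is
# what trips the ignore_mismatched_sizes logic.
for name, param in model.named_parameters():
    print(name, tuple(param.shape), getattr(param, "ds_shape", None))
    # e.g. "classifier.weight (0,) torch.Size([2, 768])"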
Thank you, @pacman100
Please try this PR https://github.com/huggingface/transformers/pull/17373
Hello @stas00, yes, the above PR solves this issue. Thank you 😄. Below are the plots from finetuning microsoft/deberta-v2-xlarge-mnli (the pretrained model has 3 labels) on the MRPC dataset (this task has 2 labels).

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.