DeepSpeedExamples issues

When training GPT-2 with ZeRO-3, some parameters are missing from the saved model

I replaced the model in steps 1 and 2 with a GPT-2 model: [IDEA-CCNL/Wenzhong-GPT2-110M](https://huggingface.co/IDEA-CCNL/Wenzhong-GPT2-110M). I then trained with ZeRO-3 using the following command: ``` python train.py --actor-zero-stage 3 --actor-model...

koking0

I ran into this error when evaluating the model in the SFT step: RuntimeError: Error(s) in loading state_dict for OPTForCausalLM: size mismatch for model.decoder.embed_tokens.weight: copying a param with shape torch.Size([50272, 2048]) from checkpoint, the shape in current model is torch.Size([50265, 2048]). size mismatch for lm_head.weight: copying a param with shape torch.Size([50272, 2048]) from checkpoint, the shape in current model is torch.Size([50265, 2048]). You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

1

Modas-Li

Release date for LLaMA system support

I'm very interested in the new features that have been announced. Can you please provide us with some information on when we can expect "System support and finetuning for LLaMA"...

rockstone533

Add snip_momentum structured pruning example with 80% sparsity ratio

This PR demonstrates the snip_momentum structured pruning algorithm implemented [here](https://github.com/microsoft/DeepSpeed/pull/3300). Users can reproduce the results below by running `source ./bash_script/pruning_sparse_snip_momentum.sh` with the PR mentioned at...

ftian1

fix ppo_trainer generate and scores calculation in stage 2

1

### A quick fix for bugs I found while going through the code 1. Wrong score calculation in step 2 reward model training. It might be related to issue 334: https://github.com/microsoft/DeepSpeedExamples/issues/334 2. Wrongly...

nepetune233

Zheweiyao/fixing training acc

1. Select data for better convergence 2. Make dropout an option 3. Increase step-1 training epochs 4. Script updates 5. Other things

yaozhewei

Error when using BLOOMZ for reward model training

1

Hello, I'm trying to use BLOOMZ for reward model training and get this error: ``` Traceback (most recent call last): File "/users5/xydu/ChatGPT/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_node/../../main.py", line 349, in main() File "/users5/xydu/ChatGPT/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_node/../../main.py", line 303, in...

Luoyang144

deespeed chat

Step1 training failed

1

![image](https://user-images.githubusercontent.com/57927336/232401115-a1e1788e-c272-4742-9c16-e35fc266f1df.png) ![image](https://user-images.githubusercontent.com/57927336/232401211-fe634f9b-ed4f-469d-94b1-6c994c19ffee.png) ![image](https://user-images.githubusercontent.com/57927336/232401389-ed3e9057-766f-46c9-bc6e-563d788a2b45.png) My GPU is a V100, but I get this error when running the step 1 training script.

omoiji