DeepSpeedExamples
Example models using DeepSpeed
@yaozhewei First, I'd like to extend my gratitude for the incredible work you've been doing with DeepSpeedExamples. It's truly commendable and has been a great resource for the community. As...
Settings are as follows: HE disabled (HE + ZeRO-2 causes an error), pp_epochs=1, num_train_epochs=1, disable_actor_dropout enabled, and per_device_train_batch_size and per_device_mini_train_batch_size both set to 2. Actor loss: ![image](https://github.com/microsoft/DeepSpeedExamples/assets/32030790/5aa4c203-7a46-4f3f-b1ef-281eb8323dc4) Inference demo: ![image](https://github.com/microsoft/DeepSpeedExamples/assets/32030790/53027cf6-6ae8-4687-a5b6-c0749ae809a6) Meanwhile actor_ema seems normal, but it's...
If the label corresponding to the pad token is not set to the ignore index, how can we avoid computing loss on pad tokens?
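A common way to skip pad tokens in the loss, when the dataloader has not already done it, is to overwrite pad positions in the labels with the loss function's ignore index (-100, PyTorch CrossEntropyLoss's default) before computing the loss. A minimal dependency-free sketch of that masking logic (the helper names are hypothetical, not from DeepSpeedExamples):

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_pad_labels(input_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Build labels from input_ids, replacing pad positions with
    ignore_index so the loss function skips them entirely."""
    return [tok if tok != pad_token_id else ignore_index for tok in input_ids]

def masked_mean_loss(per_token_losses, labels, ignore_index=IGNORE_INDEX):
    """Average per-token losses only over positions whose label is
    not the ignore index (mimics what ignore_index does in PyTorch)."""
    kept = [l for l, y in zip(per_token_losses, labels) if y != ignore_index]
    return sum(kept) / len(kept) if kept else 0.0
```

With real models the same effect is achieved by passing the masked labels to CrossEntropyLoss(ignore_index=-100); the pad positions then contribute neither to the loss nor to its gradient.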
In create_hf_model, what's the purpose of resizing the model embedding? `model.config.end_token_id = tokenizer.eos_token_id`, `model.config.pad_token_id = model.config.eos_token_id`, `model.resize_token_embeddings(int(8 * math.ceil(len(tokenizer) /...`
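For context, the rounding in that snippet appears to pad the vocabulary size up to a multiple of 8, which keeps the embedding matrix dimensions aligned for efficient tensor-core kernels. A sketch of just the size computation (the function name is hypothetical):

```python
import math

def padded_vocab_size(tokenizer_len, multiple=8):
    """Round the vocabulary size up to the next multiple of `multiple`,
    matching the 8 * ceil(len(tokenizer) / 8) pattern in the snippet."""
    return int(multiple * math.ceil(tokenizer_len / multiple))
```

resize_token_embeddings then grows (or shrinks) the embedding table to this size; newly added rows are freshly initialized and only get trained if tokens actually map to them.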
Hello, I have two H100 devices. I'm running an application via DeepSpeedChat. I ran LLama2-Chat-hf 3-4 times before and finished the training. Either the training starts and explodes...
I got something wrong; please tell me how to generate the file. FileNotFoundError in File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 251, in __init__: super(_open_file, self).__init__(open(name, mode)) FileNotFoundError: super(_open_file, self).__init__(open(name,...
In most cases, I think we need model parallelism more than data parallelism. We hope the model can be trained in parallel across devices, because the current model is very...
I trained the PPO model using GPT: I changed model_name_or_path from opt to gpt2. I passed step 1 and step 2, but an error occurred in step...
I encountered the following error while attempting to run the pipeline_parallelism example in the `training` directory: ``` ValueError: Expected input batch_size (8) to match target batch_size (4). RuntimeError: Mismatch in shape: grad_output[0]... ```
Hi there, I notice that in step 2 the reported scores (i.e. `chosen_mean_scores` and `reject_mean_scores`) are computed as the description says: > ... either the end token of the sequence...
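That description suggests the reward is read off at the end token of each sequence, i.e. the last non-pad position. A minimal sketch of that selection, assuming per-token value predictions and a known pad token id (the helper name is hypothetical, not the repo's exact code):

```python
def end_token_score(per_token_values, input_ids, pad_token_id):
    """Return the value predicted at the sequence's end token,
    defined here as the last position whose token is not padding."""
    last = max(i for i, tok in enumerate(input_ids) if tok != pad_token_id)
    return per_token_values[last]
```

If two responses share a prompt and diverge later, an alternative (also mentioned in the step 2 description) is to score at the first token where they diverge rather than at the very end.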