DeepSpeedExamples

Example models using DeepSpeed

Results: 274 DeepSpeedExamples issues, sorted by recently updated

@yaozhewei First, I'd like to extend my gratitude for the incredible work you've been doing with DeepSpeedExamples. It's truly commendable and has been a great resource for the community. As...

Settings are as follows: HE is disabled (HE + ZeRO-2 causes an error), pp_epochs=1, num_train_epochs=1, disable_actor_dropout is set, and per_device_train_batch_size and per_device_mini_train_batch_size are both 2. Actor loss: ![image](https://github.com/microsoft/DeepSpeedExamples/assets/32030790/5aa4c203-7a46-4f3f-b1ef-281eb8323dc4) Inference demo: ![image](https://github.com/microsoft/DeepSpeedExamples/assets/32030790/53027cf6-6ae8-4687-a5b6-c0749ae809a6) Meanwhile, actor_ema seems normal, but it's...

deepspeed chat
modeling

If the label corresponding to the pad token is not set to the ignore index, how do we avoid calculating losses on pad tokens?
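One common way to handle this (a minimal sketch, not the repository's code; the function name and shapes below are illustrative) is to overwrite the labels at pad positions with the ignore index before computing the loss, so those positions contribute nothing:

```
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # the index that F.cross_entropy skips by default

def causal_lm_loss(logits, input_ids, pad_token_id):
    # Copy the inputs as labels and mask out the pad positions.
    labels = input_ids.clone()
    labels[input_ids == pad_token_id] = IGNORE_INDEX
    # Shift for causal LM: position t predicts token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```

Note that if pad_token_id equals eos_token_id (as in the OPT setup), masking by token id also ignores the real end-of-sequence token, so masking from the attention mask instead may be preferable.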

In create_hf_model, what's the purpose of resizing the model embedding?

```
model.config.end_token_id = tokenizer.eos_token_id
model.config.pad_token_id = model.config.eos_token_id
model.resize_token_embeddings(int(
    8 *
    math.ceil(len(tokenizer) /...
```
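For context, here is a minimal sketch of that pattern (illustrative, not the file's exact code; the model name is just an example): the vocabulary dimension of the embedding is padded up to the next multiple of 8, which keeps the embedding and output-projection shapes aligned with tensor-core tile sizes, and the extra rows are never indexed by real tokens.

```
import math
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

# Round the vocabulary size up to the next multiple of 8 so the
# embedding matrix shape is friendly to tensor cores / fused kernels.
new_vocab_size = int(8 * math.ceil(len(tokenizer) / 8.0))
model.resize_token_embeddings(new_vocab_size)
```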

Hello, I have two H100 devices and I'm running an application via DeepSpeed-Chat. I ran Llama2-Chat-hf 3-4 times before and finished the training. Either the training starts and explodes...

I got something wrong; please tell me how to generate the file.

```
File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 251, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: ...
```

In most cases, I think we need model parallelism more than data parallelism. I hope the model can be trained with model parallelism, because the current model is very...

I trained the PPO model using GPT: I changed the model_name_or_path option from opt to gpt2. I passed step 1 and step 2, but an error occurred in step...

deepspeed chat
new-config

I encountered the following error while attempting to run the pipeline_parallelism example in the `training` directory:

```
ValueError: Expected input batch_size (8) to match target batch_size (4).
RuntimeError: Mismatch in shape: grad_output[0]...
```
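One way to localize this kind of mismatch (a sketch under the assumption that the example builds a `PipelineModule` with a `loss_fn`, which is how DeepSpeed's pipeline engine consumes (inputs, labels) micro-batches; the layer list and config path here are placeholders) is to wrap the loss function with a shape check so the failing micro-batch reports both shapes. Launch it with the `deepspeed` launcher so distributed init succeeds:

```
import torch
import deepspeed
from deepspeed.pipe import PipelineModule

def checked_loss(outputs, labels):
    # The pipeline engine calls loss_fn once per micro-batch; the leading
    # (batch) dimensions of outputs and labels must match.
    assert outputs.size(0) == labels.size(0), \
        f"micro-batch mismatch: outputs {tuple(outputs.shape)} vs labels {tuple(labels.shape)}"
    return torch.nn.functional.cross_entropy(outputs, labels)

deepspeed.init_distributed()

# Placeholder layers standing in for the example's model definition.
layers = [torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10)]
net = PipelineModule(layers=layers, num_stages=2, loss_fn=checked_loss)

engine, _, _, _ = deepspeed.initialize(
    model=net,
    model_parameters=[p for p in net.parameters() if p.requires_grad],
    config="ds_config.json",  # assumed config; micro-batch size comes from here
)
# engine.train_batch(data_iter=iter(train_loader)) then pulls (inputs, labels)
# micro-batches from the iterator.
```

Often this kind of error points to the data iterator yielding inputs and labels with different batch dimensions, or to a DataLoader batch size that disagrees with train_micro_batch_size_per_gpu in the DeepSpeed config.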

Hi there, I notice that in step 2 the reported scores (i.e. `chosen_mean_scores` and `reject_mean_scores`) are taken as the description says: > ... either the end token of the sequence...
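For reference, a minimal sketch (not the repository's exact code; the helper name and the assumption of a per-token value head are mine) of taking a scalar score at the last non-padding token of each sequence:

```
import torch

def end_token_scores(values, input_ids, pad_token_id):
    # values: [batch, seq_len] per-token scores from a value head.
    # Pick the score at the last non-pad position of each sequence.
    non_pad = (input_ids != pad_token_id)
    last_idx = non_pad.sum(dim=1).clamp(min=1) - 1              # [batch]
    return values.gather(1, last_idx.unsqueeze(1)).squeeze(1)   # [batch]
```

Note that when the pad token is reused as the EOS token, the first pad position may actually be the real end-of-sequence token, which is worth keeping in mind when comparing chosen and rejected scores.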