DeepSpeedExamples
Example models using DeepSpeed
@yaozhewei First, I'd like to extend my gratitude for the incredible work you've been doing with DeepSpeedExamples. It's truly commendable and has been a great resource for the community. As...
Settings are as follows: HE disabled (HE + ZeRO-2 causes an error), pp_epochs=1, num_train_epochs=1, disable_actor_dropout enabled, and per_device_train_batch_size and per_device_mini_train_batch_size both set to 2. Actor loss: ![image](https://github.com/microsoft/DeepSpeedExamples/assets/32030790/5aa4c203-7a46-4f3f-b1ef-281eb8323dc4) Inference demo: ![image](https://github.com/microsoft/DeepSpeedExamples/assets/32030790/53027cf6-6ae8-4687-a5b6-c0749ae809a6) Meanwhile actor_ema seems normal, but it's...
If the label corresponding to the pad token is not set to the ignore index, how can we avoid computing loss on pad tokens?
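A common way to skip pad tokens in the loss, when the dataloader has not already done it, is to overwrite pad positions in the labels with the loss function's ignore index (-100, PyTorch CrossEntropyLoss's default) before computing the loss. A minimal dependency-free sketch of that masking logic (the helper names are hypothetical, not from DeepSpeedExamples):

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_pad_labels(input_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Build labels from input_ids, replacing pad positions with
    ignore_index so the loss function skips them entirely."""
    return [tok if tok != pad_token_id else ignore_index for tok in input_ids]

def masked_mean_loss(per_token_losses, labels, ignore_index=IGNORE_INDEX):
    """Average per-token losses only over positions whose label is
    not the ignore index (mimics what ignore_index does in PyTorch)."""
    kept = [l for l, y in zip(per_token_losses, labels) if y != ignore_index]
    return sum(kept) / len(kept) if kept else 0.0
```

With real models the same effect is achieved by passing the masked labels to CrossEntropyLoss(ignore_index=-100); the pad positions then contribute neither to the loss nor to its gradient.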
In create_hf_model, what's the purpose of resizing the model embedding? `model.config.end_token_id = tokenizer.eos_token_id`, `model.config.pad_token_id = model.config.eos_token_id`, `model.resize_token_embeddings(int(8 * math.ceil(len(tokenizer) /...`
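For context, the rounding in that snippet appears to pad the vocabulary size up to a multiple of 8, which keeps the embedding matrix dimensions aligned for efficient tensor-core kernels. A sketch of just the size computation (the function name is hypothetical):

```python
import math

def padded_vocab_size(tokenizer_len, multiple=8):
    """Round the vocabulary size up to the next multiple of `multiple`,
    matching the 8 * ceil(len(tokenizer) / 8) pattern in the snippet."""
    return int(multiple * math.ceil(tokenizer_len / multiple))
```

resize_token_embeddings then grows (or shrinks) the embedding table to this size; newly added rows are freshly initialized and only get trained if tokens actually map to them.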
Hello, I have two H100 devices. I'm running an application via DeepSpeedChat. I ran LLama2-Chat-hf 3-4 times before and finished the training. Either the training starts and explodes...
I got something wrong; please tell me how to generate the file. FileNotFoundError in File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 251, in __init__: super(_open_file, self).__init__(open(name, mode)) FileNotFoundError: super(_open_file, self).__init__(open(name,...
In most cases, I think we need model parallelism more than data parallelism. We hope the model can be trained in parallel across devices, because the current model is very...
I trained the PPO model using GPT: I changed model_name_or_path from opt to gpt2. I passed step 1 and step 2, but an error occurred in step...
I encountered the following error while attempting to run the pipeline_parallelism example in the `training` directory: ``` ValueError: Expected input batch_size (8) to match target batch_size (4). RuntimeError: Mismatch in shape: grad_output[0]... ```
Hi there, I notice that in step 2 the reported scores (i.e. `chosen_mean_scores` and `reject_mean_scores`) are computed as the description says: > ... either the end token of the sequence...
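That description suggests the reward is read off at the end token of each sequence, i.e. the last non-pad position. A minimal sketch of that selection, assuming per-token value predictions and a known pad token id (the helper name is hypothetical, not the repo's exact code):

```python
def end_token_score(per_token_values, input_ids, pad_token_id):
    """Return the value predicted at the sequence's end token,
    defined here as the last position whose token is not padding."""
    last = max(i for i, tok in enumerate(input_ids) if tok != pad_token_id)
    return per_token_values[last]
```

If two responses share a prompt and diverge later, an alternative (also mentioned in the step 2 description) is to score at the first token where they diverge rather than at the very end.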