DeepSpeedExamples
Example models using DeepSpeed
When I run the demo (step3_rlhf_finetuning/training_scripts/opt/single_node/run_1.3b.sh) without any change, the reward does not increase. Is this normal? I would appreciate it if anyone can provide a normal reward...
When training the PPO model, I turned on gradient_checkpointing_enable. If you want to calculate the ptx loss, the actor will forward twice. In your code, these two losses are executed...
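For context, the two-forward-pass pattern being asked about looks roughly like the sketch below. This is a minimal sketch, assuming the actor has already been wrapped by `deepspeed.initialize` and gradient checkpointing is enabled on the underlying Hugging Face model; `compute_policy_loss` and the `ptx_coef` default are illustrative placeholders, not the exact identifiers used in DeepSpeed-Chat.

```python
def train_actor_step(actor_engine, exp_batch, ptx_batch, ptx_coef=1.0):
    # Forward/backward pass 1: PPO policy loss on the generated experience.
    out = actor_engine(exp_batch["input_ids"],
                       attention_mask=exp_batch["attention_mask"])
    policy_loss = compute_policy_loss(out.logits, exp_batch)  # placeholder
    actor_engine.backward(policy_loss)

    # Forward/backward pass 2: pretraining (ptx) loss on supervised data.
    # With gradient checkpointing on, activations are recomputed during each
    # of these backward passes, which is where the extra cost comes from.
    ptx_out = actor_engine(ptx_batch["input_ids"],
                           attention_mask=ptx_batch["attention_mask"],
                           labels=ptx_batch["labels"])
    actor_engine.backward(ptx_coef * ptx_out.loss)

    actor_engine.step()
    return policy_loss.detach(), ptx_out.loss.detach()
```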
Hi, I read the DeepSpeed docs and have the following confusion: (1) What is the difference between these methods when running inference on LLMs? a. deepspeed.initialize and then write code to generate...
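For reference, the two entry points differ roughly as sketched below. This is a minimal sketch assuming a Hugging Face causal LM on a single GPU; the model name and argument values are illustrative (newer DeepSpeed releases spell the parallelism argument `tensor_parallel={"tp_size": ...}` instead of `mp_size`).

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b",
                                             torch_dtype=torch.float16)

# (a) deepspeed.initialize builds a *training* engine (ZeRO partitioning,
#     optimizer state, etc.); you can still call generate on engine.module,
#     but no optimized inference kernels are injected.
# engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)

# (b) deepspeed.init_inference replaces supported modules with fused inference
#     kernels and can shard the model across GPUs with tensor parallelism.
ds_model = deepspeed.init_inference(model,
                                    mp_size=1,
                                    dtype=torch.float16,
                                    replace_with_kernel_inject=True)

inputs = tok("DeepSpeed is", return_tensors="pt").to("cuda")
print(tok.decode(ds_model.module.generate(**inputs, max_new_tokens=32)[0]))
```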
Fixed two issues: * Padding tokens should be ignored in training; their labels should be set to `-100` so that `CrossEntropyLoss` ignores them. * Append the correct `eos_token` to the response text....
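A minimal sketch of what the two fixes amount to, assuming a Hugging Face tokenizer; the prompt/response text and sequence length below are illustrative, not the PR's actual code.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

prompt = "Human: What is the capital of France? Assistant:"
response = " Paris."

# Fix 2: terminate the response with the tokenizer's real eos_token so the
# model learns when to stop generating.
text = prompt + response + tok.eos_token

enc = tok(text, max_length=32, padding="max_length",
          truncation=True, return_tensors="pt")

# Fix 1: padded positions get label -100, which CrossEntropyLoss ignores.
labels = enc["input_ids"].clone()
labels[enc["attention_mask"] == 0] = -100
```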
I am trying to run the `training/HelloDeepSpeed` example in a fresh Python virtualenv but get the error below. I installed the dependencies using https://github.com/microsoft/DeepSpeedExamples/blob/master/training/HelloDeepSpeed/requirements.txt ``` Traceback (most recent call last): File "/media/home/hemant/src/DeepSpeedExamples/training/HelloDeepSpeed/train_bert.py",...
I benchmarked MII with the run_example.sh script located at "DeepSpeedExamples/benchmarks/inference/mii" in the repository, but it stalled as follows:  Then after a few minutes it...
**Problem:** I have a previously trained model state dict file, e.g., a reward model saved as `PATH/pytorch_model.bin`. When I try to reload it for further training with the ZeRO-3 optimizer, an error...
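One commonly suggested workaround, sketched below under the assumption that the checkpoint is a plain (non-ZeRO) state dict: load it into the model on CPU *before* calling `deepspeed.initialize`, so that ZeRO-3 partitions the already restored weights. The model name, path, and config values here are illustrative.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": True},
}

# Restore the flat checkpoint first; after deepspeed.initialize the parameters
# are partitioned across ranks and a plain load_state_dict no longer matches
# the sharded layout.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
state_dict = torch.load("PATH/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)  # strict=False if head names differ

engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)
```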
Might be a bug in the hybrid engine: in Step 3, the generation sequence is wrong when the hybrid engine is enabled.
When using the hybrid engine, the output sequence is always 'a a a a ', while if I disable the hybrid engine the output sequence is correct. Here is my log...
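For anyone trying to reproduce or work around this: the hybrid engine is controlled through the `hybrid_engine` block of the DeepSpeed config (in DeepSpeed-Chat it is populated from a command-line flag such as `--enable_hybrid_engine`). Below is a minimal sketch of that block with illustrative values; flipping `"enabled"` to `False` falls back to the regular generation path, which is a quick way to confirm whether the garbled output comes from the hybrid engine.

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 3},
    "hybrid_engine": {
        "enabled": True,              # set False to bypass the hybrid engine
        "max_out_tokens": 512,
        "inference_tp_size": 1,
        "release_inference_cache": False,
        "pin_parameters": True,
        "tp_gather_partition_size": 8,
    },
}
```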
My training environment is a Docker image pulled from `deepspeed/deepspeed:v072_torch112_cu117`, and I run it with `docker run -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --network train-net --name fuyx-work -v...
**Repo link**: [Stable-Diffusion inference](https://github.com/microsoft/DeepSpeedExamples/tree/master/inference/huggingface/stable-diffusion) **Command used to run**: ```deepspeed --num_gpus 1 test-stable-diffusion.py``` **Env**: ``` RTX 3090 deepspeed 0.12.6 torch 1.13.1 diffusers 0.26.1 triton 2.0.0.dev20221202 ``` **Traceback information**:  It...
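For context, the example boils down to roughly the pattern below. This is a minimal sketch assuming a diffusers `StableDiffusionPipeline`; the model id and `init_inference` arguments are illustrative placeholders, not the exact contents of test-stable-diffusion.py, and supported kernel injection may depend on the torch/diffusers/triton versions listed above.

```python
import torch
import deepspeed
from diffusers import StableDiffusionPipeline

# Load the pipeline in fp16 on the GPU that `deepspeed --num_gpus 1` assigns.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
).to("cuda")

# init_inference injects DeepSpeed's fused kernels into the pipeline's
# submodules (UNet, VAE, text encoder) when kernel injection is enabled.
pipe = deepspeed.init_inference(pipe,
                                dtype=torch.float16,
                                replace_with_kernel_inject=True)

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("astronaut.png")
```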