DeepSpeedExamples issues

AttributeError: 'DeepSpeedHybridEngine' object has no attribute 'mp_group' in step 3.

9

/data/miniconda3/envs/arainmodel/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py:296 in create_inference_containers
293 self._orig_modules_others.append(child)
294 self._orig_fwds_others.append(child.forward)
295...

Arain-sh

deespeed chat

hybrid engine

deepspeed-chat example script run error at step2

I ran the command `python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node`. When the process reached `step 2`: ``` Launch command: bash /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_node/run_350m.sh /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m ``` we encountered the following error,...

vpegasus

A few questions for README of stage 3 (RL section)

1

My questions are mostly about stage 3. The doc at https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/README.md says: ``` If you don't have step 1 and step 2 models. You may simply...

Emerald01

Fix RLHF loss metrics & single-gpu training script

2

This PR fixes:
1. the actor/critic mean loss calculation
2. step-3 training script for 1.3b model on single gpu
3. some typos

li-plus

Step 2 exited with non-zero status 2

1

In step 2, how do I solve this problem? ![image](https://user-images.githubusercontent.com/128198109/233772944-88baa8a3-a45f-439a-a719-100cc64a305f.png) @codedecde

awelldone

deespeed chat

about the training details of step3 in DeepSpeed-Chat: PPO

1

Regarding the two parts of the code, generating training data and PPO training (applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py), I think the current training is more like an on-policy method. Because per_device_train_batch_size == per_device_mini_train_batch_size, now the...
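A minimal sketch of the arithmetic behind this observation, not the actual code in main.py: it assumes each generated batch is simply chunked into PPO mini-batches, and the helper `split_into_mini_batches` and the numeric values are hypothetical. Only the two argument names come from the issue text.

```python
def split_into_mini_batches(samples, mini_batch_size):
    """Chunk one generated batch into PPO mini-batches (hypothetical helper)."""
    return [
        samples[i:i + mini_batch_size]
        for i in range(0, len(samples), mini_batch_size)
    ]


# Illustrative values only; they are not taken from the training scripts.
per_device_train_batch_size = 4       # samples generated per step
per_device_mini_train_batch_size = 4  # samples consumed per PPO update

generated = list(range(per_device_train_batch_size))
mini_batches = split_into_mini_batches(generated, per_device_mini_train_batch_size)

# When the two sizes are equal there is exactly one mini-batch, so every
# update uses data produced by the current policy (on-policy-like behaviour).
updates_per_generation = len(mini_batches)
print(updates_per_generation)
```

Under the same assumption, a mini-batch size smaller than the generation batch would give more than one update per generated batch, and the later updates would use samples from a policy that has already been changed.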

guijuzhejiang

deespeed chat

How to run multinode script in slurm cluster?

I want to launch the run_66b RLHF script on a Slurm cluster. I tried to find a tutorial but could not.

wang-zerui

[step1_supervised_finetuning] run_chinese.sh error with deepspeed config

When I run the script `bash training_scripts/other_language/run_chinese.sh`, I encounter a problem. ``` Traceback (most recent call last): File "xxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 339, in main() File "xxx/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 284, in main model,...

xf4fresh

Error after changing the model from opt to gpt

2

I trained the PPO model using GPT. I changed the model_name_or_path option from opt to gpt2. Step 1 and step 2 passed, but an error occurred in step...

lljjgg

deespeed chat

new-config

hybrid engine

nvcc compile error reduction_utils.h(171) error: no operator "<" matches these operands FAILED: layer_norm.cuda.o

Has anyone else met this problem? Running single_gpu with the 1.3B model, the two previous steps, step1 and step2, both completed successfully, but step3 yields errors when **nvcc**...

WXFMAV
