DeepSpeedExamples issues

Question: Why did you implement LoRA by hand instead of using peft?

1

peft looks more convenient for integrating LoRA. Is this because of company policy, or is there another reason?

Inference test enhancement

Allow specifying the number of quantization groups in the inference test script using a `quantize_groups` argument. This PR is complementary to PR [#3519](https://github.com/microsoft/DeepSpeed/pull/3519) on the main repo, and should be...

sakogan
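A minimal sketch of how a `quantize_groups` argument could be exposed on a test script's command line. Only the argument name comes from the PR description; the use of `argparse`, the `int` type, the default of 1, and the helper names `build_parser` and `parse_args` are assumptions for illustration, not the PR's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the parser for a hypothetical inference test script."""
    parser = argparse.ArgumentParser(description="Inference test (illustrative sketch)")
    # Argument named in the PR description; type and default are assumptions.
    parser.add_argument("--quantize_groups", type=int, default=1,
                        help="number of quantization groups")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse arguments and reject non-positive group counts."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.quantize_groups < 1:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--quantize_groups must be a positive integer")
    return args


args = parse_args(["--quantize_groups", "32"])
print(args.quantize_groups)
```

Validating the value at parse time keeps an invalid group count from surfacing later as a harder-to-read failure inside the quantization code.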

deepspeed-chat: print mean stage1/2 loss periodically

2

Print the mean loss periodically, based on the DeepSpeed `steps_per_print` configuration, so the mean loss is printed on an optimizer step boundary. To reduce log clutter, only the rank 0 loss is printed. This...

mosheisland
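The behaviour described in this PR could be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the class `MeanLossLogger`, its method `update`, and the `is_step_boundary` flag are hypothetical names, and the caller is assumed to know when a micro-batch completes an optimizer step.

```python
from typing import Optional


class MeanLossLogger:
    """Accumulate losses and report their mean every `steps_per_print` optimizer steps."""

    def __init__(self, steps_per_print: int, rank: int) -> None:
        self.steps_per_print = steps_per_print
        self.rank = rank
        self.loss_sum = 0.0
        self.loss_count = 0
        self.optimizer_steps = 0

    def update(self, loss: float, is_step_boundary: bool) -> Optional[float]:
        """Record one micro-batch loss; return the mean when it is printed."""
        self.loss_sum += loss
        self.loss_count += 1
        if not is_step_boundary:
            return None
        self.optimizer_steps += 1
        if self.optimizer_steps % self.steps_per_print != 0:
            return None
        mean_loss = self.loss_sum / self.loss_count
        # Start a fresh window after each report.
        self.loss_sum, self.loss_count = 0.0, 0
        if self.rank != 0:
            # Only rank 0 prints, to reduce log clutter.
            return None
        print(f"step {self.optimizer_steps}: mean loss = {mean_loss:.4f}")
        return mean_loss


# Two micro-batches per optimizer step, report every 2 optimizer steps.
logger = MeanLossLogger(steps_per_print=2, rank=0)
for loss, boundary in [(1.0, False), (2.0, True), (3.0, False), (4.0, True)]:
    logger.update(loss, boundary)
```

Note that the mean here covers only the local rank's losses, matching the description that only the rank 0 loss is printed; averaging across ranks would need an extra collective operation.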

In step 3, I hit an error when executing self.actor_model.eval()

6

Here is the error I hit. It seems that `self._total_batch_size` is `None`, but I don't know why:

```
File "/path/model_training/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 434, in main
    out = trainer.generate_experience(batch_prompt['prompt'],
File "/path/model_training/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py",...
```

ZJXNEFU

deespeed chat

hybrid engine

Does DeepSpeed-Chat support pipeline parallelism?

Does DeepSpeed-Chat support pipeline parallelism?

mollon650

RuntimeError with BLOOMZ: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

1

I'm facing the above error in both stage 1 and stage 2 when using BLOOMZ 3B and 560M. I tried adding "model.to(device)" and "model.to('cuda')" to main.py but neither worked. The...

karim1104

Something wrong at step1_supervised_finetuning/main.py

This [line](https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py#L299):

```python
losses += loss.float()
```

will, I think, accumulate gradients and result in OOM. Maybe change it to this:

```python
losses += loss.detach().cpu()
```

zyh3826
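A small sketch of the concern raised in this issue, using a toy model rather than the code in `main.py`. The model, tensor shapes, and variable names are assumptions for illustration. The sketch runs on CPU, so the `.cpu()` call from the suggestion is omitted. Whether the linked line actually retains a graph depends on whether it runs under `torch.no_grad()`, which this sketch does not settle.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)      # toy stand-in for the real model
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)

losses_with_graph = 0.0
losses_detached = 0.0
for _ in range(3):
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    # The running sum stays connected to every step's autograd graph.
    losses_with_graph = losses_with_graph + loss.float()
    # Only the value is kept; each step's graph can be freed.
    losses_detached = losses_detached + loss.detach().float()

print(losses_with_graph.requires_grad)  # True
print(losses_detached.requires_grad)    # False

# Inside torch.no_grad() no graph is recorded, so detach() is not needed there.
with torch.no_grad():
    eval_loss = torch.nn.functional.mse_loss(model(inputs), targets)
print(eval_loss.requires_grad)          # False
```

If the accumulation already happens inside `torch.no_grad()`, the suggested change would not affect memory held by autograd, although `.cpu()` would still move the running value off the GPU.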

Step3: 8 * A100-40G training LLAMA2-7B and OPT-350M out of memory.

2

Hi, in step 3 I ran the following command and got an OOM error while initializing the Ref Model (the Actor Model initialized without problems):

> Actor_Lr=9.65e-6 Critic_Lr=5e-6 deepspeed --master_port 12346 main.py \
> --data_path Dahoas/rm-static \...

GasolSun36

Should it use global_rank as the condition for shared-disk?

https://github.com/microsoft/DeepSpeedExamples/blob/b116838b905430a5fbebe3713a68d90638478aa9/applications/DeepSpeed-Chat/dschat/utils/data/data_utils.py#L301 If a job runs on multiple nodes that share a disk, it seems that the data cache is built redundantly on the other nodes.

sz128
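The question in this issue can be illustrated with a short sketch. It assumes the existing condition keys on the per-node local rank, which the issue implies but this listing does not show; the function names `should_build_cache` and `count_builders` and the `shared_disk` flag are hypothetical.

```python
def should_build_cache(local_rank: int, global_rank: int, shared_disk: bool) -> bool:
    """Decide whether this process builds the dataset cache."""
    if shared_disk:
        # One build is enough when every node sees the same files.
        return global_rank == 0
    # With node-local disks, each node needs its own copy.
    return local_rank == 0


def count_builders(num_nodes: int, gpus_per_node: int, shared_disk: bool) -> int:
    """Count how many processes would build the cache in a simulated job."""
    builders = 0
    for node in range(num_nodes):
        for local_rank in range(gpus_per_node):
            global_rank = node * gpus_per_node + local_rank
            if should_build_cache(local_rank, global_rank, shared_disk):
                builders += 1
    return builders


print(count_builders(2, 8, shared_disk=True))   # 1
print(count_builders(2, 8, shared_disk=False))  # 2
```

In a real job the other processes would still need to wait, for example at a barrier, until the cache has been written before reading it.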

[DeepSpeedExamples/applications/DeepSpeed-Chat/] Error happened when running step3_rlhf_finetuning in enable_hybrid_engine mode with togethercomputer/GPT-NeoXT-Chat-Base-20B

2

Error info:

File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 99, in new_inference_container
    _container.create_ds_model_config()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/containers/base.py", line 79, in create_ds_model_config
File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/containers/gptneox.py", line 95, in get_hidden_heads
    return self.client_module.attention.query_key_value.weight.shape[1],...
IndexError: tuple index out of range

GxjGit

deespeed chat

hybrid engine
