DeepSpeedExamples
Example models using DeepSpeed
I’m excited about the recent introduction of Domino and its impressive TP optimization. When I was using deepspeed-domino to better overlap comm & comp in TP, I found domino use...
Hi, thank you for the amazing demo and doc! I have a question regarding this [section](https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/zero_inference/model-support.md#5-compute-attention-scores) in zero-inference. It is mentioned that `"Thus, our current implementation computes attention scores on...
```
deepspeed --master_port 25604 --num_gpus 1 main.py \
    --data_path mydata/ \
    --data_split 0,10,0 \
    --num_padding_at_beginning 0 \
    --model_name_or_path bloom_3b1/ \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --max_seq_len 1024 \
    --learning_rate 9.65e-7...
```
Issue: In the original code, `e2e_rlhf.py` line 68:
```
parser.add_argument(
    "--reward-model",
    type=lambda x: x.replace("facebook/opt-", ""),
    default="350m",
    choices=("350m"),
    help="Which facebook/opt-* model to use for Reward (step 2)",
)
```
The choices...
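The pitfall being reported here is likely a Python one: `("350m")` is just the string `"350m"` (parentheses without a trailing comma do not make a tuple), so argparse's membership check becomes a substring test and the usage text renders the choices as individual characters. A minimal sketch of the corrected declaration (the trailing comma is my suggested fix, not a quote of the repo):
```
import argparse

parser = argparse.ArgumentParser()
# With choices=("350m") -- a plain string -- argparse does a substring
# test, so stray values like "50" pass validation and the usage line
# prints the choices as {3,5,0,m}. A one-element tuple needs a comma.
parser.add_argument(
    "--reward-model",
    type=lambda x: x.replace("facebook/opt-", ""),
    default="350m",
    choices=("350m",),  # trailing comma makes this a real tuple
    help="Which facebook/opt-* model to use for Reward (step 2)",
)

args = parser.parse_args(["--reward-model", "facebook/opt-350m"])
print(args.reward_model)  # -> 350m
```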
How can I change the master_port when using deepspeed for multi-GPU on a single node, i.e. localhost?
When I use the default command, it seems to use 29500 as the master_port. However, the master_port seems unchangeable, even when I use "--master_port 29501" or change it using "deepspeed.init_distributed(dist_backend='nccl', distributed_port=config.master_port)". Error message:...
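For reference, a minimal sketch of the two knobs the question is already trying, assuming a single-node launch (the script name and port number are illustrative). Note that the deepspeed launcher exports MASTER_PORT to each rank, so an in-code `distributed_port` may be overridden by the environment:
```
import deepspeed

# Option 1 (launcher flag):
#   deepspeed --master_port 29501 --num_gpus 2 train.py
# Option 2 (in code): pass the port before any other distributed setup.
# If MASTER_PORT is already set in the environment (the launcher exports
# it per rank) or torch.distributed is already initialized, this value
# may be silently ignored -- a common reason the flag appears to do nothing.
deepspeed.init_distributed(dist_backend="nccl", distributed_port=29501)
```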
I know this is a path problem, and I have modified the code so it knows the parent directory, but I still can't find the package; step1_supervised_finetuning and utils are at...
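A minimal sketch of the usual fix for this layout, assuming the failing script sits one directory below the shared packages (the specific import at the end is illustrative):
```
import os
import sys

# Make the parent directory importable so that sibling packages such as
# utils and step1_supervised_finetuning resolve from this script.
sys.path.insert(
    0, os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))
)

from utils.model.model_utils import create_hf_model  # noqa: E402
```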
In your example you convert the AlexNet into a list of layers:
```
def join_layers(vision_model):
    layers = [
        *vision_model.features,
        vision_model.avgpool,
        lambda x: torch.flatten(x, 1),
        *vision_model.classifier,
    ]
    return layers
```
which...
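For context, a flat list of callables like this is the form DeepSpeed's pipeline engine consumes. A minimal sketch of how such a list is typically fed to `PipelineModule`, assuming torchvision's stock AlexNet and two pipeline stages (the stage count and loss function are illustrative):
```
import torch
import torch.nn.functional as F
import deepspeed
from deepspeed.pipe import PipelineModule
from torchvision.models import alexnet

def join_layers(vision_model):
    # Flatten the model into a sequential list of callables so the
    # pipeline engine can partition it across stages.
    return [
        *vision_model.features,
        vision_model.avgpool,
        lambda x: torch.flatten(x, 1),
        *vision_model.classifier,
    ]

# Run under the deepspeed launcher: PipelineModule requires
# torch.distributed to be initialized before construction.
deepspeed.init_distributed()
net = PipelineModule(
    layers=join_layers(alexnet()),
    loss_fn=F.cross_entropy,
    num_stages=2,  # illustrative stage count
)
```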
I'm wondering if we can take the ZenFlow finetuning example and extend it into a test bed for different DeepSpeed technologies. The ZenFlow finetuning example: https://github.com/deepspeedai/DeepSpeedExamples/tree/master/training/DeepSpeed-ZenFlow/finetuning The reason is...
https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py#L148
```
def compute_rewards(self, prompts, log_probs, ref_log_probs, reward_score,
                    action_mask):
    kl_divergence_estimate = -self.kl_ctl * (log_probs - ref_log_probs)
    rewards = kl_divergence_estimate
    start = prompts.shape[1] - 1
    ends = start + action_mask[:, start:].sum(1)...
```
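To unpack the quoted lines: the per-token reward is initialized to a KL penalty, `-kl_ctl * (log_probs - ref_log_probs)`, which keeps the policy close to the reference model; `start` is the index of the first generated token and `ends` marks where each sequence's valid actions stop. A minimal self-contained sketch of this standard RLHF reward shaping (the clipping constant and the final-token credit follow the usual PPO-for-RLHF recipe and are assumptions here, not a quote of the file's continuation):
```
import torch

def kl_shaped_rewards(log_probs, ref_log_probs, reward_score, action_mask,
                      prompt_len, kl_ctl=0.1, clip_reward=5.0):
    # Per-token penalty for diverging from the reference policy.
    rewards = -kl_ctl * (log_probs - ref_log_probs)
    start = prompt_len - 1                        # first generated position
    ends = start + action_mask[:, start:].sum(1)  # one past last valid token
    reward_clip = torch.clamp(reward_score, -clip_reward, clip_reward)
    # The scalar environment reward is credited to the final token only.
    for j in range(log_probs.shape[0]):
        rewards[j, start:ends[j]][-1] += reward_clip[j]
    return rewards
```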