
Example models using DeepSpeed

Results: 274 DeepSpeedExamples issues

When I run `bash training_scripts/single_node/run_1.3b.sh`, I hit this error:

```shell
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.0961456298828125 seconds
Loading extension module fused_adam......
```

bug
deespeed chat

![image](https://user-images.githubusercontent.com/13724286/232206508-a702748c-3537-43fc-9755-e73ed1131fa6.png) ![image](https://user-images.githubusercontent.com/13724286/232206537-24ffaccd-fb5a-4958-a8fd-15a99095bfcb.png) My environment: deepspeed==0.9.0, torch==2.0.0+cu117, CUDA version 11.0. The pretrained model is facebook/opt-350m. Can anyone help me solve this problem? Thanks

bug
deespeed chat

Thanks for helping to solve this problem. ![13261cfa8f343e420cb3c1e845dede1](https://user-images.githubusercontent.com/36975782/232207247-adcaacee-1ac6-43e5-bcc1-d8290c62049f.jpg)

I run it on Ubuntu 20.04 with two 3090 cards and it always gets stuck. A py-spy dump shows: Process 46930: /home/qi/anaconda3/envs/deepspeed/bin/python -u main.py --local_rank=0 --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage...

```shell
(deepspeed) [menkeyi@gpu1 DeepSpeed-Chat]$ python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node
---=== Running Step 1 ===---
Running: bash /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_node/run_13b.sh /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/13b
```

GPU usage rate:

```shell
(deepspeed) [menkeyi@gpu1 DeepSpeed-Chat]$ nvidia-smi
Sat Apr 15...
```

bug
deespeed chat

When I run step1_supervised_finetuning, I found that the opt-1.3B model occupies about 15 GB of memory, as if LoRA is not working; but when I print the trainable parameters, it...

bug
deespeed chat
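
For questions like the one above, a quick way to check whether LoRA has actually frozen the base weights is to count trainable versus total parameters. The sketch below is illustrative and not code from the repository; `model` is assumed to be the LoRA-wrapped model produced during step-1 setup.

```python
# Minimal sketch (illustrative, not from the repo): count trainable vs. total
# parameters to verify that LoRA froze the base weights as expected.
def print_trainable_parameters(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / total: {total:,} "
          f"({100 * trainable / total:.2f}% trainable)")
```

Calling this right after the LoRA conversion shows how many parameters actually require gradients, independent of the overall GPU memory footprint.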

https://github.com/microsoft/DeepSpeedExamples/blob/cd19b3bf1e5b60dd73b09c7463da4eedada1eed7/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py#L234

There are two parameters in DistributedSampler:

```python
num_replicas (int, optional): Number of processes participating in distributed training.
    By default, :attr:`world_size` is retrieved from the current distributed group.
rank (int,...
```
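
For reference, here is a minimal sketch of passing both parameters explicitly instead of relying on the defaults. It assumes `torch.distributed` has already been initialized; the helper name and arguments are illustrative, not the repository's code.

```python
# Hypothetical sketch: construct DistributedSampler with num_replicas and rank
# passed explicitly rather than letting it read them from the default group.
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def build_train_dataloader(train_dataset, batch_size):
    sampler = DistributedSampler(
        train_dataset,
        num_replicas=dist.get_world_size(),  # total number of training processes
        rank=dist.get_rank(),                # index of this process in the group
        shuffle=True,
    )
    return DataLoader(train_dataset, sampler=sampler, batch_size=batch_size)
```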

env:

```
gpu: 4*A100 80G
pytorch: 1.13.1
cuda version: 11.7
deepspeed: 0.9.0
transformers: 4.28.0.dev
```

run script:

```
OUTPUT=$1
ZERO_STAGE=3
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if...
```

Is there a way to log experiments to wandb? E.g., loss, lr, and customized metrics.
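
The DeepSpeed-Chat training scripts do not wire this up out of the box; the sketch below is a hypothetical illustration of reporting loss, learning rate, and a custom metric with the wandb API, using placeholder values where the real training loop would supply them.

```python
# Hypothetical sketch: logging scalars to Weights & Biases.
# In a real run, wandb.log() would be called inside the existing training loop
# with the actual loss and lr_scheduler values; the numbers below are placeholders.
import math
import wandb

wandb.init(project="deepspeed-chat-sft")  # project name is illustrative

for step in range(100):
    loss = math.exp(-step / 50.0)          # stand-in for the real training loss
    lr = 9.65e-6 * (1.0 - step / 100.0)    # stand-in for lr_scheduler.get_last_lr()[0]
    wandb.log(
        {"train/loss": loss, "train/lr": lr, "custom/perplexity": math.exp(loss)},
        step=step,
    )

wandb.finish()
```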

I am new to using LLMs. Is it possible to just download an open-source LM and have multiple conversations with it, without fine-tuning?

bug
question
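
Regarding the question above: a pretrained checkpoint can be downloaded and queried directly, though without fine-tuning (e.g. the DeepSpeed-Chat RLHF pipeline) the replies will not be instruction-tuned. Below is a minimal sketch, assuming the Hugging Face transformers library and using facebook/opt-350m purely as an example; the prompt format is an assumption, not the DeepSpeed-Chat inference code.

```python
# Minimal, illustrative sketch of a multi-turn chat loop with a pretrained model,
# with no fine-tuning involved.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-350m"  # example checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

history = ""
for user_msg in ["Hello, who are you?", "What can DeepSpeed be used for?"]:
    history += f"Human: {user_msg}\nAssistant:"
    inputs = tokenizer(history, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens, not the prompt.
    reply = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    history += reply + "\n"
    print(f"Assistant: {reply.strip()}")
```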