
Example models using DeepSpeed

Results: 274 DeepSpeedExamples issues

When I run `bash training_scripts/single_node/run_1.3b.sh`, I hit this error:

```shell
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.0961456298828125 seconds
Loading extension module fused_adam......
```

bug
deespeed chat

![image](https://user-images.githubusercontent.com/13724286/232206508-a702748c-3537-43fc-9755-e73ed1131fa6.png) ![image](https://user-images.githubusercontent.com/13724286/232206537-24ffaccd-fb5a-4958-a8fd-15a99095bfcb.png) My environment: deepspeed==0.9.0, torch==2.0.0+cu117, CUDA version 11.0. The pretrained model is facebook/opt-350m. Can anyone help me solve this problem? Thanks

bug
deespeed chat

Thanks for helping to solve this problem. ![13261cfa8f343e420cb3c1e845dede1](https://user-images.githubusercontent.com/36975782/232207247-adcaacee-1ac6-43e5-bcc1-d8290c62049f.jpg)

I run it on Ubuntu 20.04 with two 3090 cards and it always gets stuck. A py-spy dump shows: Process 46930: /home/qi/anaconda3/envs/deepspeed/bin/python -u main.py --local_rank=0 --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage...

```shell
(deepspeed) [menkeyi@gpu1 DeepSpeed-Chat]$ python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node
---=== Running Step 1 ===---
Running: bash /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_node/run_13b.sh /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/13b
```

GPU usage rate:

```shell
(deepspeed) [menkeyi@gpu1 DeepSpeed-Chat]$ nvidia-smi
Sat Apr 15...
```

bug
deespeed chat

When I run step1_supervised_finetuning, I found that the opt-1.3B model occupies about 15 GB of memory, as if LoRA is not working; but when I print the trainable parameters, it...

bug
deespeed chat
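
For questions like the one above, a quick way to check whether LoRA has actually frozen the base weights is to count trainable versus total parameters. The sketch below is illustrative and not code from the repository; `model` is assumed to be the LoRA-wrapped model produced during step-1 setup.

```python
# Minimal sketch (illustrative, not from the repo): count trainable vs. total
# parameters to verify that LoRA froze the base weights as expected.
def print_trainable_parameters(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / total: {total:,} "
          f"({100 * trainable / total:.2f}% trainable)")
```

Calling this right after the LoRA conversion shows how many parameters actually require gradients, independent of the overall GPU memory footprint.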

https://github.com/microsoft/DeepSpeedExamples/blob/cd19b3bf1e5b60dd73b09c7463da4eedada1eed7/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py#L234

There are two parameters in DistributedSampler:

```python
num_replicas (int, optional): Number of processes participating in distributed training.
    By default, :attr:`world_size` is retrieved from the current distributed group.
rank (int,...
```
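
For reference, here is a minimal sketch of passing both parameters explicitly instead of relying on the defaults. It assumes `torch.distributed` has already been initialized; the helper name and arguments are illustrative, not the repository's code.

```python
# Hypothetical sketch: construct DistributedSampler with num_replicas and rank
# passed explicitly rather than letting it read them from the default group.
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def build_train_dataloader(train_dataset, batch_size):
    sampler = DistributedSampler(
        train_dataset,
        num_replicas=dist.get_world_size(),  # total number of training processes
        rank=dist.get_rank(),                # index of this process in the group
        shuffle=True,
    )
    return DataLoader(train_dataset, sampler=sampler, batch_size=batch_size)
```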

env:

```
gpu: 4*A100 80G
pytorch: 1.13.1
cuda version: 11.7
deepspeed: 0.9.0
transformers: 4.28.0.dev
```

run script:

```
OUTPUT=$1
ZERO_STAGE=3
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if...
```

Is there a way to log experiments to wandb? E.g., loss, lr, and customized metrics.
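
The DeepSpeed-Chat training scripts do not wire this up out of the box; the sketch below is a hypothetical illustration of reporting loss, learning rate, and a custom metric with the wandb API, using placeholder values where the real training loop would supply them.

```python
# Hypothetical sketch: logging scalars to Weights & Biases.
# In a real run, wandb.log() would be called inside the existing training loop
# with the actual loss and lr_scheduler values; the numbers below are placeholders.
import math
import wandb

wandb.init(project="deepspeed-chat-sft")  # project name is illustrative

for step in range(100):
    loss = math.exp(-step / 50.0)          # stand-in for the real training loss
    lr = 9.65e-6 * (1.0 - step / 100.0)    # stand-in for lr_scheduler.get_last_lr()[0]
    wandb.log(
        {"train/loss": loss, "train/lr": lr, "custom/perplexity": math.exp(loss)},
        step=step,
    )

wandb.finish()
```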

I am new to using LLMs. Is it possible to just download an open-source LM and have multiple conversations with it, without fine-tuning?

bug
question
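
Regarding the question above: a pretrained checkpoint can be downloaded and queried directly, though without fine-tuning (e.g. the DeepSpeed-Chat RLHF pipeline) the replies will not be instruction-tuned. Below is a minimal sketch, assuming the Hugging Face transformers library and using facebook/opt-350m purely as an example; the prompt format is an assumption, not the DeepSpeed-Chat inference code.

```python
# Minimal, illustrative sketch of a multi-turn chat loop with a pretrained model,
# with no fine-tuning involved.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-350m"  # example checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

history = ""
for user_msg in ["Hello, who are you?", "What can DeepSpeed be used for?"]:
    history += f"Human: {user_msg}\nAssistant:"
    inputs = tokenizer(history, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens, not the prompt.
    reply = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    history += reply + "\n"
    print(f"Assistant: {reply.strip()}")
```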