DeepSpeedExamples

Example models using DeepSpeed

323 DeepSpeedExamples issues, sorted by most recently updated

I am trying to run RLHF with my previously trained Actor and Reward models. However, I encounter the following exception: ``` Traceback (most recent call last): File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 516, in...

When using the official default configuration on a single V100-32G, the whole pipeline runs out of memory (OOM). Following the other issues mentioned above, I changed the...

deespeed chat
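For the single-GPU OOM above, the usual levers are ZeRO stage 3 with CPU offload and a smaller per-GPU batch. A minimal sketch of the relevant DeepSpeed config keys, assuming the config is built as a Python dict (DeepSpeed-Chat assembles its own config in its training utilities; the values here are illustrative, not a recommendation):

```python
# Illustrative DeepSpeed runtime config for fitting a 32 GB GPU.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # smallest per-GPU batch (example value)
    "gradient_accumulation_steps": 8,      # keep the effective batch size up (example value)
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # partition params, grads, and optimizer state
        "offload_param": {"device": "cpu"},      # push parameters to host RAM
        "offload_optimizer": {"device": "cpu"},  # push optimizer state to host RAM
    },
}
```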

Does this program support TensorBoard? I could not find any TensorBoard logs.

enhancement
deespeed chat
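DeepSpeed itself can emit TensorBoard event files when a "tensorboard" section is present in the runtime config; whether the DeepSpeed-Chat scripts expose a flag for this is not shown here. A hedged sketch, with the path and job name purely illustrative:

```python
# Illustrative DeepSpeed config enabling the built-in TensorBoard monitor.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "tensorboard": {
        "enabled": True,
        "output_path": "./tensorboard_logs/",  # where event files are written (example path)
        "job_name": "step1_sft",                # run/subdirectory name (hypothetical)
    },
}
```

Alternatively, metrics can be logged by hand with `torch.utils.tensorboard.SummaryWriter` inside the training loop.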

Steps 1 and 2 are running normally. When running step 3, I encountered an OOM (out of memory) issue again. Even when the batch size was set to 1, it...

OutOfMemoryError: CUDA out of memory. Tried to allocate 3.82 GiB (GPU 0; 31.75 GiB total capacity; 23.21 GiB already allocated; 2.43 GiB free; 25.59 GiB reserved in total by PyTorch)...

deespeed chat
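In the OOM message above, reserved memory is noticeably larger than allocated memory, which can indicate allocator fragmentation. A hedged sketch of one common mitigation; the 128 MB value is only an example:

```python
import os

# Limit how large the caching allocator's split blocks can be, to reduce
# fragmentation. Must be set before torch initializes its CUDA allocator.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the env var so the setting takes effect

# Inspecting allocator state near the failure point helps confirm fragmentation.
print(torch.cuda.memory_summary(device=0, abbreviated=True))
```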

Hi, I am trying to train a GPT-2 model using the "DeepSpeed-Chat" code. But in step 1, when I use "--offload", I get an error. Below is the problem: ![image](https://user-images.githubusercontent.com/38311101/233521671-5dc88eaf-df4e-4632-be26-9eab42deb74d.png)
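The error itself is only in the screenshot, so the exact cause is not reproduced here. As a hedged sketch of what `--offload` typically changes: CPU offload is paired with DeepSpeedCPUAdam (optimizer math on the host, with a JIT-built C++ op whose compilation can fail if the build toolchain or CUDA headers are missing), while FusedAdam keeps the optimizer step on the GPU. The model and learning rate below are stand-ins:

```python
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam

model = torch.nn.Linear(8, 8)  # stand-in for the GPT-2 policy model
offload = True                 # stands in for the --offload flag

# Offloaded optimizer state -> CPU Adam; otherwise keep the fused GPU Adam.
AdamOptimizer = DeepSpeedCPUAdam if offload else FusedAdam
optimizer = AdamOptimizer(model.parameters(), lr=9.65e-6, betas=(0.9, 0.95))
```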

I run the training script in a multi-node environment: training/step1_supervised_finetuning/training_scripts/multi_node/run_66b.sh But it seems the nodes are not launched successfully, and there is a warning in the log as below: ``` 2023-04-21...

deespeed chat
system

I have 8x V100-32G. When using lora_dim=128 and gradient_checkpointing, training step 1 runs well, but training is slow. When I drop gradient_checkpointing and use only_optimize_lora, I get OOM. Could you...
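This trade-off is expected: activation (gradient) checkpointing recomputes activations in the backward pass, so steps are slower but activation memory drops sharply; disabling it while only optimizing LoRA weights shrinks optimizer state but brings the full activation footprint back. A minimal sketch using the Hugging Face API; the model name is just an example:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")  # example model

model.gradient_checkpointing_enable()    # slower steps, much lower activation memory
# model.gradient_checkpointing_disable() # faster, but activations stay resident -> OOM risk
```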

Below is the original code: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/utils/data/data_utils.py#L157 In my experiments, it OOMs when the dataset size is 500,000.
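A hedged sketch of one way such dataset-size-dependent OOMs are often avoided, assuming the blow-up comes from tokenizing and materializing every sample up front: tokenize lazily in `__getitem__` so only raw strings are held in memory. The class and parameter names here are hypothetical, not the repository's code:

```python
from torch.utils.data import Dataset

class LazyPromptDataset(Dataset):
    """Tokenize on demand instead of building 500k token tensors at init time."""

    def __init__(self, prompts, tokenizer, max_seq_len=512):
        self.prompts = prompts          # raw strings, cheap to hold in memory
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.prompts[idx],
            max_length=self.max_seq_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        return {k: v.squeeze(0) for k, v in enc.items()}
```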

Because the server's operating system is CentOS, errors often occur when installing with the method provided by the author.