DeepSpeedExamples
Example models using DeepSpeed
I am trying to run RLHF with my previously trained Actor and Reward models. However, I encounter the following exception: ``` Traceback (most recent call last): File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 516, in...
When using the official default configuration on a single V100-32G, I found that the whole pipeline OOMs. Following the other issues mentioned above, I changed the...
Does this program support TensorBoard? I could not find any TensorBoard logs.
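For reference, DeepSpeed itself can emit TensorBoard event files through the monitor section of its config. A minimal sketch is below; the output path and job name are illustrative assumptions, not values from this repo:

```python
# Minimal sketch, assuming DeepSpeed's built-in TensorBoard monitor:
# adding a "tensorboard" section to the config passed to
# deepspeed.initialize() makes the engine write scalar summaries
# (loss, learning rate, etc.) as TensorBoard event files.
ds_config = {
    "train_batch_size": 8,
    "tensorboard": {
        "enabled": True,              # turn the monitor on
        "output_path": "./ds_logs/",  # illustrative output directory
        "job_name": "step1_sft",      # illustrative run name (subdirectory)
    },
}
# View the resulting logs with:  tensorboard --logdir ./ds_logs/
```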
Steps 1 and 2 are running normally. When running step 3, I encountered an OOM (out of memory) issue again. Even when the batch size was set to 1, it...
OutOfMemoryError: CUDA out of memory. Tried to allocate 3.82 GiB (GPU 0; 31.75 GiB total capacity; 23.21 GiB already allocated; 2.43 GiB free; 25.59 GiB reserved in total by PyTorch)...
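A common mitigation for this class of OOM, sketched below under the assumption that ZeRO stage 3 with CPU offload is acceptable for the hardware: move optimizer state and parameters to CPU RAM, and preserve the effective batch size with gradient accumulation. The numbers are illustrative, not the repo's defaults:

```python
# Minimal sketch: a DeepSpeed config that trades step speed for GPU
# memory on a single 32 GB V100. All values are illustrative.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,            # keep the effective batch size up
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # partition params, grads, optimizer state
        "offload_optimizer": {"device": "cpu"},  # optimizer state in CPU RAM
        "offload_param": {"device": "cpu"},      # parameters in CPU RAM
    },
}
```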
Hi, I am trying to train a GPT-2 model using the "DeepSpeed-Chat" code. But in step 1, when I use "--offload", I get an error. Below is the problem:
I run the training script in a multi-node environment: training/step1_supervised_finetuning/training_scripts/multi_node/run_66b.sh. But it seems the nodes are not launched successfully, and the log shows a warning as below: ``` 2023-04-21...
I have V100-32G * 8. When using lora_dim=128 with gradient_checkpointing, step 1 training runs well, but it is slow. When I drop gradient_checkpointing and use only only_optimize_lora, I get OOM. Could you...
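The slow-but-fits versus fast-but-OOM behavior matches the usual gradient-checkpointing trade-off: checkpointing discards intermediate activations during the forward pass and recomputes them during backward, so it costs roughly one extra forward pass per step but frees most activation memory. A minimal sketch using the standard Hugging Face API; the model name is illustrative:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

# Slower steps, much lower activation memory: activations are
# recomputed on the fly during the backward pass.
model.gradient_checkpointing_enable()

# Faster steps, but every layer's activations stay resident on the GPU,
# which is why dropping checkpointing can OOM even with LoRA-only updates.
# model.gradient_checkpointing_disable()
```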
Below is the original code: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/utils/data/data_utils.py#L157. In my experiments, it OOMs when the dataset size is 500,000.
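If the OOM stems from materializing Python lists of indices (an assumption; the linked line may do something different), a compact numpy shuffle keeps memory bounded. A minimal sketch:

```python
import numpy as np

def get_shuffle_idx(seed: int, size: int) -> np.ndarray:
    # Minimal sketch: a numpy array with a compact dtype holds 500,000
    # indices in about 2 MB, versus far more for a Python list of int
    # objects; np.uint32 covers sizes up to ~4.29 billion.
    dtype = np.uint32 if size < np.iinfo(np.uint32).max - 1 else np.int64
    rng = np.random.RandomState(seed=seed)
    idx = np.arange(start=0, stop=size, step=1, dtype=dtype)
    rng.shuffle(idx)  # in-place Fisher-Yates shuffle, no extra copies
    return idx
```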
Because the server's operating system is CentOS, installation often fails when following the method provided by the author.