DeepSpeedExamples
Reproducing the step3 13B RLHF benchmark setting
I tried to reproduce the 13B RLHF training on 8x A100-80GB. The default training script is here: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_node/run_13b.sh
In it, per_device_train_batch_size and per_device_mini_train_batch_size are both 16, which differs from the benchmark setting, where the batch size is claimed to be 1024. When I increase per_device_train_batch_size to 32, training raises an OOM error.
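For reference, a quick sketch of the gap between the script's defaults and the claimed 1024 global batch, assuming the usual relationship global = per_device x num_gpus x gradient_accumulation_steps (whether the benchmark used gradient accumulation or a different device count is an assumption here, not something stated in the script):

```python
def global_batch_size(per_device: int, num_gpus: int, grad_accum: int) -> int:
    # Effective global batch = per-device batch * data-parallel GPUs * accumulation steps.
    return per_device * num_gpus * grad_accum

# Default run_13b.sh settings on 8x A100-80GB, no accumulation:
print(global_batch_size(16, 8, 1))  # -> 128

# Reaching the claimed 1024 without raising the per-device batch
# would require 8 gradient-accumulation steps (hypothetical):
print(global_batch_size(16, 8, 8))  # -> 1024
```

So the defaults give a global batch of 128, not 1024, which is why I suspect the benchmark script differs from the one in the repo.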
Could you provide the step3 13B training script that was actually used for the benchmark?