DeepSpeedExamples
Reproducing the step3 13B RLHF benchmark setting
I tried to reproduce the 13B RLHF training on 8x A100-80GB. The default training script is here: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_node/run_13b.sh
In it, per_device_train_batch_size and per_device_mini_train_batch_size are both 16, which differs from the benchmark setting, where the batch size is claimed to be 1024. When I increase per_device_train_batch_size to 32, training raises an OOM error.
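For reference, a quick sketch of the gap between the script's defaults and the claimed 1024 global batch, assuming the usual relationship global = per_device x num_gpus x gradient_accumulation_steps (whether the benchmark used gradient accumulation or a different device count is an assumption here, not something stated in the script):

```python
def global_batch_size(per_device: int, num_gpus: int, grad_accum: int) -> int:
    # Effective global batch = per-device batch * data-parallel GPUs * accumulation steps.
    return per_device * num_gpus * grad_accum

# Default run_13b.sh settings on 8x A100-80GB, no accumulation:
print(global_batch_size(16, 8, 1))  # -> 128

# Reaching the claimed 1024 without raising the per-device batch
# would require 8 gradient-accumulation steps (hypothetical):
print(global_batch_size(16, 8, 8))  # -> 1024
```

So the defaults give a global batch of 128, not 1024, which is why I suspect the benchmark script differs from the one in the repo.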
Could you provide the step3 13B training script that was actually used for the benchmark?