DeepSpeedExamples

Running the 1.6 billion parameter demo is much slower than described, on an A100 GPU?

tcluoct opened this issue 2 years ago • 3 comments

I ran the 1.6 billion parameter demo, and it took 1:46:27 for step 1 and 2:12:56 for step 2. That is much slower than the reference numbers below.

| Model Sizes | Step 1 | Step 2 | Step 3 | Total |
|---|---|---|---|---|
| Actor: OPT-1.3B, Reward: OPT-350M | 2900 sec | 670 sec | 1.2 hr | 2.2 hr |

Is there anything to tune other than running `python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu`?

Thanks

tcluoct avatar Apr 19 '23 13:04 tcluoct

Hi @tcluoct, which GPU are you using to run this example? The values we reported were measured on an A6000 GPU.

mrwyattii avatar Apr 19 '23 18:04 mrwyattii

I'm using an A100, which has better performance than the A6000.

tcluoct avatar Apr 20 '23 00:04 tcluoct

After training, when I run the final model, it responds very strangely. (screenshot attached)

tcluoct avatar Apr 20 '23 00:04 tcluoct

@tcluoct I ran step 1 on an A6000 using the latest versions of DeepSpeed and DeepSpeedExamples, and it was much faster than the time you reported.

Can you update them to the latest version and measure the time for data loading/forward/backward/step?

tohtana avatar May 08 '23 21:05 tohtana
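For reference, here is a minimal sketch of how one could break the per-step time down into data loading / forward / backward / step, as suggested above. It assumes a plain PyTorch loop with a HuggingFace-style model that returns `.loss`, and that `train_dataloader`, `model`, `optimizer`, and `num_steps` already exist (all illustrative names, not part of the DeepSpeed-Chat scripts); with a DeepSpeed engine the backward/step calls would go through `engine.backward(loss)` and `engine.step()` instead.

```python
import time
import torch

def timed(label, fn, timings):
    # Accumulate the wall-clock time of fn() under `label`,
    # synchronizing the GPU so async kernel launches are included.
    torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    torch.cuda.synchronize()
    timings[label] = timings.get(label, 0.0) + (time.perf_counter() - start)
    return result

timings = {}
data_iter = iter(train_dataloader)  # assumed to yield dict batches with labels
for step in range(num_steps):
    batch = timed("data", lambda: next(data_iter), timings)
    loss = timed("forward", lambda: model(**batch).loss, timings)
    timed("backward", lambda: loss.backward(), timings)
    timed("step", lambda: optimizer.step(), timings)
    optimizer.zero_grad()

print({k: f"{v:.2f}s" for k, v in timings.items()})
```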

Closing because we have no further information. Feel free to reopen if the problem still exists.

tohtana avatar May 22 '23 21:05 tohtana