DeepSpeedExamples

Running step 3 with OPT-66B reports an out-of-memory error

Open lljjgg opened this issue 2 years ago • 3 comments

When I run step 3, I get an out-of-memory error. The error occurs at line 1072 of /usr/local/lib/python3.10/dist-packages/deepspeed-0.10.1+cbf2f61a-py3.10.egg/deepspeed/runtime/engine.py (if not self.dont_change_device: self.module.to(self.device)). The program copies the entire model onto the GPU, which causes the memory overflow. What is the solution? I am using A100 machines, and I enabled offload and offload_reference_model when I resubmitted the job. I look forward to your reply.

lljjgg avatar Oct 10 '23 11:10 lljjgg

Hi @lljjgg can you please share more information about the machine you are running on? How many A100s does it have? Are they 40GB or 80GB cards? Can you share the exact command you ran? Thank you!

mrwyattii avatar Oct 10 '23 16:10 mrwyattii

I used 32 machines, each GPU has 80 GB of memory, and I used the ZeRO-3 strategy. I found that the whole model is moved to the device at "self.module.to(self.device)", which is where the out-of-memory error occurs. What I am trying to figure out is how to distribute the model across the GPUs before that point.

lljjgg avatar Oct 11 '23 01:10 lljjgg

I have the same problem (OOM) when training a 33B LLaMA model. Setting inference_tp_size to an appropriate value may solve it, but then I run into a tensor shape mismatch.

lusongshuo-mt avatar Nov 29 '23 03:11 lusongshuo-mt
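
For context on the numbers discussed in this thread, here is a rough back-of-the-envelope sketch of why moving an unpartitioned 66B-parameter model onto a single 80 GB GPU cannot fit, and how small the per-GPU share becomes once the weights are partitioned. It assumes fp16 weights (2 bytes per parameter) and counts weights only, ignoring activations, gradients, and optimizer state. The helper names and the GPU count of 32 (one GPU per machine, which the thread does not confirm) are hypothetical and used only for illustration.

```python
GIB = 2**30  # bytes in one GiB


def weights_gib(num_params, bytes_per_param=2):
    """Memory needed for the model weights alone, in GiB.

    bytes_per_param=2 assumes fp16/bf16 weights (an assumption, not
    something stated in the thread).
    """
    return num_params * bytes_per_param / GIB


def per_gpu_weights_gib(num_params, num_gpus, bytes_per_param=2):
    """Per-GPU share of the weights if they are split evenly across GPUs."""
    return weights_gib(num_params, bytes_per_param) / num_gpus


if __name__ == "__main__":
    opt_66b = 66e9  # parameter count of the OPT-66B model from the issue title
    gpu_capacity_gib = 80  # A100 80 GB card, treated here as roughly 80 GiB

    full = weights_gib(opt_66b)
    # Hypothetical: 32 GPUs in total (the thread only says "32 machines").
    shard = per_gpu_weights_gib(opt_66b, num_gpus=32)

    print(f"full fp16 weights on one GPU: {full:.1f} GiB")
    print(f"per-GPU share across 32 GPUs: {shard:.1f} GiB")
    print(f"full copy fits on one card:   {full < gpu_capacity_gib}")
```

Under these assumptions the full fp16 weights alone come to roughly 123 GiB, which exceeds a single 80 GB card before any activations or optimizer state are counted. This is consistent with the OOM reported at self.module.to(self.device) if the module is still unpartitioned at that point.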