DeepSpeedExamples
Running step 3 with 66B OPT reports an out-of-memory error
When I run step 3, I get an out-of-memory error. The error occurs at line 1072 of /usr/local/lib/python3.10/dist-packages/deepspeed-0.10.1+cbf2f61a-py3.10.egg/deepspeed/runtime/engine.py (`if not self.dont_change_device: self.module.to(self.device)`). The program copies the entire model onto the GPU, which causes the memory overflow. What is the solution? I am using A100 machines, and I enabled offload and offload_reference_model when I resubmitted the program. I look forward to your reply.
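For reference, a minimal sketch of what a ZeRO-3 config with CPU offload looks like; the exact config the DeepSpeed-Chat step-3 script builds may differ, and the batch size here is illustrative:

```python
# Sketch only: ZeRO stage 3 with parameter and optimizer offload to CPU.
# This reduces per-GPU memory pressure but does not by itself prevent the
# whole model from being materialized before partitioning.
ds_config = {
    "train_batch_size": 8,  # hypothetical value
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
```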
Hi @lljjgg, can you please share more information about the machine you are running on? How many A100s does it have? Are they 40GB or 80GB cards? Can you share the exact command you ran? Thank you!
I used 32 machines, and the GPUs have 80 GB of memory each, with the ZeRO-3 strategy. I find that the whole model is loaded at `self.module.to(self.device)`, which is why the out-of-memory error occurs. What I'm wondering is how to partition the model across the different GPUs before that step.
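One way ZeRO-3 avoids materializing the full model on a single GPU is to construct the model inside `deepspeed.zero.Init`, which partitions parameters across ranks as they are created. A minimal sketch, assuming the `ds_config` from the snippet above and using `facebook/opt-66b` as a placeholder model path (the step-3 script may handle this differently internally):

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Sketch only: parameters are partitioned across data-parallel ranks at
# construction time, so the full 66B model is never copied onto one GPU
# by self.module.to(self.device).
with deepspeed.zero.Init(config_dict_or_path=ds_config, dtype=torch.bfloat16):
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-66b")

engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
```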
Same problem for 33B LLaMA training (OOM). Setting inference_tp_size to an appropriate value may solve this problem, but then I run into a tensor shape mismatch.
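For context, in DeepSpeed-Chat the `--inference_tp_size` flag feeds the hybrid-engine section of the DeepSpeed config. A hedged sketch of that block, with field names taken from DeepSpeedHybridEngineConfig and values that are purely illustrative:

```python
# Sketch only: hybrid-engine settings controlling tensor parallelism during
# the generation phase of step 3; the tp size of 8 is an example value.
ds_config["hybrid_engine"] = {
    "enabled": True,
    "inference_tp_size": 8,          # tensor-parallel degree used for generation
    "release_inference_cache": False,
    "pin_parameters": True,
    "tp_gather_partition_size": 8,
}
```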