Olatunji Ruwase
> other three, but every time the overflow happens, all 4 workers overflow. I don't think this is a normal situation.

@QingtaoLi1, this is probably because overflow checking is based...
Closing due to lack of response. Please reopen if needed.
@yangyihang-bytedance, can you please confirm if this was fixed by #5606? Thanks!
@LaosGAmin, can you share your log or stack trace?
@LaosGAmin, yes, the error message will contain the stack trace. You can also share the full output log.
@LaosGAmin, sorry, I previously misunderstood your question. I now understand that you are trying to analyze the memory consumption of your run. Can you share how you are currently measuring...
@LaosGAmin, both `nvidia-smi` and `torch.cuda.max_memory_reserved()` report more than the current GPU memory consumption. A more precise API is `torch.cuda.memory_allocated`: https://pytorch.org/docs/stable/generated/torch.cuda.memory_allocated.html#torch.cuda.memory_allocated. You can also instrument your code with DeepSpeed utility...
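For reference, here is a minimal sketch of the kind of instrumentation I mean; only the `torch.cuda` calls are the point, the surrounding structure and tags are placeholders:

```python
import torch

def report_gpu_memory(tag=""):
    # memory_allocated: bytes actually occupied by live tensors on the device.
    # memory_reserved: bytes held by PyTorch's caching allocator (>= allocated),
    # which is why it and nvidia-smi overstate the real consumption.
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[{tag}] allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB peak_allocated={peak:.2f} GiB")

# Call this around the step you want to profile, e.g.:
report_gpu_memory("before step")
# ... forward / backward / optimizer step ...
report_gpu_memory("after step")
```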
@larry-fuy, to enable ZeRO stage 3 offloading to cpu/nvme, `enabled` must be `True` in `deepspeed.zero.Init()`. Please see this [tutorial](https://www.deepspeed.ai/tutorials/zero/#training-trillion-scale-models-with-zero-infinity) for using this feature (a.k.a. ZeRO-Infinity). Here are answers to your...
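To illustrate the point about `enabled=True`, here is a minimal sketch; the config values, the `MyModel` class, and the optimizer settings are placeholders you should adapt to your run:

```python
import deepspeed

# Placeholder ZeRO-Infinity style config: stage 3 with cpu offload of
# parameters and optimizer state (use "nvme" plus "nvme_path" for nvme offload).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
        "offload_optimizer": {"device": "cpu"},
    },
}

# Parameters constructed inside this context are partitioned (and offloaded)
# as they are created; with enabled=False the context is a no-op.
with deepspeed.zero.Init(config_dict_or_path=ds_config, enabled=True):
    model = MyModel()  # hypothetical model class

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```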
> **To Reproduce**
> Uploading all codings:
> [deep_speed_issue.zip](https://github.com/user-attachments/files/20023023/deep_speed_issue.zip)

@mmkjj, thanks for providing a repro. Instead of a zip file, can you please share it as a gist or a repo? Thanks!
@Oruli, I noticed in your OP that the failure occurs during a `send/recv` operation. Can you also try the p2p tests in the communication benchmark suite? https://github.com/deepspeedai/DeepSpeedExamples/tree/master/benchmarks/communication
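If it helps isolate the problem, a bare `torch.distributed` send/recv sanity check (independent of the benchmark suite) is also worth a try; a minimal sketch, assuming two GPUs and a launch via `torchrun --nproc_per_node=2 p2p_check.py`:

```python
import os
import torch
import torch.distributed as dist

def main():
    # Ranks, world size, and rendezvous info come from the torchrun launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    rank = dist.get_rank()

    tensor = torch.ones(1024, 1024, device="cuda") * rank
    if rank == 0:
        dist.send(tensor, dst=1)   # rank 0 sends to rank 1
    elif rank == 1:
        dist.recv(tensor, src=0)   # rank 1 receives from rank 0
        print(f"rank 1 received tensor with mean {tensor.mean().item():.1f}")

    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```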