Olatunji Ruwase
> other three, but every time the overflow happens, all 4 workers overflow. I don't think this is a normal situation.

@QingtaoLi1, this is probably because overflow checking is based...
Closing due to lack of response. Please reopen if needed.
@yangyihang-bytedance, can you please confirm if this was fixed by #5606? Thanks!
@LaosGAmin, can you share your log or stack trace?
@LaosGAmin, yes, the error message will contain the stack trace. You can also share the full output log.
@LaosGAmin, sorry, I previously misunderstood your question. I now understand that you are trying to analyze the memory consumption of your run. Can you share how you are currently measuring...
@LaosGAmin, both `nvidia-smi` and `torch.cuda.max_memory_reserved()` report more than the current GPU memory consumption. A more precise API is `torch.cuda.memory_allocated`: https://pytorch.org/docs/stable/generated/torch.cuda.memory_allocated.html#torch.cuda.memory_allocated. You can also instrument your code with DeepSpeed utility...
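For reference, here is a minimal sketch of the kind of instrumentation I mean; only the `torch.cuda` calls are the point, the surrounding structure and tags are placeholders:

```python
import torch

def report_gpu_memory(tag=""):
    # memory_allocated: bytes actually occupied by live tensors on the device.
    # memory_reserved: bytes held by PyTorch's caching allocator (>= allocated),
    # which is why it and nvidia-smi overstate the real consumption.
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[{tag}] allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB peak_allocated={peak:.2f} GiB")

# Call this around the step you want to profile, e.g.:
report_gpu_memory("before step")
# ... forward / backward / optimizer step ...
report_gpu_memory("after step")
```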
@larry-fuy, to enable ZeRO stage 3 offloading to cpu/nvme, `enabled` must be `True` in `deepspeed.zero.Init()`. Please see this [tutorial](https://www.deepspeed.ai/tutorials/zero/#training-trillion-scale-models-with-zero-infinity) for using this feature (a.k.a. ZeRO-Infinity). Here are answers to your...
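To illustrate the point about `enabled=True`, here is a minimal sketch; the config values, the `MyModel` class, and the optimizer settings are placeholders you should adapt to your run:

```python
import deepspeed

# Placeholder ZeRO-Infinity style config: stage 3 with cpu offload of
# parameters and optimizer state (use "nvme" plus "nvme_path" for nvme offload).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
        "offload_optimizer": {"device": "cpu"},
    },
}

# Parameters constructed inside this context are partitioned (and offloaded)
# as they are created; with enabled=False the context is a no-op.
with deepspeed.zero.Init(config_dict_or_path=ds_config, enabled=True):
    model = MyModel()  # hypothetical model class

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```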
> **To Reproduce**
> Uploading all codings:
> [deep_speed_issue.zip](https://github.com/user-attachments/files/20023023/deep_speed_issue.zip)

@mmkjj, thanks for providing a repro. Instead of a zip file, can you please share it as a gist or a repo? Thanks!
@Oruli, I noticed in your OP that the failure occurs during a `send/recv` operation. Can you also try the p2p tests in the communication benchmark suite? https://github.com/deepspeedai/DeepSpeedExamples/tree/master/benchmarks/communication
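If it helps isolate the problem, a bare `torch.distributed` send/recv sanity check (independent of the benchmark suite) is also worth a try; a minimal sketch, assuming two GPUs and a launch via `torchrun --nproc_per_node=2 p2p_check.py`:

```python
import os
import torch
import torch.distributed as dist

def main():
    # Ranks, world size, and rendezvous info come from the torchrun launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    rank = dist.get_rank()

    tensor = torch.ones(1024, 1024, device="cuda") * rank
    if rank == 0:
        dist.send(tensor, dst=1)   # rank 0 sends to rank 1
    elif rank == 1:
        dist.recv(tensor, src=0)   # rank 1 receives from rank 0
        print(f"rank 1 received tensor with mean {tensor.mean().item():.1f}")

    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```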