Olatunji Ruwase
@xylian86, thanks for this great work. Can you please add convergence curves of an HF model as a demo?
@Coobiw, you can use the [GatheredParameters](https://deepspeed.readthedocs.io/en/latest/zero3.html#deepspeed.zero.GatheredParameters) context manager, which will automatically gather the parameters within the context and release them on exit. You can see a simple example usage of...
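In case it helps, the usage pattern looks roughly like this (a sketch, not a runnable script; `model` stands for any module whose parameters are ZeRO-3 partitioned, and `modifier_rank=0` is only needed if you intend to modify the parameters):

```python
import torch
import deepspeed

# Gather the full (unpartitioned) parameters inside the context;
# they are automatically re-partitioned and released on exit.
with deepspeed.zero.GatheredParameters(model.parameters(), modifier_rank=0):
    if torch.distributed.get_rank() == 0:
        # Safe to read or modify the full weights here; modifier_rank=0
        # tells ZeRO to broadcast rank 0's changes to all ranks on exit.
        ...
```

Note that if the parameters are not actually partitioned (e.g. outside ZeRO-3), the context is a no-op.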
@Coobiw, can you share your full script to help us repro on our side? Is this a dense or MoE model? In terms of debugging, can you use prints to...
Another cause of a hang like this is prompt length or generation length differing across the GPUs. This is because zero-inference is a data-parallel algorithm.
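One simple workaround is to equalize lengths before generation: right-pad every prompt to a common length (and use a fixed generation length) so that all ranks execute the same number of forward/collective steps. A minimal stdlib sketch of the idea (`pad_to_common` is a hypothetical helper, not a DeepSpeed API):

```python
def pad_to_common(prompts, pad_id=0):
    """Right-pad token-id lists so every rank sees the same prompt length.

    In data-parallel zero-inference, each forward pass involves collective
    communication; if one rank runs more decode steps than another, the
    ranks deadlock waiting on each other. Equalizing lengths keeps the
    step count identical across ranks.
    """
    max_len = max(len(p) for p in prompts)
    return [p + [pad_id] * (max_len - len(p)) for p in prompts]

# Example: three ranks with uneven prompt lengths.
prompts = [[5, 6], [7, 8, 9, 10], [11]]
padded = pad_to_common(prompts)
print([len(p) for p in padded])  # every rank now processes length-4 input
```

In practice you would pad with the tokenizer's actual pad token and mask out the padding via the attention mask.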
@Coobiw, I think we need to first confirm that different prompt/generation lengths are responsible. Can you force all the ranks to process the exact same prompt?
@bug-fixed, are you able to share repro for `zero3-offload` case? Thanks!
> Hi @tjruwase, Could you review this pls?

@i4never, thanks for this PR. This is very old code, which is documented as hacky, and unfortunately the author is no longer...
@dogacancolak-kensho, thanks for offering a PR for this useful enhancement. Please submit the PR at your convenience. Thanks!
@nelyahu, thanks for the cleanup work.
@Orion-Zheng, this is expected because universal checkpointing requires some metadata to be saved by the client in the checkpoint. At this time, we have only modified the Megatron-DeepSpeed [client](https://github.com/microsoft/Megatron-DeepSpeed/blob/bcedecd1ff788d4d363f3365fd396053a08d65be/megatron/checkpointing.py#L259) to save...