Olatunji Ruwase
@xylian86, thanks for this great work. Can you please add convergence curves of an HF model as a demo?
@Coobiw, you can use the [GatheredParameters](https://deepspeed.readthedocs.io/en/latest/zero3.html#deepspeed.zero.GatheredParameters) context manager, which will automatically gather the parameters within the context and release them on exit. You can see a simple example usage of...
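In case it helps, the usage pattern looks roughly like this (a sketch, not a runnable script; `model` stands for any module whose parameters are ZeRO-3 partitioned, and `modifier_rank=0` is only needed if you intend to modify the parameters):

```python
import torch
import deepspeed

# Gather the full (unpartitioned) parameters inside the context;
# they are automatically re-partitioned and released on exit.
with deepspeed.zero.GatheredParameters(model.parameters(), modifier_rank=0):
    if torch.distributed.get_rank() == 0:
        # Safe to read or modify the full weights here; modifier_rank=0
        # tells ZeRO to broadcast rank 0's changes to all ranks on exit.
        ...
```

Note that if the parameters are not actually partitioned (e.g. outside ZeRO-3), the context is a no-op.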
@Coobiw, can you share your full script to help us repro on our side? Is this a dense or MoE model? In terms of debugging, can you use prints to...
Another cause of a hang like this is prompt length or generation length differing across the GPUs. This is because zero-inference is a data-parallel algorithm.
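One simple workaround is to equalize lengths before generation: right-pad every prompt to a common length (and use a fixed generation length) so that all ranks execute the same number of forward/collective steps. A minimal stdlib sketch of the idea (`pad_to_common` is a hypothetical helper, not a DeepSpeed API):

```python
def pad_to_common(prompts, pad_id=0):
    """Right-pad token-id lists so every rank sees the same prompt length.

    In data-parallel zero-inference, each forward pass involves collective
    communication; if one rank runs more decode steps than another, the
    ranks deadlock waiting on each other. Equalizing lengths keeps the
    step count identical across ranks.
    """
    max_len = max(len(p) for p in prompts)
    return [p + [pad_id] * (max_len - len(p)) for p in prompts]

# Example: three ranks with uneven prompt lengths.
prompts = [[5, 6], [7, 8, 9, 10], [11]]
padded = pad_to_common(prompts)
print([len(p) for p in padded])  # every rank now processes length-4 input
```

In practice you would pad with the tokenizer's actual pad token and mask out the padding via the attention mask.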
@Coobiw, I think we need to first confirm that different prompt/generation lengths are responsible. Can you force all the ranks to process the exact same prompt?
@bug-fixed, are you able to share repro for `zero3-offload` case? Thanks!
> Hi @tjruwase, Could you review this pls?

@i4never, thanks for this PR. This is very old code, which is documented as hacky, and unfortunately the author is no longer...
@dogacancolak-kensho, thanks for offering a PR for this useful enhancement. Please submit the PR at your convenience. Thanks!
@nelyahu, thanks for the cleanup work.
@Orion-Zheng, this is expected because universal checkpointing requires some metadata to be saved by the client in the checkpoint. At this time, we have only modified the Megatron-DeepSpeed [client](https://github.com/microsoft/Megatron-DeepSpeed/blob/bcedecd1ff788d4d363f3365fd396053a08d65be/megatron/checkpointing.py#L259) to save...