Olatunji Ruwase
@mbetser, thanks for reporting this error. Can you please share a simple script and steps to reproduce this issue?
@xylian86, can you please help with this?
@Griffintaur, can you please see if this new API can help? https://github.com/microsoft/DeepSpeed/pull/4966
@Liangliang-Ma, apologies for the delay. I am still thinking about your last comment, but will not delay this PR.
@torshie, thanks for the update. We have only tested cpu-offload with ZeRO stage 2, not with stage 1. I hope ZeRO stage 2 can work for your scenario,...
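For reference, here is a minimal sketch of the tested combination: a ZeRO stage 2 config with optimizer-state CPU offload. Field names follow the public DeepSpeed config schema; the batch size is a placeholder.

```python
# Minimal DeepSpeed config sketch: ZeRO stage 2 with optimizer-state
# CPU offload. Batch size is a placeholder for illustration.
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 2,              # ZeRO stage 2: shard optimizer states and gradients
        "offload_optimizer": {
            "device": "cpu",     # keep optimizer states in host memory
            "pin_memory": True,  # pinned host buffers for faster transfers
        },
    },
    "fp16": {"enabled": True},
}

print(ds_config["zero_optimization"]["stage"])
```

A config like this would typically be passed to `deepspeed.initialize(..., config=ds_config)`; switching `"stage"` to 1 is the untested case mentioned above.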
@nelyahu, I was unaware, so thanks for bringing this to my attention.
@dogacancolak-kensho, you need to create a PR to be reviewed in order to merge your changes. As standard practice, contributors cannot push directly to the main branch.
> i did offline debugging of those failure and improved the code change so it will pass

@nelyahu, it's great that you narrowed this down. Do you think a unit...
ZeRO-Inference is composable with Megatron-style TP. That is, TP is implemented on the client side.
I assume you are referring to [kv cache offloading](https://github.com/microsoft/DeepSpeedExamples/tree/master/inference/huggingface/zero_inference) in the latest zero-inference. We did not evaluate with TP, but I expect it should work.
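To make the setup concrete, here is a sketch of a ZeRO-Inference-style config: ZeRO stage 3 with parameters offloaded to CPU. Field names follow the public DeepSpeed config schema, but this is illustrative, not the exact config from the linked example.

```python
# Sketch of a ZeRO-Inference-style config: ZeRO stage 3 with model
# parameters offloaded to CPU, so weights are streamed to GPU on demand.
zero_inference_config = {
    "train_batch_size": 1,       # inference-only: no gradient accumulation
    "zero_optimization": {
        "stage": 3,              # stage 3 shards (and can offload) parameters
        "offload_param": {
            "device": "cpu",     # hold full weights in host memory
            "pin_memory": True,
        },
    },
    "fp16": {"enabled": True},
}

print(zero_inference_config["zero_optimization"]["offload_param"]["device"])
```

Under client-side Megatron-style TP, each rank would apply a config like this to its own TP shard; as noted above, that combination was not evaluated but is expected to work.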