Max Kovalenko
We've discovered the following issues in the current implementation of the Throughput timer:

- The timer invokes synchronize() twice on each step, at [start](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/utils/timer.py#L240) and [stop](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/utils/timer.py#L252).
- Calling synchronize() ensures...
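To illustrate the direction such a fix could take, here is a minimal sketch (not DeepSpeed's actual timer code; the `EventTimer` class name is illustrative) that records CUDA events at start/stop and defers the single synchronize() to the point where the elapsed time is actually read, e.g. at logging intervals:

```
import torch

class EventTimer:
    """Illustrative timer: record CUDA events at start/stop and
    synchronize only when the elapsed time is actually consumed."""

    def __init__(self):
        self._start = torch.cuda.Event(enable_timing=True)
        self._stop = torch.cuda.Event(enable_timing=True)

    def start(self):
        self._start.record()   # asynchronous, no host/device sync

    def stop(self):
        self._stop.record()    # still asynchronous

    def elapsed_ms(self):
        # One synchronize per measurement read (e.g. per logging interval),
        # instead of two synchronize() calls per training step.
        self._stop.synchronize()
        return self._start.elapsed_time(self._stop)
```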
Hi @loadams, all the requested changes are done. Could you please review and trigger the CI? Thanks
> @deepcharm, thanks for this interesting approach. Can you share some observed performance gains?

@tjruwase We have observed around a 9% performance gain on HPU in BERT workloads.
> Hi @deepcharm
>
> Thx for the PR. Just curious why allreduce could be faster than allgather? allreduce basically is doing reduce-scatter + all-gather. Could we just make allgather...
> > @deepcharm, I was not aware that narrow, cat, copy operations on device tensors incurred high CPU overhead. I would like to learn more. Can you share the reason?...
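For context on the allreduce-vs-allgather question above, here is a minimal sketch of the general idea under discussion (not the PR's actual code; the function name is illustrative): each rank copies its shard into its own slice of a zero-initialized flat buffer, and a single all_reduce (sum) reconstructs the gathered tensor, avoiding the per-partition narrow/cat/copy bookkeeping that a plain all_gather path may require. It assumes equal-sized shards and an initialized process group:

```
import torch
import torch.distributed as dist

def all_gather_via_all_reduce(shard: torch.Tensor) -> torch.Tensor:
    """Gather equal-sized shards from all ranks with one all_reduce.

    Every slice of `flat` other than this rank's own stays zero, so the
    element-wise sum across ranks yields the concatenation of all shards.
    """
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    numel = shard.numel()
    flat = torch.zeros(world_size * numel, dtype=shard.dtype, device=shard.device)
    flat.narrow(0, rank * numel, numel).copy_(shard)
    dist.all_reduce(flat)   # defaults to SUM
    return flat
```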
Hi @tjruwase, for some reason the PR has been removed from the merge-queue. Can you please re-add it? Thanks
A brute-force solution is to enforce the `.requires_grad` to be `True` for the model input tensors:
```
class PostBackwardFunctionModule(torch.autograd.Function):
    @staticmethod
    def forward(ctx, output):
        # `module` is assumed to come from the enclosing scope (snippet truncated)
        ctx.module = module
        if not output.requires_grad:
            ...
```
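Since the snippet above is cut off, the sketch below shows the same brute-force idea end to end using a forward pre-hook instead of a custom autograd Function; the hook name `force_inputs_require_grad` is hypothetical and this is not the original snippet's continuation:

```
import torch
import torch.nn as nn

def force_inputs_require_grad(module, inputs):
    # Hypothetical forward pre-hook: mark floating-point input tensors as
    # requiring grad so that backward hooks on the module are guaranteed to fire.
    return tuple(
        t.requires_grad_(True)
        if isinstance(t, torch.Tensor) and t.is_floating_point() and not t.requires_grad
        else t
        for t in inputs
    )

model = nn.Linear(4, 4)
model.register_forward_pre_hook(force_inputs_require_grad)

x = torch.randn(2, 4)        # requires_grad is False by default
loss = model(x).sum()
loss.backward()              # backward now also reaches the input tensor
```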