Max Kovalenko
We've discovered the following issues in the current implementation of the Throughput timer:

- The timer invokes synchronize() twice on each step, at [start](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/utils/timer.py#L240) and [stop](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/utils/timer.py#L252).
- Calling synchronize() ensures...
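To illustrate the direction such a fix could take, here is a minimal sketch (not DeepSpeed's actual timer code; the `EventTimer` class name is illustrative) that records CUDA events at start/stop and defers the single synchronize() to the point where the elapsed time is actually read, e.g. at logging intervals:

```
import torch

class EventTimer:
    """Illustrative timer: record CUDA events at start/stop and
    synchronize only when the elapsed time is actually consumed."""

    def __init__(self):
        self._start = torch.cuda.Event(enable_timing=True)
        self._stop = torch.cuda.Event(enable_timing=True)

    def start(self):
        self._start.record()   # asynchronous, no host/device sync

    def stop(self):
        self._stop.record()    # still asynchronous

    def elapsed_ms(self):
        # One synchronize per measurement read (e.g. per logging interval),
        # instead of two synchronize() calls per training step.
        self._stop.synchronize()
        return self._start.elapsed_time(self._stop)
```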
Hi @loadams, all the requested changes are done. Could you please review and trigger the CI? Thanks
> @deepcharm, thanks for this interesting approach. Can you share some observed performance gains?

@tjruwase We have observed around a 9% performance gain on HPU in BERT workloads.
> Hi @deepcharm
>
> Thx for the PR. Just curious why allreduce could be faster than allgather? allreduce basically is doing reduce-scatter + all-gather. Could we just make allgather...
> > @deepcharm, I was not aware that narrow, cat, copy operations on device tensors incurred high CPU overhead. I would like to learn more. Can you share the reason?...
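For context on the allreduce-vs-allgather question above, here is a minimal sketch of the general idea under discussion (not the PR's actual code; the function name is illustrative): each rank copies its shard into its own slice of a zero-initialized flat buffer, and a single all_reduce (sum) reconstructs the gathered tensor, avoiding the per-partition narrow/cat/copy bookkeeping that a plain all_gather path may require. It assumes equal-sized shards and an initialized process group:

```
import torch
import torch.distributed as dist

def all_gather_via_all_reduce(shard: torch.Tensor) -> torch.Tensor:
    """Gather equal-sized shards from all ranks with one all_reduce.

    Every slice of `flat` other than this rank's own stays zero, so the
    element-wise sum across ranks yields the concatenation of all shards.
    """
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    numel = shard.numel()
    flat = torch.zeros(world_size * numel, dtype=shard.dtype, device=shard.device)
    flat.narrow(0, rank * numel, numel).copy_(shard)
    dist.all_reduce(flat)   # defaults to SUM
    return flat
```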
Hi @tjruwase, for some reason the PR has been removed from the merge-queue. Can you please re-add it? Thanks
A brute-force solution is to enforce the `.requires_grad` to be `True` for the model input tensors:
```
class PostBackwardFunctionModule(torch.autograd.Function):
    @staticmethod
    def forward(ctx, output):
        # `module` is assumed to come from the enclosing scope (snippet truncated)
        ctx.module = module
        if not output.requires_grad:
            ...
```
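Since the snippet above is cut off, the sketch below shows the same brute-force idea end to end using a forward pre-hook instead of a custom autograd Function; the hook name `force_inputs_require_grad` is hypothetical and this is not the original snippet's continuation:

```
import torch
import torch.nn as nn

def force_inputs_require_grad(module, inputs):
    # Hypothetical forward pre-hook: mark floating-point input tensors as
    # requiring grad so that backward hooks on the module are guaranteed to fire.
    return tuple(
        t.requires_grad_(True)
        if isinstance(t, torch.Tensor) and t.is_floating_point() and not t.requires_grad
        else t
        for t in inputs
    )

model = nn.Linear(4, 4)
model.register_forward_pre_hook(force_inputs_require_grad)

x = torch.randn(2, 4)        # requires_grad is False by default
loss = model(x).sum()
loss.backward()              # backward now also reaches the input tensor
```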