
Implementation of FedNova

Open omarfoq opened this issue 3 years ago • 1 comment

Hello,

I am not sure about this, but there seems to be a slight mismatch between the implementation of FedNova and its description in the paper. In the current implementation, the cumulative sum of gradients is computed without multiplying by the weights a_{i}, even when etamu is not zero (see below):

# update accumulated local updates
if 'cum_grad' not in param_state:
      param_state['cum_grad'] = torch.clone(d_p).detach()
      param_state['cum_grad'].mul_(local_lr)
else:
      param_state['cum_grad'].add_(local_lr, d_p)

p.data.add_(-local_lr, d_p)

It probably needs an additional line that multiplies d_p by a_i (a variable that would also need to be tracked) before updating param_state['cum_grad'], as sketched below.
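
For concreteness, the change I have in mind looks roughly like this (just a sketch; a_i is a hypothetical per-step weight that does not exist in the current code and would have to be tracked separately):

# hypothetical: weight the step update by a_i before accumulating it
if 'cum_grad' not in param_state:
      param_state['cum_grad'] = torch.clone(d_p).detach()
      param_state['cum_grad'].mul_(local_lr * a_i)
else:
      param_state['cum_grad'].add_(local_lr * a_i * d_p)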

Is this correct, or am I missing something?

Thanks in advance for your help.

omarfoq avatar Jun 24 '21 12:06 omarfoq

Hi. d_p starts out as the gradient, but by the end of the step it is a composite update built from several factors: gradient, momentum, variance correction, the proximal term, and so on. So param_state['cum_grad'] accumulates the step updates d_p, not the raw gradients. By an equivalent mathematical transformation, you can see that this accumulated update param_state['cum_grad'] equals the sum of a_i[j] * grad_step_j. See page 6 of the paper for a better understanding.
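
To make the equivalence concrete, here is a quick numerical sanity check (just a sketch, assuming plain local SGD with momentum and none of the other correction terms; the names rho, tau, local_lr, and grads are illustrative, not taken from the repository). Accumulating local_lr * d_p step by step gives the same result as summing the raw gradients weighted by a_k = (1 - rho**(tau - k)) / (1 - rho):

import torch

torch.manual_seed(0)
rho, local_lr, tau = 0.9, 0.1, 5              # momentum factor, local learning rate, number of local steps
grads = [torch.randn(3) for _ in range(tau)]  # stand-in local gradients g_0, ..., g_{tau-1}

# Accumulate local_lr * d_p, where d_p is the momentum buffer, as the optimizer does.
buf = torch.zeros(3)
cum_grad = torch.zeros(3)
for g in grads:
    buf = rho * buf + g          # d_p at this local step
    cum_grad += local_lr * buf   # cum_grad accumulates local_lr * d_p

# Sum of a_k-weighted raw gradients, with a_k = (1 - rho**(tau - k)) / (1 - rho).
weighted = sum(
    local_lr * (1 - rho ** (tau - k)) / (1 - rho) * g
    for k, g in enumerate(grads)
)

print(torch.allclose(cum_grad, weighted))  # prints True

The weights a_k here are just the partial geometric sums of the momentum factor, which is the equivalence described on page 6 of the paper.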

nthhiep avatar Dec 23 '21 08:12 nthhiep