GSM
Confused by the code that uses gradient clipping and gradient accumulation simultaneously
Take args.iter_size == 2 as an example. I think the clipped and accumulated gradients in your code are clip(clip(grads1) + grads2), not clip(grads1 + grads2), and the latter makes more sense to me.
I haven't run the code yet; I just wonder whether this is a problem.
The second case indeed makes more sense. However, I am not sure whether it would have a significant impact on the final performance of the model. I will update the code to use the second setting and run it later.
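To make the difference between the two orderings concrete, here is a minimal sketch in plain Python. It assumes clipping means rescaling by the global L2 norm, which may not match what the repository actually does, and the helper names (clip_by_norm, add), the toy gradients, and max_norm are hypothetical values chosen only for illustration.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale grads so that their L2 norm is at most max_norm (assumed clipping rule)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def add(a, b):
    """Element-wise sum of two gradient vectors of equal length."""
    return [x + y for x, y in zip(a, b)]


# Toy gradients for two micro-batches (iter_size == 2); values are made up.
grads1 = [3.0, 4.0]
grads2 = [4.0, -3.0]
max_norm = 1.0

# Clipping after every backward pass: clip(clip(grads1) + grads2)
nested = clip_by_norm(add(clip_by_norm(grads1, max_norm), grads2), max_norm)

# Clipping once after accumulation: clip(grads1 + grads2)
single = clip_by_norm(add(grads1, grads2), max_norm)

print("clip(clip(grads1) + grads2):", nested)
print("clip(grads1 + grads2):      ", single)
```

In this toy case both results have norm max_norm but point in different directions: in the nested version grads1 is shrunk before grads2 is added, so the first micro-batch contributes less to the final update than the second.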