Shi Yu
To my understanding, the backward pass, gradient clipping, and the weight update are all based on the scaled loss, and the unscaled one is only used for logging. Is that right?
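Just to make my assumption concrete, here is a minimal sketch of the pattern I have in mind (the world-size loss scaling sometimes used when gradients cannot flow through `all_gather`; `model`, `optimizer`, `compute_loss`, and `batch` are placeholders, not names from this repo):

```python
import torch
import torch.distributed as dist

def training_step(model, optimizer, batch, compute_loss):
    # `compute_loss` is a hypothetical helper returning this rank's (unscaled) loss.
    optimizer.zero_grad()
    loss = compute_loss(model, batch)            # unscaled loss, kept only for logging

    scaled_loss = loss * dist.get_world_size()   # backward runs on the scaled loss
    scaled_loss.backward()

    # Gradient clipping and the weight update both see gradients of the scaled loss.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    return loss.item()                           # the logged value is the unscaled loss
```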
Thank you for your reply, that's interesting! I didn't realize that `all_gather` is not differentiable. I think the mechanism is the one described in this article: https://amsword.medium.com/gradient-backpropagation-with-torch-distributed-all-gather-9f3941a381f8, isn't it?
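If I read that article correctly, the trick is roughly a custom autograd function like the sketch below (my own rough version under that assumption, not code from this repo): the forward pass gathers features from every rank, and the backward pass all-reduces the incoming gradients and keeps only the slice belonging to the local input, so gradients still reach the local features.

```python
import torch
import torch.distributed as dist

class AllGatherWithGrad(torch.autograd.Function):
    # Hypothetical name for a differentiable all_gather wrapper.

    @staticmethod
    def forward(ctx, tensor):
        ctx.rank = dist.get_rank()
        ctx.local_size = tensor.shape[0]
        gathered = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, tensor)        # plain all_gather, no autograd tracking
        return torch.cat(gathered, dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        # Sum the gradients of the gathered tensor across ranks, then return only
        # the slice that corresponds to this rank's original input.
        grad_input = grad_output.clone()
        dist.all_reduce(grad_input, op=dist.ReduceOp.SUM)
        start = ctx.rank * ctx.local_size
        return grad_input[start : start + ctx.local_size]

# usage: all_features = AllGatherWithGrad.apply(local_features)
```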
OK, thanks! It would be nicer if you described this in more detail in the code comments :)
Could you try increasing the batch size? You could use multi-GPU training or gradient accumulation.
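For example, a minimal gradient accumulation sketch (the toy model and data below are only placeholders so the snippet runs on its own):

```python
import torch
import torch.nn as nn

# Toy model and data just to make the sketch self-contained; swap in the real ones.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
dataloader = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(8)]

accumulation_steps = 4   # effective batch size = 8 * 4 = 32

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    loss = loss_fn(model(inputs), targets)
    (loss / accumulation_steps).backward()   # scale so accumulated grads match one large batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # update once every `accumulation_steps` batches
        optimizer.zero_grad()
```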
OK