Shi Yu
To my understanding, the backward pass, gradient clipping, and the weight update are all based on the scaled loss, and the unscaled one is only used for logging. Is that right?
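Just to make my assumption concrete, here is a minimal sketch of the pattern I have in mind (the world-size loss scaling sometimes used when gradients cannot flow through `all_gather`; `model`, `optimizer`, `compute_loss`, and `batch` are placeholders, not names from this repo):

```python
import torch
import torch.distributed as dist

def training_step(model, optimizer, batch, compute_loss):
    # `compute_loss` is a hypothetical helper returning this rank's (unscaled) loss.
    optimizer.zero_grad()
    loss = compute_loss(model, batch)            # unscaled loss, kept only for logging

    scaled_loss = loss * dist.get_world_size()   # backward runs on the scaled loss
    scaled_loss.backward()

    # Gradient clipping and the weight update both see gradients of the scaled loss.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    return loss.item()                           # the logged value is the unscaled loss
```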
Thank you for your reply, that's interesting! I didn't realize that `all_gather` is not differentiable. I think the mechanism is the one described in this article: https://amsword.medium.com/gradient-backpropagation-with-torch-distributed-all-gather-9f3941a381f8, isn't it?
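If I read that article correctly, the trick is roughly a custom autograd function like the sketch below (my own rough version under that assumption, not code from this repo): the forward pass gathers features from every rank, and the backward pass all-reduces the incoming gradients and keeps only the slice belonging to the local input, so gradients still reach the local features.

```python
import torch
import torch.distributed as dist

class AllGatherWithGrad(torch.autograd.Function):
    # Hypothetical name for a differentiable all_gather wrapper.

    @staticmethod
    def forward(ctx, tensor):
        ctx.rank = dist.get_rank()
        ctx.local_size = tensor.shape[0]
        gathered = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, tensor)        # plain all_gather, no autograd tracking
        return torch.cat(gathered, dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        # Sum the gradients of the gathered tensor across ranks, then return only
        # the slice that corresponds to this rank's original input.
        grad_input = grad_output.clone()
        dist.all_reduce(grad_input, op=dist.ReduceOp.SUM)
        start = ctx.rank * ctx.local_size
        return grad_input[start : start + ctx.local_size]

# usage: all_features = AllGatherWithGrad.apply(local_features)
```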
OK, thanks! It would be nicer if you described this in more detail in the code comments :)
Could you try increasing the batch size? You could use multi-GPU training or gradient accumulation.
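For example, a minimal gradient accumulation sketch (the toy model and data below are only placeholders so the snippet runs on its own):

```python
import torch
import torch.nn as nn

# Toy model and data just to make the sketch self-contained; swap in the real ones.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
dataloader = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(8)]

accumulation_steps = 4   # effective batch size = 8 * 4 = 32

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    loss = loss_fn(model(inputs), targets)
    (loss / accumulation_steps).backward()   # scale so accumulated grads match one large batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # update once every `accumulation_steps` batches
        optimizer.zero_grad()
```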
OK