2 comments of Zhong-Yi Li
Yes, as the paper indicates, the loss they used is KL divergence; however, when performing backprop in this scenario, the two losses are actually equivalent in terms of gradient calculation, since they differ only by a term that is constant with respect to the model parameters.
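A minimal sketch of that equivalence, assuming the comparison is between KL divergence and cross-entropy against a fixed soft target distribution (the second loss is not named above, so this is an assumption): the two losses differ only by the target entropy, which does not depend on the logits, so their gradients coincide.

```python
# Sketch: KL divergence vs. cross-entropy against a fixed target distribution.
# Assumption: the "two losses" are KL(target || softmax(logits)) and soft cross-entropy.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10, requires_grad=True)   # model outputs
target = F.softmax(torch.randn(4, 10), dim=-1)    # fixed soft targets (no grad)

log_probs = F.log_softmax(logits, dim=-1)

# KL divergence loss
kl_loss = F.kl_div(log_probs, target, reduction="batchmean")
kl_grad, = torch.autograd.grad(kl_loss, logits, retain_graph=True)

# Cross-entropy with soft targets: -sum(target * log_probs), averaged over the batch
ce_loss = -(target * log_probs).sum(dim=-1).mean()
ce_grad, = torch.autograd.grad(ce_loss, logits)

print(torch.allclose(kl_grad, ce_grad))  # True: the gradients w.r.t. the logits match
```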
I think PyTorch does [automatic differentiation](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html) for you. Baidu implemented their own backward function because they wanted their own optimized version. ([DeepSpeech2, Page 27](https://arxiv.org/pdf/1512.02595.pdf))
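For reference, here is a minimal sketch of how a hand-written backward pass is hooked into PyTorch via `torch.autograd.Function`. This is a toy example of the mechanism only, not Baidu's optimized CTC implementation.

```python
# Sketch: registering a custom backward in PyTorch instead of relying on autograd.
import torch

class SquareWithCustomBackward(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        # Hand-written gradient: d(x^2)/dx = 2x
        x, = ctx.saved_tensors
        return grad_output * 2 * x

x = torch.randn(3, requires_grad=True)
y = SquareWithCustomBackward.apply(x).sum()
y.backward()
print(torch.allclose(x.grad, 2 * x.detach()))  # True: matches what autograd would compute
```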