Zhong-Yi Li

2 comments by Zhong-Yi Li

Yes, as the paper indicates, the loss they used is KL divergence. However, when performing backprop in this scenario, the two losses are actually equivalent in terms of gradient calculation: they differ only by a term that is constant with respect to the model parameters, so that term contributes nothing to the gradient.
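A minimal sketch of this equivalence, assuming the second loss is cross-entropy against fixed soft targets (the tensor shapes and names here are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Fixed soft target distribution p (no gradient) and model logits z.
p = F.softmax(torch.randn(4, 10), dim=-1)
z = torch.randn(4, 10, requires_grad=True)

log_q = F.log_softmax(z, dim=-1)

# KL divergence: sum p * (log p - log q), averaged over the batch.
kl = F.kl_div(log_q, p, reduction="batchmean")
grad_kl, = torch.autograd.grad(kl, z, retain_graph=True)

# Cross-entropy with soft targets: -sum p * log q, averaged over the batch.
ce = -(p * log_q).sum(dim=-1).mean()
grad_ce, = torch.autograd.grad(ce, z)

# The two losses differ only by the entropy of p, which is constant
# w.r.t. z, so their gradients are identical.
print(torch.allclose(grad_kl, grad_ce, atol=1e-6))  # True
```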

I think PyTorch does [automatic differentiation](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html) for you. Baidu implemented their own backward function because they wanted their own optimized version. ([DeepSpeech2, Page 27](https://arxiv.org/pdf/1512.02595.pdf))
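For reference, a hand-written backward is plugged into autograd via `torch.autograd.Function`. The sketch below is only an illustrative toy loss, not Baidu's actual CTC code, but it shows the mechanism an optimized binding would use to supply its own gradient instead of letting autograd trace the forward pass:

```python
import torch

class SquareLoss(torch.autograd.Function):
    """Toy loss with a custom, hand-written backward."""

    @staticmethod
    def forward(ctx, x, target):
        diff = x - target
        ctx.save_for_backward(diff)
        return 0.5 * (diff ** 2).sum()

    @staticmethod
    def backward(ctx, grad_output):
        # Gradient supplied explicitly rather than derived by autograd.
        (diff,) = ctx.saved_tensors
        return grad_output * diff, None  # no gradient for target

x = torch.randn(3, requires_grad=True)
target = torch.randn(3)

loss = SquareLoss.apply(x, target)
loss.backward()
print(torch.allclose(x.grad, x - target))  # matches the analytic gradient
```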