
OCD

Chung-I opened this issue 5 years ago • 6 comments

Implements Optimal Completion Distillation. Adds a new config named libri_ocd_example.yaml which enables OCD training. Not well tested; there may still be bugs. Temperature annealing is not yet implemented; the temperature is currently fixed at 1e-8 (sharpest).
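Since annealing is not implemented yet, here is a rough sketch of the kind of schedule that could be plugged in later (purely illustrative; none of these names exist in the code):

```python
# Purely illustrative -- not part of this PR. A possible exponential schedule
# that anneals the OCD temperature from a soft value down to the sharpest one.
def annealed_temperature(step, initial_temp=1.0, final_temp=1e-8, decay_steps=100000):
    """Geometrically interpolate the temperature over `decay_steps` training steps."""
    if step >= decay_steps:
        return final_temp
    ratio = step / decay_steps
    return initial_temp * (final_temp / initial_temp) ** ratio
```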

Chung-I avatar Mar 12 '19 08:03 Chung-I

@Alexander-H-Liu I think this is a wonderful PR. Can you merge it ASAP?

Liangtaiwan avatar Apr 29 '19 04:04 Liangtaiwan

@Chung-I I noticed that you used cross-entropy in ocd_loss rather than KL divergence (which is the official loss in the paper 'Optimal Completion Distillation for Sequence Learning'). Is this PR a correct implementation of ocd_loss? Thanks.

xingchensong avatar May 23 '19 08:05 xingchensong

Should ocd_loss be something like this?

optimal_probs = F.softmax(q_val / temp, dim=-1)
loss += (optimal_probs * (torch.log(optimal_probs) - F.log_softmax(out_probs[b, :len_sample, :], dim=-1))).sum(dim=-1).mean()
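Put together as a self-contained sketch (the names logits, q_values, and seq_lens are just placeholders here, not the PR's actual interface), the KL form would look roughly like:

```python
import torch
import torch.nn.functional as F

def ocd_kl_loss(logits, q_values, seq_lens, temp=1e-8):
    """KL(p || q) between the tempered optimal-completion targets p
    (softmax of the Q-values) and the model distribution q (softmax of the logits).

    logits:   (batch, max_len, vocab) raw model outputs
    q_values: (batch, max_len, vocab) optimal-completion Q-values
    seq_lens: list of valid lengths per batch element
    """
    loss = 0.0
    for b, length in enumerate(seq_lens):
        p = F.softmax(q_values[b, :length] / temp, dim=-1)      # target distribution
        log_q = F.log_softmax(logits[b, :length], dim=-1)       # model log-probs
        # KL(p || q) = sum_x p(x) * (log p(x) - log q(x)), averaged over time steps
        loss = loss + (p * (torch.log(p + 1e-12) - log_q)).sum(dim=-1).mean()
    return loss / len(seq_lens)
```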

xingchensong avatar May 23 '19 08:05 xingchensong

Yes, as the paper indicates, the loss they used is KL divergence; however, when performing backprop in this scenario, the two losses are actually equivalent in terms of gradient computation. Consider this: KL(p||q) = ∫ p(x) log [p(x)/q(x)] dx = ∫ p(x) log p(x) dx - ∫ p(x) log q(x) dx = H(p, q) - H(p).

So H(p, q) - KL(p||q) = H(p).

H(p), while it varies with the number of targets and the temperature τ, does not contribute to the gradient: d KL(p||q) / d q = d [H(p, q) - H(p)] / d q = d H(p, q) / d q.

So the two losses are equivalent in backprop despite having different values.

But of course H(p, q) is not a divergence, since a divergence requires D(p||q) = 0 if and only if p = q: when p = q, H(p, q) = H(p) > 0 (for any non-degenerate p), while KL(p||q) = 0.

It's true that if you really want to see how much q differs from p, KL divergence is the right loss to use. But after communicating with Alex (the owner of the repo), we decided to just ignore the H(p) term and use H(p, q).
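A quick way to convince yourself numerically (illustrative snippet, not from the repo): the gradients of H(p, q) and KL(p||q) with respect to the logits of q come out identical, since H(p) is constant in q.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
p = F.softmax(torch.randn(5), dim=-1)           # fixed target distribution
logits = torch.randn(5, requires_grad=True)     # parameters of q

# cross-entropy H(p, q)
ce = -(p * F.log_softmax(logits, dim=-1)).sum()
grad_ce = torch.autograd.grad(ce, logits)[0]

# KL(p || q) = H(p, q) - H(p); the -H(p) term is constant w.r.t. the logits
kl = (p * (torch.log(p) - F.log_softmax(logits, dim=-1))).sum()
grad_kl = torch.autograd.grad(kl, logits)[0]

print(torch.allclose(grad_ce, grad_kl))         # True
```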

Chung-I avatar May 25 '19 03:05 Chung-I

I see, thanks for your reply! There is a question I would like to consult you about: do we need to implement the backward pass ourselves when designing a new loss? Recently I was trying to reproduce CTC (which uses a dynamic programming algorithm). Existing CTC repos such as Baidu's warp-ctc not only implement the forward part but also compute the gradient by hand, yet it seems we don't need to do so in ocd_loss, so I'm confused. Should we compute the gradient ourselves?

xingchensong avatar May 25 '19 03:05 xingchensong

I think PyTorch does automatic differentiation for you.

Baidu implemented their own backward function because they wanted their own optimized version (see the DeepSpeech2 paper, page 27).
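For instance (an illustrative snippet, not from this repo), a loss built only from differentiable PyTorch ops gets its gradient from autograd automatically; writing a manual backward (e.g. via torch.autograd.Function) is only worthwhile when you need a hand-optimized kernel like warp-ctc.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10, requires_grad=True)     # model outputs (e.g. per-step vocab scores)
targets = F.softmax(torch.randn(4, 10), dim=-1)     # soft targets (e.g. an OCD distribution)

# A custom loss written only with differentiable ops: no backward to write by hand.
loss = -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
loss.backward()                                      # autograd computes logits.grad for you
print(logits.grad.shape)                             # torch.Size([4, 10])
```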

Chung-I avatar May 27 '19 08:05 Chung-I