AnhLD2610
I have the same problem. Can anyone fix it?
Both the `dist_func` and `compute_forward_kl_divergence` should be the same KL, right?
But in the paper you report that AKL gave the best result, which confuses me. So the `s2t_kd_loss` (KD in the teacher space) is forward KL and the `t2s_kd_loss` (KD...
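For reference, here is a minimal sketch of what forward KL usually means in this setting: token-level KL(teacher || student) over the vocabulary. The function name, signature, and masking below are my own illustration, not the repo's `compute_forward_kl_divergence`:

```python
import torch
import torch.nn.functional as F

def forward_kl(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               pad_mask: torch.Tensor = None) -> torch.Tensor:
    """Token-level forward KL, KL(p_teacher || q_student).

    Illustrative sketch only; names and masking are assumptions,
    not the actual DSKD implementation.
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    teacher_logprobs = F.log_softmax(teacher_logits, dim=-1)
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    # Per-token KL: sum over vocab of p_t * (log p_t - log q_s)
    kl = (teacher_probs * (teacher_logprobs - student_logprobs)).sum(-1)
    if pad_mask is not None:
        # Average over non-padding tokens only
        return (kl * pad_mask).sum() / pad_mask.sum()
    return kl.mean()
```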
Thank you for the reply
I found that training with AKL gives worse performance than normal KL. Am I missing something? Your paper reports that AKL has the best performance. This is distilling from Mistral to TinyLlama.
I have tried to reproduce your results on TinyLlama with Mistral as the teacher, but the results on S-NI and UnNI are lower than the paper reports. Am I missing something?...
I used forward KL, but adaptive KL gives roughly the same result; it doesn't change much.
I think the results may vary depending on the GPU. Can I ask for the code you used to get the SeqKD results?
Thank you for your quick and helpful response!
> These results look normal for forward KL.
>
> For AKL, you can try to modify the `.le()` to `.lt()` in the following line:
>
> https://github.com/songmzhang/DSKD/blob/57d2290ce2d448be6293d97a82c6608addf33cdb/code/criterions/various_divergence.py#L172
>
> ...
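To see what swapping `.le()` for `.lt()` changes, here is a hypothetical illustration with a head mask built from cumulative teacher probabilities. The tensor values and threshold are made up, not taken from the repo; the point is only that `.lt()` excludes a token that lands exactly on the threshold, shifting the head/tail split by one token at the boundary:

```python
import torch

# Made-up sorted (descending) teacher probabilities; values chosen as exact
# binary fractions so the cumulative sums hit the threshold exactly.
teacher_probs = torch.tensor([0.5, 0.25, 0.125, 0.0625, 0.0625])
cum_probs = teacher_probs.cumsum(-1)  # [0.5, 0.75, 0.875, 0.9375, 1.0]
threshold = 0.75                      # hypothetical head-mass cutoff

head_le = cum_probs.le(threshold)  # keeps the token that hits 0.75 exactly
head_lt = cum_probs.lt(threshold)  # drops it, so the head set is smaller

print(head_le)  # tensor([ True,  True, False, False, False])
print(head_lt)  # tensor([ True, False, False, False, False])
```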