AnhLD2610

12 comments by AnhLD2610

I have the same problem. Can anyone fix it?

![image](https://github.com/user-attachments/assets/bfd373e0-edba-4e49-9635-88e5e4c4b2b6) Both `dist_func` and `compute_forward_kl_divergence` should be the same KL, right?
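For reference, token-level forward KL (teacher distribution as the target, i.e. KL(teacher ‖ student)) is usually computed like the sketch below. This is a minimal illustration, not the repo's actual code; the function name and the `temperature` argument are my own:

```python
import torch
import torch.nn.functional as F

def forward_kl(student_logits, teacher_logits, temperature=1.0):
    """Forward KL: KL(teacher || student), the standard token-level KD loss.

    Both inputs: (batch, seq_len, vocab). Returns a scalar loss.
    """
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = t_logprobs.exp()
    # sum p * (log p - log q) over the vocab, then average over tokens
    return (t_probs * (t_logprobs - s_logprobs)).sum(-1).mean()
```

With identical logits this is zero, and it is non-negative otherwise, which is a quick sanity check when comparing two supposedly equivalent KL implementations.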

![image](https://github.com/user-attachments/assets/026eb53c-259a-4762-9ed9-595fba778134) But in the paper you report that AKL gave the best result, which confuses me. So `s2t_kd_loss` (KD in the teacher space) is the forward direction and `t2s_kd_loss` (kd...

Thank you for the reply

I found that training with AKL performs worse than plain KL (distilling from Mistral to TinyLlama). Am I missing something? Your paper reports that AKL has the best performance.

![image](https://github.com/user-attachments/assets/d4f4c34b-14f5-4d02-979b-63e7ce1088d3) I have tried to reproduce your result with TinyLlama as the student and Mistral as the teacher, but the results on S-NI and UnNI are lower than the paper reports. Am I missing something,...

I use the forward KL, but the adaptive KL gives about the same result; it doesn't change much.

I think the results may vary depending on the GPU. Can I ask for the code you used to get the SeqKD results?

Thank you for your quick and helpful response!

> These results look normal for forward KL.
>
> For AKL, you can try to modify the `.le()` to `.lt()` in the following line:
>
> https://github.com/songmzhang/DSKD/blob/57d2290ce2d448be6293d97a82c6608addf33cdb/code/criterions/various_divergence.py#L172
>
> ...
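For context on why `.le()` vs `.lt()` can matter: when a mask is built by comparing a cumulative teacher probability against a threshold (the kind of head/tail split adaptive-KL-style losses use), the comparison operator decides which side the boundary token lands on. A toy sketch of that effect; this is my own illustration under that assumption, not the actual DSKD code at the linked line:

```python
import torch

def head_mask(teacher_probs, threshold=0.5, strict=False):
    """Mark the 'head' tokens whose cumulative (sorted) probability
    stays within `threshold`. `strict=True` uses .lt() instead of .le(),
    which moves the token that exactly hits the threshold into the tail."""
    sorted_probs, idx = teacher_probs.sort(dim=-1, descending=True)
    cumsum = sorted_probs.cumsum(dim=-1)
    mask_sorted = cumsum.lt(threshold) if strict else cumsum.le(threshold)
    # scatter the mask back to the original (unsorted) vocab order
    mask = torch.zeros_like(mask_sorted)
    mask.scatter_(-1, idx, mask_sorted)
    return mask
```

For a distribution like `[0.5, 0.3, 0.2]` with `threshold=0.5`, `.le()` keeps the first token in the head while `.lt()` excludes it, so the two operators genuinely change which tokens each KL term is applied to at the boundary.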