
[Bug] Custom Distillation MMSeg CWD loss nan problem

Open Priyanshu88 opened this issue 2 years ago • 3 comments

Describe the bug

I trained the segnext_l model as a standard teacher on my data and used the resulting checkpoint for distillation (mmseg/cwd) from segnext_l to segnext_tiny. After the first few iterations, all losses become nan and stay nan for every subsequent iteration.

I am using the latest versions of all packages.

[screenshot: training log showing all losses as nan]

The student model's results also remain 0.

Priyanshu88 avatar Apr 12 '24 07:04 Priyanshu88

I faced the same problem. After running several experiments, I believe the cause is a learning rate that is too large, which leads to gradient explosion. The mmrazor schedule modules use learning-rate warmup, which first increases the lr before decaying it, so the peak lr can be higher than you expect. After I reduced my learning rate it worked. I hope this helps.
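A hedged sketch of this fix as an mmengine-style optimizer config. The exact keys depend on your base config; the `lr=1e-4`, AdamW settings, warmup length, and scheduler types below are illustrative assumptions, not the values from the original segnext schedule:

```python
# Illustrative mmengine-style config fragment: shrink the base lr so the
# post-warmup peak stays small enough to avoid gradient explosion.
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=1e-4, weight_decay=0.01),  # lowered lr (assumed value)
    # Gradient clipping is an extra safety net against NaN losses.
    clip_grad=dict(max_norm=35, norm_type=2),
)

# A shorter, gentler warmup also keeps the early lr from spiking.
param_scheduler = [
    dict(type='LinearLR', start_factor=1e-3, by_epoch=False, begin=0, end=500),
    dict(type='PolyLR', power=1.0, begin=500, end=40000, by_epoch=False),
]
```

If the NaN only appears once warmup finishes, lowering the optimizer's base `lr` (rather than the warmup `start_factor`) is the knob that matters.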

tori-hotaru avatar Oct 09 '24 07:10 tori-hotaru

I am also trying to distill a custom-trained segnext-l into segnext-t. I still get NaN after reducing the lr. This is with the KL loss:

    distill_losses=dict(
        loss_kl=dict(type='KLDivergence', tau=1, reduction='mean', loss_weight=0.1)
    ),

Curious if you guys figured out a definitive fix.

cravies avatar Aug 11 '25 02:08 cravies

@cravies How many GPUs did you use to train your model? If you have fewer GPUs than the reference config assumes, you need to scale the learning rate with the linear scaling rule described in mmdet. Adding `optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))` may also relieve the problem. By the way, in my tests KL divergence is much less stable than MSE. If you want to do feature distillation, I advise using MSE rather than KL divergence.
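The instability mentioned above is easy to reproduce outside mmrazor. A minimal numpy sketch (toy logits chosen for illustration, not taken from segnext) showing that KL divergence explodes when the student assigns near-zero probability to the teacher's top class, while MSE on the same probabilities stays bounded:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy logits: the student puts almost no mass on the teacher's top class.
teacher = softmax(np.array([10.0, 0.0, 0.0]))
student = softmax(np.array([-30.0, 0.0, 0.0]))

# KL(teacher || student) blows up because log(student) is about -30
# on the class where the teacher concentrates nearly all its mass.
kl = float(np.sum(teacher * (np.log(teacher) - np.log(student))))

# MSE on probabilities is bounded (each term is at most 1), so its
# gradients stay tame even under a large teacher/student mismatch.
mse = float(np.mean((teacher - student) ** 2))

print(f"KL  = {kl:.2f}")   # grows linearly with the logit gap
print(f"MSE = {mse:.2f}")  # stays below 1
```

This is why gradient clipping and a smaller lr help with KL-based distillation: early in training the student is badly mismatched with the teacher, which is exactly the regime where the KL term and its gradients are largest.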

tori-hotaru avatar Aug 11 '25 07:08 tori-hotaru