[Bug] Custom Distillation MMSeg CWD loss nan problem
Describe the bug
I trained the segnext_l model as a standard teacher on my data and am using the resulting checkpoint for distillation (mmseg/cwd) from segnext_l to segnext_tiny. A few iterations after training starts, all losses become NaN and stay NaN for every subsequent iteration.
I am using the latest versions of all packages.
The student model's results also remain 0.
I faced the same problem. After several experiments, I believe the cause is a learning rate that is too large, which leads to gradient explosion. The mmrazor schedule modules do learning-rate warmup, which first ramps the lr up before the schedule decays it, so the peak lr can be higher than you expect. Lowering my learning rate fixed it for me. I hope this helps.
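To make the suggestion concrete, here is a hedged sketch of the kind of change meant above, written in the mmcv-style `optimizer`/`lr_config` convention. The specific values (halving a hypothetical `6e-5` default, the warmup length) are illustrative assumptions, not the values from any particular mmrazor config; adapt them to your own schedule file.

```python
# Illustrative schedule tweak (assumed mmcv-style keys): lower the base lr so
# that even the post-warmup peak stays below the point where CWD gradients blow up.
optimizer = dict(type='AdamW', lr=3e-5, weight_decay=0.01)  # e.g. half of a 6e-5 default

lr_config = dict(
    policy='poly',
    warmup='linear',     # linear warmup ramps lr from warmup_ratio * lr up to lr
    warmup_iters=1500,
    warmup_ratio=1e-6,
    power=1.0,
    min_lr=0.0,
    by_epoch=False,
)
```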
I am also trying to distill a custom-trained segnext-l to segnext-t, and I still get NaN after making the lr small. This is with the KL loss:

```python
distill_losses=dict(
    loss_kl=dict(type='KLDivergence', tau=1, reduction='mean', loss_weight=0.1)
),
```

Curious if you have figured out a definitive fix.
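One knob worth noting in that config is the temperature `tau`. The toy sketch below (plain Python, not the mmrazor implementation) shows why a sharp teacher distribution at `tau=1` can yield a large KL against an early, near-uniform student, while a higher temperature softens both distributions and shrinks the loss; the logit values are made up for illustration.

```python
import math

def softmax(xs, tau=1.0):
    # Temperature-scaled softmax; the max-shift keeps exp() from overflowing.
    m = max(x / tau for x in xs)
    exps = [math.exp(x / tau - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q, eps=1e-12):
    # KL(p || q); eps guards against log(0), one common source of NaN.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

teacher_logits = [8.0, 1.0, 0.5]   # a confident teacher
student_logits = [0.1, 0.2, 0.3]   # an early-training, near-uniform student

for tau in (1.0, 4.0):
    loss = kl_div(softmax(teacher_logits, tau), softmax(student_logits, tau))
    print(f'tau={tau}: KL={loss:.4f}')  # higher tau -> smaller, smoother loss
```

This is only a magnitude illustration; in CWD the softmax is taken channel-wise over feature maps, but the same temperature intuition applies.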
@cravies How many GPUs did you use to train your model? If you have fewer GPUs than the reference config assumes, you need to scale the learning rate with the linear scaling rule, as described in mmdet. Adding `optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))` may also relieve the problem. By the way, in my tests KL divergence is much less stable than MSE; if you want to do feature distillation, I'd advise MSE rather than KL divergence.
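The two suggestions above can be sketched together. The linear scaling rule multiplies the base lr by the ratio of your effective batch size to the one the config was tuned for; the batch sizes below are made-up examples, and the `optimizer_config` line repeats the mmcv-style gradient-clipping override suggested above.

```python
def scale_lr(base_lr, base_batch, num_gpus, samples_per_gpu):
    # Linear scaling rule: lr scales with effective batch size
    # (num_gpus * samples_per_gpu) relative to the reference batch size.
    return base_lr * (num_gpus * samples_per_gpu) / base_batch

# e.g. a config tuned for 8 GPUs x 2 imgs/GPU (batch 16), run on 2 GPUs x 2 imgs/GPU
print(scale_lr(0.01, base_batch=16, num_gpus=2, samples_per_gpu=2))  # 0.0025

# Gradient clipping override (mmcv-style); _delete_ replaces the inherited value.
optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))
```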