
The training loss

Open · Vickeyhw opened this issue 1 year ago · 5 comments

Thanks for your great work! When I run the code with: python3 tools/train.py --cfg configs/imagenet/r34_r18/dot.yaml
the training loss is much larger than with the KD method in the first few epochs, and the test accuracy is also low. Is this normal? [screenshot of training log]

Vickeyhw · Nov 28 '23

The loss scale is too large. Did you change the batch size or the number of GPUs?

Zzzzz1 · Nov 29 '23
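To illustrate the loss-scale point above, here is a minimal sketch (illustrative only, not mdistiller's actual loss code; the temperature and batch values are assumptions): a distillation loss summed over the batch and divided by a hard-coded batch size changes scale whenever the effective batch changes, while a 'batchmean' reduction does not.

```python
import torch
import torch.nn.functional as F

# Illustrative only -- not mdistiller's actual loss code. The temperature
# and batch values below are assumptions for the sketch.
T = 4.0             # KD temperature (assumed)
CONFIG_BATCH = 512  # global batch size the config is presumably tuned for

def kd_loss_fixed_divisor(s_logits, t_logits):
    # 'sum' reduction plus a hard-coded divisor: the scale is only right
    # when the actual batch size equals CONFIG_BATCH.
    log_p = F.log_softmax(s_logits / T, dim=1)
    q = F.softmax(t_logits / T, dim=1)
    return F.kl_div(log_p, q, reduction="sum") * T * T / CONFIG_BATCH

def kd_loss_batchmean(s_logits, t_logits):
    # 'batchmean' divides by the actual batch size, so the scale is stable
    # regardless of per-GPU batch or GPU count.
    log_p = F.log_softmax(s_logits / T, dim=1)
    q = F.softmax(t_logits / T, dim=1)
    return F.kl_div(log_p, q, reduction="batchmean") * T * T

s, t = torch.randn(64, 1000), torch.randn(64, 1000)  # one 64-sample per-GPU batch
print(kd_loss_fixed_divisor(s, t))  # off by a factor of 64/512 vs. the intended scale
print(kd_loss_batchmean(s, t))      # invariant to batch size
```

If the effective global batch differs from what the config assumes (for example, fewer GPUs than expected), a fixed divisor like this would inflate or shrink the loss accordingly.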

@Zzzzz1 I used the original batch size of 512 on 8 2080Ti GPUs. After re-running the code, I got the following results: [screenshot of training log] It still seems unstable and much worse than vanilla KD.

Vickeyhw · Nov 29 '23

@Vickeyhw How long does one epoch take you? I find it very strange that it takes me 100 minutes to run a quarter of an epoch on 8×3090s.

JinYu1998 · Nov 30 '23

@JinYu1998 23 min/epoch.

Vickeyhw · Nov 30 '23

> @JinYu1998 23 min/epoch.

Thanks for your response. I think I've identified the problem: since my data is not on an SSD, disk I/O is slowing down training.

JinYu1998 · Nov 30 '23
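To make the I/O point concrete, here is a hedged sketch of DataLoader settings that typically help when disk reads, not GPU compute, limit throughput (FakeData stands in for the real ImageNet folder; the batch and worker counts are assumptions, not mdistiller's settings):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# FakeData stands in for the real ImageNet folder; the values below are
# assumptions for the sketch, not mdistiller's settings.
train_set = datasets.FakeData(size=1024, image_size=(3, 224, 224),
                              transform=transforms.ToTensor())

loader = DataLoader(
    train_set,
    batch_size=64,            # per-GPU batch (512 global / 8 GPUs)
    shuffle=True,
    num_workers=8,            # parallel decode/augment workers hide read latency
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)

for images, labels in loader:
    pass  # training step would go here
```

More workers only help up to the disk's read bandwidth, so when the dataset sits on a slow HDD, moving it to a local SSD, as noted above, is usually the real fix.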