
The training loss

Open · Vickeyhw opened this issue 1 year ago · 5 comments

Thanks for your great work! When I run the code with: python3 tools/train.py --cfg configs/imagenet/r34_r18/dot.yaml
the training loss is much larger than with the KD method in the first few epochs, and the test accuracy is also low. Is this normal? [screenshot of training log]

Vickeyhw · Nov 28 '23

The loss scale is too large. Did you change the batch size or the number of GPUs?

Zzzzz1 · Nov 29 '23
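To illustrate the loss-scale point above, here is a minimal sketch (illustrative only, not mdistiller's actual loss code; the temperature and batch values are assumptions): a distillation loss summed over the batch and divided by a hard-coded batch size changes scale whenever the effective batch changes, while a 'batchmean' reduction does not.

```python
import torch
import torch.nn.functional as F

# Illustrative only -- not mdistiller's actual loss code. The temperature
# and batch values below are assumptions for the sketch.
T = 4.0             # KD temperature (assumed)
CONFIG_BATCH = 512  # global batch size the config is presumably tuned for

def kd_loss_fixed_divisor(s_logits, t_logits):
    # 'sum' reduction plus a hard-coded divisor: the scale is only right
    # when the actual batch size equals CONFIG_BATCH.
    log_p = F.log_softmax(s_logits / T, dim=1)
    q = F.softmax(t_logits / T, dim=1)
    return F.kl_div(log_p, q, reduction="sum") * T * T / CONFIG_BATCH

def kd_loss_batchmean(s_logits, t_logits):
    # 'batchmean' divides by the actual batch size, so the scale is stable
    # regardless of per-GPU batch or GPU count.
    log_p = F.log_softmax(s_logits / T, dim=1)
    q = F.softmax(t_logits / T, dim=1)
    return F.kl_div(log_p, q, reduction="batchmean") * T * T

s, t = torch.randn(64, 1000), torch.randn(64, 1000)  # one 64-sample per-GPU batch
print(kd_loss_fixed_divisor(s, t))  # off by a factor of 64/512 vs. the intended scale
print(kd_loss_batchmean(s, t))      # invariant to batch size
```

If the effective global batch differs from what the config assumes (for example, fewer GPUs than expected), a fixed divisor like this would inflate or shrink the loss accordingly.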

@Zzzzz1 I used the original batch size of 512 on 8 2080Ti GPUs. After re-running the code, I got the following results: [screenshot of training log] It still seems unstable and much worse than vanilla KD.

Vickeyhw · Nov 29 '23

@Vickeyhw How long does one epoch take you? I find it very strange that it takes me 100 minutes to run a quarter of an epoch on 8×3090s.

JinYu1998 · Nov 30 '23

@JinYu1998 23 min/epoch.

Vickeyhw · Nov 30 '23

> @JinYu1998 23 min/epoch.

Thanks for your response. I think I've identified the problem: since my data is not on an SSD, disk I/O is slowing down training.

JinYu1998 · Nov 30 '23
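To make the I/O point concrete, here is a hedged sketch of DataLoader settings that typically help when disk reads, not GPU compute, limit throughput (FakeData stands in for the real ImageNet folder; the batch and worker counts are assumptions, not mdistiller's settings):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# FakeData stands in for the real ImageNet folder; the values below are
# assumptions for the sketch, not mdistiller's settings.
train_set = datasets.FakeData(size=1024, image_size=(3, 224, 224),
                              transform=transforms.ToTensor())

loader = DataLoader(
    train_set,
    batch_size=64,            # per-GPU batch (512 global / 8 GPUs)
    shuffle=True,
    num_workers=8,            # parallel decode/augment workers hide read latency
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)

for images, labels in loader:
    pass  # training step would go here
```

More workers only help up to the disk's read bandwidth, so when the dataset sits on a slow HDD, moving it to a local SSD, as noted above, is usually the real fix.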