
Loss function explodes under default settings

Open · yangky11 opened this issue Mar 20 '16 · 4 comments

Hi all,

I'm training on CIFAR-10 following the instructions, but in most cases the loss explodes during the first few iterations.
Below is a run where it luckily didn't explode. I don't know whether others see this too; maybe the default initial learning rate of 0.1 is too large?

loss              progress
12.243277549744
314.09014892578   128/50000
684.17578125      256/50000
1731.8382568359   384/50000
1436.0552978516   512/50000
1810.4338378906   640/50000
2016.3845214844   768/50000
1415.6356201172   896/50000
980.73388671875   1024/50000
404.52484130859   1152/50000
235.41812133789   1280/50000
162.14950561523   1408/50000
203.14471435547   1536/50000
157.7633972168    1664/50000
153.45094299316   1792/50000
127.98012542725   1920/50000
81.274276733398   2048/50000
52.629417419434   2176/50000
28.258670806885   2304/50000
12.342067718506   2432/50000
6.292441368103    2560/50000
3.0711505413055   2688/50000
2.4665925502777   2816/50000
2.3633861541748   2944/50000
2.3024611473083   3072/50000
2.3726959228516   3200/50000
2.3351118564606   3328/50000
2.3633522987366   3456/50000
2.3602793216705   3584/50000
2.3885579109192   3712/50000
2.3737788200378   3840/50000

yangky11 · Mar 20 '16 13:03

Got the same issue.

hli2020 · Apr 05 '16 10:04

Hm, interesting. You may need to adjust the learning rate; it certainly isn't supposed to explode like that at the start. It's normal for the loss to increase a little at first (from 2 to 3 or so), but it shouldn't explode. (Using RMSprop, for example, causes the loss to explode.) If 0.1 is too hot for the first few iterations, a brief warmup might help, as in the sketch below.
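A rough sketch of such a warmup, assuming a training loop that passes an sgdState table to optim.sgd; the epoch bookkeeping here is assumed rather than copied from the repo:

```lua
-- Rough sketch of a one-epoch warmup. Assumes a training loop that
-- passes an sgdState table to optim.sgd (learningRate is optim's own
-- field name); the `epoch` counter is assumed, not copied from the repo.
if epoch == 1 then
   sgdState.learningRate = 0.01  -- gentler start
else
   sgdState.learningRate = 0.1   -- the usual initial rate
end
```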

I posted my loss logs in the table on the front page if you're interested. Here's an example for the Nsize=3 (20-layer) network that eventually reaches 0.0829 error: https://mjw-xi8mledcnyry.s3.amazonaws.com/experiments/201601141709-AnY56THQt7/Training%20loss.csv

gcr · Apr 05 '16 12:04

Problem solved: some conv and BN layers were not being initialized (in train-cifar.lua and residual-layers.lua). I spent a whole day debugging why the loss wasn't decreasing (the net just guessed randomly, with top-1 accuracy stuck at 0.1 through every epoch) before finally tracking down the issue. Sorry, I'm a rookie with Torch.
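For anyone else hitting this, here is a minimal sketch of explicit initialization (He/MSR init for conv weights, gamma=1 and beta=0 for batch norm). The module class names are assumptions; swap in the cudnn variants if that's what the model actually constructs:

```lua
require 'nn'

-- Minimal sketch: explicit He/MSR initialization for conv layers and
-- unit-gamma / zero-beta for batch norm. Module class names are assumed;
-- adjust to cudnn.SpatialConvolution etc. if the model uses those.
local function msrInit(model)
   for _, m in ipairs(model:findModules('nn.SpatialConvolution')) do
      local n = m.kW * m.kH * m.nOutputPlane
      m.weight:normal(0, math.sqrt(2 / n))  -- He et al. (2015) init
      if m.bias then m.bias:zero() end
   end
   for _, m in ipairs(model:findModules('nn.SpatialBatchNormalization')) do
      m.weight:fill(1)  -- gamma
      m.bias:zero()     -- beta
   end
end
```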

hli2020 · Apr 05 '16 14:04

Oops, sorry! Glad you found the issue. Should we add some initialization code to keep others from being bitten? When I ran the experiments in January they worked; I wonder if Torch's default initialization has changed since then, requiring the code to be more explicit about it. Something like the snippet below could go right after model construction.
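A hypothetical usage sketch, building on the msrInit function above; `buildModel` is a placeholder for whatever constructor train-cifar.lua actually uses:

```lua
-- Hypothetical usage: run the explicit init from the sketch above
-- before the training loop starts. buildModel is a placeholder name.
local model = buildModel(opt)
msrInit(model)
```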

gcr · Apr 05 '16 23:04