
spikes

Open notconvergingwtf opened this issue 5 years ago • 8 comments

Hi, do you have any suggestions on the following problem? While training sdu (nadam, lr=0.00025), this is the loss on the validation set: [image] A different model on the same training data was fine. Also, while training, loss value = nan starts to appear.

notconvergingwtf avatar Feb 27 '19 17:02 notconvergingwtf

I just set network.sdu.net_coherent = True and revised line 579 of sym_heatmap.py to coherent_weight = 0.001; it seems this nan problem can be solved.
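Here is a rough sketch of the two edits (the config layout here is just illustrative, not copied exactly from the repo; only the two values come from this thread):

from easydict import EasyDict as edict

# config.py -- enable the coherent branch for the sdu network
network = edict()
network.sdu = edict()
network.sdu.net_coherent = True

# sym_heatmap.py, around line 579 -- shrink the coherent-loss weight so the
# auxiliary term stays small and the total loss no longer goes to NaN
coherent_weight = 0.001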

deepinx avatar Mar 06 '19 08:03 deepinx

Okay, thanks. Sorry, but how did you manage to figure this out? It seems that network.sdu.net_coherent = True stands for keeping only the image transformations that don't affect the heatmap? How does this affect accuracy?

notconvergingwtf avatar Mar 06 '19 10:03 notconvergingwtf

I did this following the guidance of the original paper, which says: "Therefore, we employ the CE loss for Lp-g and the MSE loss for Lp-p, respectively. λ is empirically set as 0.001 to guarantee convergence."
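Roughly, the weighting looks like this (the names are just for illustration; λ corresponds to coherent_weight):

def total_loss(ce_heatmap_loss, mse_coherent_loss, lam=0.001):
    # Weighted sum as in the quoted passage: the CE heatmap term dominates,
    # while the MSE (coherent) term is scaled by a small lambda so it cannot
    # blow up and drag the loss to NaN.
    return ce_heatmap_loss + lam * mse_coherent_loss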

deepinx avatar Mar 06 '19 11:03 deepinx

Big thanks

notconvergingwtf avatar Mar 06 '19 12:03 notconvergingwtf

Hi, it's me again. After some training time, here is what I have: [image] It doesn't look like overfitting on the training set, maybe some problems with convergence. Have you met the same problem?

notconvergingwtf avatar Mar 11 '19 10:03 notconvergingwtf

What batch size and lr do you use? You can try a different batch size or lr; perhaps that will solve your problem.

deepinx avatar Mar 11 '19 14:03 deepinx

Batch size is 16. The lr's are 1e-10 and 2e-6 (on the screenshot). Well, as you can see, decreasing the lr only delays the time until the spikes appear.

notconvergingwtf avatar Mar 11 '19 14:03 notconvergingwtf

I used batch size 16 and lr 0.00002 for the first several epochs. The spikes did not appear. You can try the following commands:

# train the sdu network on GPU 0; the lr drops at the listed iteration steps
# and all output is redirected to the log file
NETWORK='sdu'
MODELDIR='./model_2d'
mkdir -p "$MODELDIR"
PREFIX="$MODELDIR/$NETWORK"
LOGFILE="$MODELDIR/log_$NETWORK"

CUDA_VISIBLE_DEVICES='0' python -u train.py --network "$NETWORK" --prefix "$PREFIX" --per-batch-size 16 --lr 0.00002 --lr-step '16000,24000,30000' > "$LOGFILE" 2>&1 &

If this problem still appears, you may check the network parameters in config.py.
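For example, a quick sanity list of the settings discussed in this thread (the keys below are placeholders, not necessarily the exact names used in config.py):

# Illustrative pre-flight check; the keys are stand-ins for whatever
# config.py actually calls these settings.
settings_to_verify = {
    'per_batch_size': 16,            # batch size that worked in this thread
    'lr': 0.00002,                   # initial learning rate without spikes
    'lr_step': '16000,24000,30000',  # iteration counts where lr drops
    'net_coherent': True,            # coherent branch enabled (see above)
    'coherent_weight': 0.001,        # small weight that avoided the NaNs
}

for name, expected in settings_to_verify.items():
    print(name, '->', expected)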

deepinx avatar Mar 11 '19 15:03 deepinx