Training on VOC from Scratch
Hi,
I am attempting to train this network on VOC from scratch, essentially trying to recreate the pre-trained weights available for download; however, after 70+ epochs, my model is still predicting only background, for an mIoU of 3.49%. Here is the command I am running to train:
```
python issegm/voc.py --gpus 1,2,3 --split train --data-root data/VOCdevkit/ --output train_out/ --model voc_rna-a1_cls21 --batch-images 12 --crop-size 500 --origin-size 2048 --scale-rate-range 0.7,1.3 --lr-type fixed --base-lr 0.0016 --to-epoch 140 --kvstore local --prefetch-threads 4 --prefetcher thread --backward-do-mirror
```
Inside data/VOCdevkit/VOC2012 I have the original download of JPEGImages and SegmentationClass, which provides the full-color segmentation masks (a quick check of how those masks decode is included below). Any help would be much appreciated.
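For reference, this is the quick sanity check I did on the mask format (a minimal sketch assuming PIL and NumPy; the filename is just an example). The VOC SegmentationClass PNGs are palette-indexed, so loading them with PIL should already give class IDs rather than RGB colors:

```python
import numpy as np
from PIL import Image

# VOC SegmentationClass masks are palette-indexed ("P" mode) PNGs, so
# converting to a NumPy array yields class IDs directly, not RGB triples.
mask = Image.open("data/VOCdevkit/VOC2012/SegmentationClass/2007_000032.png")
print(mask.mode)                   # expect "P" (palette-indexed)

labels = np.array(mask)
print(labels.shape, labels.dtype)  # (H, W), uint8 -- no color channel
print(np.unique(labels))           # expect values in 0..20 plus 255 (ignore label)
```

If a loader reads these masks with OpenCV instead, it gets a 3-channel color image and the class IDs are lost, which is one way training can silently degenerate into predicting only background.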
Here's a snippet of output that may or may not help, showing fcn_valid fluctuating quite a bit. I'm not entirely sure what this output means, so an explanation of what it is would also be useful.
```
2019-04-11 15:00:09,073 Host Epoch[78] Batch [66-67] Speed: 11.93 samples/sec fcn_valid=0.623302
2019-04-11 15:00:10,056 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:10,058 Host Labels: 0 0.6 -1.0
Waited for 2.59876251221e-05 seconds
2019-04-11 15:00:10,075 Host Epoch[78] Batch [67-68] Speed: 11.98 samples/sec fcn_valid=0.644102
2019-04-11 15:00:10,076 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:11,055 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:11,056 Host Labels: 0 0.6 -1.0
Waited for 3.50475311279e-05 seconds
2019-04-11 15:00:11,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:11,077 Host Epoch[78] Batch [68-69] Speed: 11.98 samples/sec fcn_valid=0.632405
2019-04-11 15:00:12,056 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:12,058 Host Labels: 0 0.6 -1.0
Waited for 2.50339508057e-05 seconds
2019-04-11 15:00:12,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:12,077 Host Epoch[78] Batch [69-70] Speed: 12.00 samples/sec fcn_valid=0.775874
2019-04-11 15:00:13,057 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:13,058 Host Labels: 0 0.6 -1.0
Waited for 2.59876251221e-05 seconds
2019-04-11 15:00:13,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:13,077 Host Epoch[78] Batch [70-71] Speed: 12.01 samples/sec fcn_valid=0.562744
2019-04-11 15:00:14,056 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:14,058 Host Labels: 0 0.6 -1.0
Waited for 0.000184059143066 seconds
2019-04-11 15:00:14,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:14,075 Host Epoch[78] Batch [71-72] Speed: 12.03 samples/sec fcn_valid=0.552027
```
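Since I couldn't interpret fcn_valid directly, I compared checkpoints by computing mIoU offline. Roughly, it's the standard confusion-matrix computation (my own sketch below, not the repo's evaluation code; 21 classes, 255 treated as ignore):

```python
import numpy as np

NUM_CLASSES = 21  # background + 20 VOC object classes
IGNORE = 255      # VOC ignore label (the white object boundaries)

def update_confusion(conf, pred, label):
    """Accumulate an NxN confusion matrix from one image's prediction and label maps."""
    valid = label != IGNORE
    idx = NUM_CLASSES * label[valid].astype(np.int64) + pred[valid].astype(np.int64)
    conf += np.bincount(idx, minlength=NUM_CLASSES ** 2).reshape(NUM_CLASSES, NUM_CLASSES)
    return conf

def mean_iou(conf):
    """Per-class IoU = TP / (TP + FP + FN), averaged over classes that appear at all."""
    tp = np.diag(conf).astype(np.float64)
    denom = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = tp / np.maximum(denom, 1)
    return iou, iou[denom > 0].mean()
```

A model that predicts nothing but background gets a non-zero IoU only for the background class, so its mIoU ends up in the low single digits, which matches the 3.49% I was seeing.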
After much debugging, I found that part of my issue was apparently that I was training without initializing the weights, so my predictions quickly degenerated into NaNs. I decided to retrain, this time initializing from the ImageNet weights, like so:
```
Host start with arguments Namespace(backward_do_mirror=True, base_lr=0.0016, batch_images=12, cache_images=None, check_start=1, check_step=4, crop_size=500, data_root='data/VOCdevkit/', dataset=None, debug=False, from_epoch=0, gpus='1,2,3', kvstore='local', log_file='voc_rna-a1_cls21.log', lr_steps=None, lr_type='fixed', model='voc_rna-a1_cls21', origin_size=2048, output='train+_out/', phase='train', prefetch_threads=4, prefetcher='thread', save_predictions=False, save_results=True, scale_rate_range='0.7,1.3', split='train+', stop_epoch=None, test_flipping=False, test_scales=None, test_steps=1, to_epoch=500, weight_decay=0.0005, weights='models/ilsvrc-cls_rna-a_cls1000_ep-0001.params')
```
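Before restarting I also verified that the NaNs had actually made it into the saved checkpoint, roughly like this (a sketch assuming MXNet's .params format; the path and epoch number are just examples from my first run):

```python
import mxnet as mx
import numpy as np

# mx.nd.load on a .params file returns a dict of name -> NDArray.
# Flag any parameter tensor that contains non-finite values.
params = mx.nd.load("train_out/voc_rna-a1_cls21_ep-0070.params")
for name, arr in params.items():
    values = arr.asnumpy()
    if not np.isfinite(values).all():
        print("non-finite values in", name)
```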
Meanwhile, I ran validation on the validation and train+ sets every 5 epochs to track training progress. Performance on the validation set began to stabilize around epoch 250 at roughly 45 mIoU, so I then started reducing the learning rate, resuming from the epoch-240 checkpoint like so:
```
2019-04-19 16:52:48,408 Host start with arguments Namespace(backward_do_mirror=True, base_lr=0.0016, batch_images=12, cache_images=None, check_start=1, check_step=4, crop_size=500, data_root='data/VOCdevkit/', dataset=None, debug=False, from_epoch=240, gpus='1,2,3', kvstore='local', log_file='voc_rna-a1_cls21.log', lr_steps=None, lr_type='linear', model='voc_rna-a1_cls21', origin_size=2048, output='train+_outp2/', phase='train', prefetch_threads=4, prefetcher='thread', save_predictions=False, save_results=True, scale_rate_range='0.7,1.3', split='train+', stop_epoch=None, test_flipping=False, test_scales=None, test_steps=1, to_epoch=500, weight_decay=0.0005, weights='train+_out/voc_rna-a1_cls21_ep-0240.params')
```
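For what it's worth, my understanding of the linear schedule is that it just ramps the rate from base_lr down toward zero over the remaining training; conceptually something like the sketch below (my own approximation, not the exact logic in issegm/voc.py):

```python
def linear_lr(base_lr, cur_update, max_updates):
    """Linear decay: start at base_lr and reach 0 at max_updates."""
    return base_lr * max(0.0, 1.0 - float(cur_update) / max_updates)

# Example: with base_lr = 0.0016, the rate falls to half at the midpoint of
# the schedule and approaches zero near the end.
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(frac, linear_lr(0.0016, frac * 1000, 1000))
```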
Now, after a total of about 410 epochs (with the learning rate being reduced from epoch 240 onward), I am still only reaching a maximum of 54.77 mIoU on the validation set, which is far below the results reported in the paper. Any advice on how to improve would be greatly appreciated.
Hi @mcever, I'm also trying to reproduce the results on the VOC 2012 dataset. Have you managed to reproduce the results reported in the paper? If so, could you share your training command?