
Training on VOC from Scratch

Open mcever opened this issue 5 years ago • 2 comments

Hi,

I am attempting to train this network on VOC from scratch, essentially trying to recreate the pre-trained weights available for download; however, after 70+ epochs, my model is still just predicting background for an mIOU of 3.49%. Here is the command I am running to train:

python issegm/voc.py --gpus 1,2,3 --split train --data-root data/VOCdevkit/ --output train_out/ --model voc_rna-a1_cls21 --batch-images 12 --crop-size 500 --origin-size 2048 --scale-rate-range 0.7,1.3 --lr-type fixed --base-lr 0.0016 --to-epoch 140 --kvstore local --prefetch-threads 4 --prefetcher thread --backward-do-mirror

Inside data/VOCdevkit/VOC2012 I have the original download of JPEGImages and SegmentationClass, which provides the full color segmentation images. Any help would be much appreciated.
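One thing worth double-checking with those "full color" segmentation images: VOC's SegmentationClass PNGs are palette-mode ("P") files, so the raw pixel values are already class indices (0 = background, 255 = ignore). If they get decoded as RGB and the colors are used directly, every label looks wrong to the loss. A minimal check (this helper is my own sketch, not part of issegm/voc.py):

```python
# Hypothetical helper: load a VOC segmentation mask as class indices.
# VOC masks are palette-mode PNGs; np.array() on a "P" image yields the
# palette indices themselves, which are the class labels (255 = ignore).
import numpy as np
from PIL import Image

def load_voc_label(path):
    """Return a 2-D uint8 array of class indices for a VOC mask."""
    img = Image.open(path)
    assert img.mode == "P", "expected a palette-mode PNG, not RGB"
    return np.array(img, dtype=np.uint8)
```

If the assertion fires, the masks were probably re-saved as RGB somewhere in preprocessing, which would explain all-background predictions.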

Here's a snippet of output that may or may not help, showing fcn_valid moving a lot. I'm not entirely sure what the output means, so any explanation on what it is could be useful.

2019-04-11 15:00:09,073 Host Epoch[78] Batch [66-67] Speed: 11.93 samples/sec fcn_valid=0.623302
2019-04-11 15:00:10,056 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:10,058 Host Labels: 0 0.6 -1.0
Waited for 2.59876251221e-05 seconds
2019-04-11 15:00:10,075 Host Epoch[78] Batch [67-68] Speed: 11.98 samples/sec fcn_valid=0.644102
2019-04-11 15:00:10,076 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:11,055 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:11,056 Host Labels: 0 0.6 -1.0
Waited for 3.50475311279e-05 seconds
2019-04-11 15:00:11,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:11,077 Host Epoch[78] Batch [68-69] Speed: 11.98 samples/sec fcn_valid=0.632405
2019-04-11 15:00:12,056 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:12,058 Host Labels: 0 0.6 -1.0
Waited for 2.50339508057e-05 seconds
2019-04-11 15:00:12,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:12,077 Host Epoch[78] Batch [69-70] Speed: 12.00 samples/sec fcn_valid=0.775874
2019-04-11 15:00:13,057 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:13,058 Host Labels: 0 0.6 -1.0
Waited for 2.59876251221e-05 seconds
2019-04-11 15:00:13,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:13,077 Host Epoch[78] Batch [70-71] Speed: 12.01 samples/sec fcn_valid=0.562744
2019-04-11 15:00:14,056 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:14,058 Host Labels: 0 0.6 -1.0
Waited for 0.000184059143066 seconds
2019-04-11 15:00:14,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:14,075 Host Epoch[78] Batch [71-72] Speed: 12.03 samples/sec fcn_valid=0.552027

mcever avatar Apr 11 '19 21:04 mcever

After much debugging, I found that part of my issue was apparently that I was training without initializing the weights, so my predictions quickly degenerated into NaNs. I decided to retrain, initializing with the ImageNet weights like so:

Host start with arguments Namespace(backward_do_mirror=True, base_lr=0.0016, batch_images=12, cache_images=None, check_start=1, check_step=4, crop_size=500, data_root='data/VOCdevkit/', dataset=None, debug=False, from_epoch=0, gpus='1,2,3', kvstore='local', log_file='voc_rna-a1_cls21.log', lr_steps=None, lr_type='fixed', model='voc_rna-a1_cls21', origin_size=2048, output='train+_out/', phase='train', prefetch_threads=4, prefetcher='thread', save_predictions=False, save_results=True, scale_rate_range='0.7,1.3', split='train+', stop_epoch=None, test_flipping=False, test_scales=None, test_steps=1, to_epoch=500, weight_decay=0.0005, weights='models/ilsvrc-cls_rna-a_cls1000_ep-0001.params')
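For anyone else hitting the same NaN collapse, a quick sanity check that could save some debugging time (a hypothetical helper of my own, not part of issegm/voc.py) is to scan a loaded parameter dict for non-finite values before resuming training:

```python
# Hypothetical sanity check: find parameters containing NaN or Inf.
# `params` is assumed to be a mapping of name -> numpy array, e.g. the
# result of converting a loaded .params checkpoint to numpy.
import numpy as np

def find_bad_params(params):
    """Return the names of arrays that contain NaN or Inf values."""
    return [name for name, arr in params.items()
            if not np.all(np.isfinite(arr))]
```

If this returns a non-empty list for a checkpoint, resuming from it will just propagate the NaNs, so you'd want to restart from the last finite checkpoint instead.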

Meanwhile, I ran validation on the validation and train+ sets every 5 epochs to track training progress. Performance on the validation set began to stabilize around epoch 250 at roughly 45 mIOU, so I then began reducing the learning rate like so:

2019-04-19 16:52:48,408 Host start with arguments Namespace(backward_do_mirror=True, base_lr=0.0016, batch_images=12, cache_images=None, check_start=1, check_step=4, crop_size=500, data_root='data/VOCdevkit/', dataset=None, debug=False, from_epoch=240, gpus='1,2,3', kvstore='local', log_file='voc_rna-a1_cls21.log', lr_steps=None, lr_type='linear', model='voc_rna-a1_cls21', origin_size=2048, output='train+_outp2/', phase='train', prefetch_threads=4, prefetcher='thread', save_predictions=False, save_results=True, scale_rate_range='0.7,1.3', split='train+', stop_epoch=None, test_flipping=False, test_scales=None, test_steps=1, to_epoch=500, weight_decay=0.0005, weights='train+_out/voc_rna-a1_cls21_ep-0240.params')
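If --lr-type linear behaves like a standard linear annealing schedule (an assumption on my part; I haven't verified issegm's implementation), the effective rate between from_epoch=240 and to_epoch=500 would be roughly:

```python
# Sketch of a standard linear learning-rate schedule; an assumption
# about what --lr-type linear does, not the repo's actual code.
def linear_lr(base_lr, epoch, start_epoch, end_epoch):
    """Anneal base_lr linearly to zero between start_epoch and end_epoch."""
    span = float(end_epoch - start_epoch)
    frac = max(0.0, (end_epoch - epoch) / span)
    return base_lr * min(1.0, frac)
```

Under that reading, the rate would start at 0.0016 at epoch 240 and reach zero at epoch 500, which may help interpret why progress slows late in the run.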

Now, after a total of about 410 epochs (I started reducing the learning rate at epoch 240), I am still achieving a maximum of only 54.77 mIOU on the validation set, much lower than the results presented in the paper. Any advice on how to improve would be greatly appreciated.
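For comparison purposes, this is how I understand mIOU to be conventionally computed on VOC: accumulate a confusion matrix over all non-ignore pixels, then average per-class IoU over the classes that appear. A minimal sketch (assuming 21 classes and 255 as the ignore label; not the repo's evaluation code):

```python
# Sketch of the standard mIOU computation for semantic segmentation.
# Assumes integer class labels in [0, num_classes) and 255 = ignore.
import numpy as np

def mean_iou(pred, label, num_classes=21, ignore=255):
    """Mean IoU over classes, skipping ignore pixels and absent classes."""
    mask = label != ignore
    pred, label = pred[mask], label[mask]
    # Confusion matrix: rows = true class, columns = predicted class.
    hist = np.bincount(num_classes * label + pred,
                       minlength=num_classes ** 2).reshape(num_classes,
                                                           num_classes)
    inter = np.diag(hist)
    union = hist.sum(axis=0) + hist.sum(axis=1) - inter
    valid = union > 0
    return (inter[valid] / union[valid].astype(float)).mean()
```

Differences in how the ignore label or absent classes are handled can shift the reported number by a few points, so it's worth confirming the same convention as the paper before comparing scores.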

mcever avatar Apr 23 '19 20:04 mcever

Hi @mcever, I'm also trying to reproduce the results on the VOC 2012 dataset. Have you managed to reproduce the results reported in the paper? If so, could you share your training command?

rulixiang avatar Jul 19 '20 04:07 rulixiang