Train loss doesn't decrease
Thank you for publishing the source code. Following the README, I first tried out "train_svhn.py", but the training loss does not go down. Could you check whether something is wrong?
- download "svhn_datasets_and_models.zip"
- edit generated/easy/train.csv and insert "4 4" as the first line
- prepare the curriculum specification (see the sketch at the end of this post)
[
{
"train": "<any directory>/svhn_dataset_and_models/generated/easy/train.csv",
"validation": "<any directory>/svhn_dataset_and_models/generated/easy/valid.csv"
}
]
- execute the training script
$ python train_svhn.py specification.json <log_dir> --blank-label 10 --char-map ../datasets/svhn/svhn_char_map.json -b 64 -g 0
<dir>/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:131: UserWarning: optimizer.eps is changed to 1e-08 by MultiprocessParallelUpdater for new batch size.
format(optimizer.eps))
epoch iteration main/loss main/accuracy lr fast_validation/main/loss fast_validation/main/accuracy validation/main/loss validation/main/accuracy
0 100 7.50537 0.00480469 0.00308566
0 200 6.50787 0.0170703 0.00425853
0 300 6.56413 0.0196875 0.00509208
0 400 7.11518 0.0114844 0.00574294
0 500 7.72572 0 0.00627392
0 600 7.69402 0 0.00671828
0 700 7.39172 0.00714844 0.0070964
0 800 6.21048 0.0192187 0.00742193
0 900 6.74564 0.0199219 0.00770463
0 1000 6.19807 0.0205859 0.00795176 6.12725 0.0204327
0 1100 6.18472 0.0188672 0.00816892
0 1200 6.18693 0.0186328 0.00836054
0 1300 6.69872 0.0203516 0.00853021
0 1400 6.78628 0.0203125 0.00868087
0 1500 6.78714 0.0190234 0.00881497
1 1600 6.71787 0.02 0.00893457
1 1700 6.40516 0.019375 0.00904141
1 1800 6.9848 0.0145703 0.00913701
1 1900 6.7551 0.0185156 0.00922265
1 2000 6.21905 0.0180078 0.00929946 6.17801 0.0204327
1 2100 6.07871 0.0210938 0.00936842
1 2200 6.2583 0.0188281 0.00943037
1 2300 6.10771 0.0191406 0.00948608
1 2400 6.42393 0.0198438 0.0095362
1 2500 6.79451 0.0196484 0.00958132
1 2600 6.73464 0.0185547 0.00962197
1 2700 6.38567 0.0184375 0.0096586
1 2800 6.06156 0.0200391 0.00969162
1 2900 6.13261 0.0175781 0.0097214
1 3000 6.06467 0.0183203 0.00974827 6.04983 0.0204327
1 3100 6.09032 0.0203906 0.00977252
2 3200 6.0301 0.0217969 0.0097944
2 3300 6.14339 0.0198828 0.00981416
2 3400 6.28762 0.0178125 0.00983201
2 3500 6.20747 0.0170313 0.00984812
2 3600 6.68129 0.0208984 0.00986268
2 3700 6.14636 0.0198828 0.00987584
2 3800 6.11925 0.0185156 0.00988773
2 3900 6.01957 0.0199609 0.00989847
2 4000 6.00062 0.0197656 0.00990818 6.0088 0.0204327
2 4100 6.01043 0.0193359 0.00991696
2 4200 6.02558 0.0192187 0.0099249
2 4300 6.60501 0.0197266 0.00993207
2 4400 6.14031 0.0174609 0.00993856
2 4500 6.09017 0.0189844 0.00994443
2 4600 6.07965 0.0204688 0.00994973
3 4700 7.72351 0.00117187 0.00995453
3 4800 7.69427 0 0.00995887
3 4900 7.6744 0 0.00996279
3 5000 7.648 0.00160156 0.00996634 10.9449 0
3 5100 7.67579 0 0.00996955
3 5200 7.33484 0.00734375 0.00997245
3 5300 7.32619 0.00722656 0.00997508
3 5400 7.13675 0.0139844 0.00997745
3 5500 7.6732 0 0.0099796
3 5600 7.67175 0 0.00998155
3 5700 7.68901 3.90625e-05 0.0099833
3 5800 7.66603 0 0.00998489
3 5900 7.52425 0.00371094 0.00998633
3 6000 7.47132 0.00558594 0.00998764 10.7991 0
3 6100 7.66778 0 0.00998881
3 6200 7.6659 7.8125e-05 0.00998988
4 6300 7.65129 0 0.00999084
4 6400 7.67035 0 0.00999172
4 6500 7.67038 0 0.0099925
4 6600 7.65043 0 0.00999322
4 6700 7.65205 0 0.00999386
4 6800 7.66811 3.90625e-05 0.00999445
4 6900 7.6578 0 0.00999498
4 7000 7.65613 0.000117187 0.00999546 7.66984 0
4 7100 7.56788 0.00222656 0.00999589
4 7200 7.66979 0 0.00999628
4 7300 7.5385 0.00460937 0.00999663
4 7400 7.65839 0.000664063 0.00999695
4 7500 7.66035 0 0.00999724
4 7600 7.67189 0 0.00999751
4 7700 7.66347 0.00015625 0.00999774
4 7800 7.64676 0 0.00999796
5 7900 7.66213 0.000234375 0.00999815
5 8000 7.64856 0.00015625 0.00999833 7.66626 0
5 8100 7.66833 0 0.00999849
5 8200 7.65137 0 0.00999863
5 8300 7.41944 0.0117578 0.00999876
5 8400 7.25518 0.0126172 0.00999888
5 8500 7.39664 0.015625 0.00999899
5 8600 7.61295 0.00492188 0.00999908
5 8700 7.87834 0.00515625 0.00999917
5 8800 7.74946 0.0028125 0.00999925
5 8900 7.71717 0 0.00999932
5 9000 7.67788 0 0.00999939 9.00867 0
5 9100 7.68219 7.8125e-05 0.00999944
5 9200 7.67184 0 0.0099995
5 9300 7.66276 0 0.00999954
6 9400 7.70615 0.0021875 0.00999959
6 9500 7.65457 0.00164062 0.00999963
6 9600 7.68174 0 0.00999966
6 9700 7.6786 0 0.0099997
6 9800 7.66229 0 0.00999972
6 9900 7.69749 0.000195313 0.00999975
6 10000 7.67097 0 0.00999977 7.68619 0
6 10100 7.69214 0 0.0099998
6 10200 7.63691 0.00105469 0.00999982
6 10300 7.72377 3.90625e-05 0.00999983
6 10400 7.64548 0 0.00999985
6 10500 7.68391 0 0.00999986
6 10600 7.66396 0 0.00999988
6 10700 7.64902 0 0.00999989
6 10800 7.67067 0 0.0099999
6 10900 7.64861 0 0.00999991
7 11000 7.66595 0 0.00999992 7.67054 0
7 11100 7.66005 0 0.00999992
7 11200 7.6663 0 0.00999993
7 11300 7.67519 0 0.00999994
7 11400 7.66069 0 0.00999994
7 11500 7.67084 0 0.00999995
7 11600 7.67509 0 0.00999995
7 11700 7.66532 0 0.00999996
7 11800 7.48501 0.00386719 0.00999996
7 11900 7.65868 0 0.00999997
7 12000 7.66759 0 0.00999997 7.66575 0
7 12100 7.64796 0 0.00999997
7 12200 7.642 0 0.00999998
7 12300 7.64562 0 0.00999998
7 12400 7.65264 0 0.00999998
8 12500 7.67088 0 0.00999998
8 12600 7.65142 0 0.00999998
8 12700 7.65538 0 0.00999998
8 12800 7.65558 0 0.00999999
8 12900 7.66965 0 0.00999999
8 13000 7.65551 0 0.00999999 7.77481 0
8 13100 7.57688 0.000195313 0.00999999
8 13200 7.6336 0 0.00999999
8 13300 6.82549 0.0161719 0.00999999
8 13400 6.67704 0.0145703 0.00999999
8 13500 7.42664 0.0046875 0.00999999
8 13600 7.64274 0 0.00999999
8 13700 7.43383 0.00761719 0.00999999
8 13800 7.60467 0.000390625 0.00999999
8 13900 6.867 0.0198438 0.01
8 14000 7.47555 0.00910156 0.01 7.68502 0
9 14100 7.65677 0 0.01
9 14200 7.5657 0.000273438 0.01
9 14300 7.61998 0 0.01
9 14400 7.61674 0.00382812 0.01
9 14500 7.68163 0 0.01
9 14600 7.65586 0 0.01
9 14700 7.41383 0.00523438 0.01
9 14800 7.31547 0.0123438 0.01
9 14900 7.66587 0 0.01
9 15000 7.65826 0.000625 0.01 8.35651 0
9 15100 7.65988 0 0.01
9 15200 7.65109 0 0.01
9 15300 7.63758 0 0.01
9 15400 7.64964 0 0.01
9 15500 7.63695 0 0.01
9 15600 7.64448 0 0.01
I get the same result even when I use images generated by "create_svhn_dataset_4_images.py".
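For reference, here is roughly how I prepared the CSV header and the curriculum specification. This is only a minimal sketch; the paths are placeholders for my local setup.

```python
import json
import os

# Placeholder for my local copy of the downloaded dataset.
data_dir = "/path/to/svhn_dataset_and_models/generated/easy"
train_csv = os.path.join(data_dir, "train.csv")

# Prepend the "4 4" header line to train.csv, as described above.
with open(train_csv) as csv_file:
    original_rows = csv_file.read()
with open(train_csv, "w") as csv_file:
    csv_file.write("4 4\n" + original_rows)

# Write the curriculum specification shown above.
specification = [{
    "train": os.path.join(data_dir, "train.csv"),
    "validation": os.path.join(data_dir, "valid.csv"),
}]
with open("specification.json", "w") as spec_file:
    json.dump(specification, spec_file, indent=4)
```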
The learning rate is too high!
You can check this by having a look at all images in the folder <log_dir>/bboxes. You should see that the predicted bboxes jump around a lot!
I suggest setting the learning rate to 1e-4 or 1e-5 with the command-line switch -lr or --learning-rate.
Thank you for your reply! Following your suggestion, the accuracy has improved, but it is still not good enough.
python train_svhn.py <dir>/svhn_dataset_and_models/specification.json ./work --gpus 3 --blank-label 10 --char-map ../datasets/svhn/svhn_char_map.json -b 64 --learning-rate 0.0001 --epoch 200
<dir>/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:131: UserWarning: optimizer.eps is changed to 1e-08 by MultiprocessParallelUpdater for new batch size.
format(optimizer.eps))
epoch iteration main/loss main/accuracy lr fast_validation/main/loss fast_validation/main/accuracy validation/main/loss validation/main/accuracy
0 100 7.30914 0.00507813 3.08566e-05
0 200 6.04959 0.0198828 4.25853e-05
0 300 5.97783 0.0191797 5.09208e-05
0 400 5.97065 0.0211719 5.74294e-05
0 500 5.94329 0.0212109 6.27392e-05
0 600 5.94245 0.0192969 6.71828e-05
0 700 5.95647 0.020625 7.0964e-05
0 800 5.89843 0.0215625 7.42193e-05
0 900 5.91704 0.0216406 7.70463e-05
0 1000 5.92901 0.0198828 7.95176e-05 6.10376 0.0174028
0 1100 5.92836 0.0207422 8.16892e-05
0 1200 5.92269 0.0225781 8.36054e-05
0 1300 5.89819 0.0221875 8.53021e-05
0 1400 5.89788 0.0219922 8.68087e-05
0 1500 5.88448 0.0226562 8.81497e-05
1 1600 5.89985 0.0208203 8.93457e-05
1 1700 5.87362 0.0215625 9.04141e-05
1 1800 5.87873 0.0211719 9.13701e-05
1 1900 5.88609 0.0219922 9.22265e-05
1 2000 5.86692 0.0218359 9.29946e-05 5.91479 0.021234
1 2100 5.87499 0.0230469 9.36842e-05
1 2200 5.90617 0.0214844 9.43037e-05
1 2300 5.88547 0.0224219 9.48608e-05
1 2400 5.88797 0.02125 9.5362e-05
1 2500 5.87188 0.0216797 9.58132e-05
1 2600 5.86528 0.0226953 9.62197e-05
1 2700 5.85535 0.0210156 9.6586e-05
1 2800 5.85524 0.0235156 9.69162e-05
1 2900 5.84669 0.0228906 9.7214e-05
1 3000 5.87788 0.0221484 9.74827e-05 5.86614 0.0217849
1 3100 5.86994 0.0215625 9.77252e-05
2 3200 5.84037 0.0232422 9.7944e-05
2 3300 5.83626 0.0228516 9.81416e-05
2 3400 5.79569 0.0234766 9.83201e-05
2 3500 5.78465 0.0230859 9.84812e-05
2 3600 5.74576 0.0256641 9.86268e-05
2 3700 5.76113 0.0221094 9.87584e-05
2 3800 5.73173 0.0233203 9.88773e-05
2 3900 5.731 0.02375 9.89847e-05
2 4000 5.71808 0.0241406 9.90818e-05 5.77324 0.0233624
2 4100 5.72374 0.0244141 9.91696e-05
2 4200 5.7117 0.0257031 9.9249e-05
2 4300 5.71056 0.0239453 9.93207e-05
2 4400 5.71849 0.0242188 9.93856e-05
2 4500 5.70581 0.02625 9.94443e-05
2 4600 5.68513 0.0243359 9.94973e-05
3 4700 5.70342 0.0253125 9.95453e-05
3 4800 5.67535 0.0263672 9.95887e-05
3 4900 5.67486 0.0269922 9.96279e-05
3 5000 5.64855 0.0248828 9.96634e-05 5.97818 0.0250651
3 5100 5.67777 0.0243359 9.96955e-05
3 5200 5.66923 0.0266016 9.97245e-05
3 5300 5.69099 0.0239063 9.97508e-05
~~~~~~~~~~~~~~~~~~~~~
198 310400 3.60274 0.178672 0.0001
198 310500 3.57621 0.180156 0.0001
198 310600 3.53797 0.181992 0.0001
198 310700 3.63558 0.180273 0.0001
198 310800 3.59109 0.179375 0.0001
198 310900 3.61663 0.176719 0.0001
199 311000 3.5567 0.180586 0.0001 6.24907 0.0733674
199 311100 3.60563 0.178945 0.0001
199 311200 3.66259 0.179727 0.0001
199 311300 3.59662 0.185078 0.0001
199 311400 3.53541 0.189766 0.0001
199 311500 3.58667 0.182734 0.0001
199 311600 3.52114 0.195586 0.0001
199 311700 3.58835 0.180156 0.0001
199 311800 3.5372 0.185391 0.0001
199 311900 3.57168 0.183438 0.0001
199 312000 3.47688 0.195195 0.0001 5.776 0.0995843
199 312100 3.48593 0.189102 0.0001
199 312200 3.55354 0.18082 0.0001
199 312300 3.48362 0.189219 0.0001
199 312400 3.52996 0.185586 0.0001
200 312500 3.53738 0.182422 0.0001
So, I have two questions:
- What accuracy should this training achieve?
- How many epochs does this network need to learn?
Do you mean what kind of accuracy is reported? If so, it is word-level accuracy.
Normally, the network does not need many epochs to learn. 10 should be enough to get a first decent version, but then you should restart the training with a randomly initialized recognition net while using the trained weights for the localization net.
Did you have a look at the pictures in <log_dir>/bboxes? What do the predicted bboxes look like? These images are there to give you a feeling for the performance of the network and whether you have set all hyperparameters correctly.
Sorry, what I wanted to know is how high "main/accuracy" gets when this script is executed. I also tried training again with different parameters, and the accuracy has improved.
I think I had not set a learning rate appropriate for the batch size.
$ python train_svhn.py <dir>/svhn_dataset_and_models/specification.json ./work --gpus 0 1 2 3 --blank-label 10 --char-map ../datasets/svhn/svhn_char_map.json -b 16 --learning-rate 0.0008 --epoch 100
<dir>/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:131: UserWarning: optimizer.eps is changed to 4e-08 by MultiprocessParallelUpdater for new batch size.
format(optimizer.eps))
epoch iteration main/loss main/accuracy lr fast_validation/main/loss fast_validation/main/accuracy validation/main/loss validation/main/accuracy
0 100 7.08026 0.00921875 0.000246853
0 200 6.48398 0.0171875 0.000340683
0 300 6.3467 0.0167188 0.000407367
0 400 6.07347 0.0220312 0.000459436
0 500 6.14111 0.0207813 0.000501914
0 600 6.07865 0.019375 0.000537463
0 700 6.07493 0.0190625 0.000567712
0 800 6.04775 0.0148437 0.000593755
0 900 5.98705 0.0223437 0.00061637
0 1000 5.96897 0.0203125 0.000636141 5.9492 0.0199219
0 1100 6.01165 0.0189062 0.000653513
0 1200 5.96329 0.02375 0.000668843
98 153400 0.903547 0.760156 0.0008
98 153500 0.897576 0.7625 0.0008
98 153600 0.916847 0.756719 0.0008
98 153700 0.984916 0.741094 0.0008
98 153800 0.970778 0.742188 0.0008
98 153900 0.985056 0.745938 0.0008
98 154000 0.990586 0.739531 0.0008 1.20713 0.706172
98 154100 0.939159 0.7525 0.0008
98 154200 0.932501 0.754062 0.0008
98 154300 1.02911 0.727812 0.0008
98 154400 0.991427 0.747344 0.0008
98 154500 1.03597 0.730156 0.0008
98 154600 0.975343 0.747031 0.0008
99 154700 0.984289 0.744844 0.0008
99 154800 0.911613 0.760312 0.0008
99 154900 0.865186 0.767969 0.0008
99 155000 0.953758 0.743125 0.0008 1.42776 0.676875
99 155100 0.94563 0.751875 0.0008
99 155200 0.951638 0.746406 0.0008
99 155300 0.908591 0.756875 0.0008
99 155400 0.926625 0.750469 0.0008
99 155500 0.969775 0.748125 0.0008
99 155600 0.981316 0.75 0.0008
99 155700 0.93322 0.755625 0.0008
99 155800 0.999408 0.736719 0.0008
99 155900 1.00439 0.739844 0.0008
99 156000 0.944262 0.749375 0.0008 1.29065 0.695078
99 156100 0.968931 0.749687 0.0008
99 156200 0.955661 0.747344 0.0008
However, it seems that localization is not working well.
If so, could you tell me a good way to improve it?

Hmm, two things:
- It is not necessarily good to set the learning rate to a value higher than 1e-4; you are using 8e-4 right now.
- In order to increase localization performance, you can try the following: locate the trained model (the .npz file) and restart the training using this model (-r or --resume + path to the model). You should also make the network load only the trained weights of the localization network and not the weights of the recognition network (use --load-localization for that). This should help (see the sketch below).
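To illustrate what loading only the localization weights means, here is a minimal, self-contained sketch with a toy two-part model, assuming a Chainer version whose load_npz accepts a path argument. The real model is structured differently, and the --load-localization switch of train_svhn.py is the supported way to do this; the attribute names below are only for illustration.

```python
import chainer
import chainer.links as L
from chainer.serializers import load_npz, save_npz


# Toy stand-in for the real model: one "localization" part and one
# "recognition" part (names chosen only for this example).
class ToyModel(chainer.Chain):
    def __init__(self):
        super().__init__()
        with self.init_scope():
            self.localization_net = L.Linear(32, 6)
            self.recognition_net = L.Linear(32, 11)


# Pretend this snapshot comes from the first training run.
trained = ToyModel()
save_npz("trained_model.npz", trained)

# Fresh model: only the localization weights are restored; the recognition
# net keeps its random initialization. The `path` argument restricts
# deserialization to that sub-tree of the saved arrays.
fresh = ToyModel()
load_npz("trained_model.npz", fresh.localization_net, path="localization_net/")
```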