
A question about the SSD training

Open HengLan opened this issue 6 years ago • 4 comments

Hi, albanie,

Thanks a lot for your work.

Recently, I have wanted to use SSD. To make sure SSD works fine, I first tried to train it on the provided VOC data without changing anything. However, during training the loss becomes weird, as follows:

ssd_pascal_train Experiment name: ssd-pascal-0712-vt-32-300-flip-patch-distort

Training set: 0712 Testing set: 07

Prune checkpoints: 0 GPU: 1 Batch size: 32

Train + val: 1 Flip: 1 Patches: 1 Zoom: 0 Distort: 1

Learning Rate Schedule: 0.001 0.001 (warmup) 0.001 for 73 epochs 0.0001 for 35 epochs

Run experiment with these parameters? y or n
y
Warning: The model appears to be simplenn model. Using fromSimpleNN instead.
  In dagnn.DagNN.loadobj (line 19)
  In ssd_zoo (line 29)
  In ssd_init (line 28)
  In ssd_train (line 19)
  In ssd_pascal_train (line 213)
Warning: The most recent version of vl_nnloss normalizes the loss by the batch size. The current version does not. A workaround is being used, but consider updating MatConvNet.
  In cnn_train_autonn (line 32)
  In ssd_train (line 20)
  In ssd_pascal_train (line 213)
cnn_train_autonn: resetting GPU

ans =

CUDADevice with properties:

                  Name: 'GeForce GTX 1080'
                 Index: 1
     ComputeCapability: '6.1'
        SupportsDouble: 1
         DriverVersion: 9.1000
        ToolkitVersion: 8
    MaxThreadsPerBlock: 1024
      MaxShmemPerBlock: 49152
    MaxThreadBlockSize: [1024 1024 64]
           MaxGridSize: [2.1475e+09 65535 65535]
             SIMDWidth: 32
           TotalMemory: 8.5899e+09
       AvailableMemory: 7.0175e+09
   MultiprocessorCount: 20
          ClockRateKHz: 1809500
           ComputeMode: 'Default'
  GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
      CanMapHostMemory: 1
       DeviceSupported: 1
        DeviceSelected: 1

train: epoch 01: 1/538: 2.6 (2.6) Hz conf_loss: 19.484 loc_loss: 2.778 mbox_loss: 22.263
train: epoch 01: 2/538: 2.9 (3.2) Hz conf_loss: 16.702 loc_loss: 2.812 mbox_loss: 19.514
train: epoch 01: 3/538: 3.0 (3.1) Hz conf_loss: 16.259 loc_loss: 2.886 mbox_loss: 19.145
train: epoch 01: 4/538: 3.0 (3.2) Hz conf_loss: 15.540 loc_loss: 2.796 mbox_loss: 18.336
train: epoch 01: 5/538: 3.2 (3.2) Hz conf_loss: 15.562 loc_loss: 2.812 mbox_loss: 18.374
train: epoch 01: 6/538: 3.2 (3.1) Hz conf_loss: 16.118 loc_loss: 2.863 mbox_loss: 18.981
train: epoch 01: 7/538: 3.2 (3.1) Hz conf_loss: 19.285 loc_loss: 3.871 mbox_loss: 23.156
train: epoch 01: 8/538: 3.2 (3.3) Hz conf_loss: 1031.250 loc_loss: 111.065 mbox_loss: 1142.315
train: epoch 01: 9/538: 3.2 (3.2) Hz conf_loss: 3470643036412389161829400576.000 loc_loss: 614700336259389324632522752.000 mbox_loss: 4085343446458754781300129792.000
train: epoch 01: 10/538: 3.2 (3.1) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 11/538: 3.2 (3.3) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 12/538: 3.2 (3.1) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 13/538: 3.2 (3.2) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 14/538: 3.2 (3.3) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 15/538: 3.2 (3.2) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 16/538: 3.2 (3.2) Hz conf_loss: NaN loc_loss: NaN mbox_loss: NaN
train: epoch 01: 17/538:
Operation terminated by user during Net/eval (line 136)

Could you give me some help to solve this problem, or explain how it happens?

Thanks

HengLan avatar Jun 09 '18 14:06 HengLan

By the way, I trained it on Windows 10 using MatConvNet-25. I do not know if this will affect the training.

HengLan avatar Jun 09 '18 14:06 HengLan

This was already pointed out in previous issues. The default learning rate values are too high, which leads to your problem; try lower values.
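For reference, the schedule printed in your log (0.001 warmup, 0.001 for 73 epochs, 0.0001 for 35 epochs) could be rebuilt with every value scaled down by a factor of ten. The lines below are only a rough sketch in plain MATLAB; the variable names and the way such a per-epoch vector is actually passed to cnn_train_autonn in mcnSSD are assumptions, so adapt it to your copy of ssd_pascal_train.m:

% Sketch only: a per-epoch learning rate vector matching the printed
% schedule (2 warmup epochs, 73 "steady" epochs, 35 "gentle" epochs),
% with every value lowered by a factor of ten. How such a vector is
% consumed by the training code is an assumption here.
warmup = 1e-4 * ones(1, 2);
steady = 1e-4 * ones(1, 73);
gentle = 1e-5 * ones(1, 35);
learningRate = [warmup, steady, gentle];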

zacr0 avatar Jun 16 '18 00:06 zacr0

Hi, I have the same problem during training, where the loss values become NaN. I tried lower values for the "steadyLR" and "gentleLR" parameters, but the problem still exists.

mnnejati avatar Jun 20 '18 17:06 mnnejati

Hi, @mnnejati ,

You may try changing steadyLR to 0.0001 and gentleLR to 0.00001 when training the SSD. I did so, and the training process has seemed normal so far.
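For example, something along these lines in ssd_pascal_train.m (a sketch only; exactly where these variables are set in the script may differ in your copy):

% Sketch: lower the two learning-rate parameters named in this thread.
steadyLR = 1e-4;   % lowered from the default 1e-3
gentleLR = 1e-5;   % lowered from the default 1e-4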

Best,

HengLan avatar Jun 26 '18 00:06 HengLan