
Slow convergence

jakc4103 opened this issue 3 years ago · 2 comments

Hi

Thanks for the great work again.

I am wondering whether there might be something wrong with the learning rate scheduler. I tested "main_train_rrdb_psnr.py" and found that the learning rate is quickly scheduled down to "1.563e-06" from an early stage of training. Though I checked update_learning_rate and the options, the implementation seems fine.

But I observed much slower convergence compared to the implementation from xinntao. Did you observe the same phenomenon?
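(Editorial note on the number itself, hedged: assuming the MultiStepLR settings shown later in this thread, a base LR of 1e-4, gamma 0.5, and six milestones, the reported value 1.563e-06 is exactly the base LR with the decay factor applied once per milestone. That would suggest the scheduler behaves as if all six milestones had already been passed. A quick arithmetic check:)

```python
# Check: the logged LR of 1.563e-06 equals the base LR with gamma
# applied once per milestone (six milestones, gamma = 0.5 in the
# config posted below in this thread).
base_lr = 1e-4
gamma = 0.5
num_milestones = 6

lr = base_lr * gamma ** num_milestones
print(f"{lr:.3e}")  # prints 1.563e-06
```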

jakc4103 avatar Sep 16 '20 12:09 jakc4103

Change https://github.com/cszn/KAIR/blob/3eb3cc7776fa8c57e8ed7c71bfa8039beb4c6677/options/train_msrresnet_psnr.json#L65

The training speed of KAIR is slower than that of BasicSR by xintao, for at least three possible reasons:

1. https://github.com/xinntao/BasicSR/blob/master/docs/DatasetPreparation.md
2. https://github.com/xinntao/BasicSR/blob/14bafa5e03468775544f8711d7da7a61dbb3d664/basicsr/train.py#L13
3. https://github.com/xinntao/BasicSR/blob/14bafa5e03468775544f8711d7da7a61dbb3d664/basicsr/train.py#L35

cszn avatar Sep 16 '20 12:09 cszn

I checked G_scheduler_milestones; the setting is fine. The data indeed is not aligned for training, though I think that's not the major problem here. And I did not use distributed training.
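(Editorial note, hedged: under standard MultiStepLR semantics with one scheduler step per training iteration, the LR at iteration 100 should still be the base 1e-4, since the first milestone is 200000. A small pure-Python sketch of those semantics, modeled on PyTorch's MultiStepLR rather than taken from KAIR's actual code:)

```python
import bisect

def multistep_lr(base_lr, gamma, milestones, iteration):
    """LR under MultiStepLR-style semantics: gamma is applied once
    for each milestone the current iteration has already passed."""
    return base_lr * gamma ** bisect.bisect_right(milestones, iteration)

milestones = [200000, 400000, 600000, 800000, 1000000, 2000000]

print(multistep_lr(1e-4, 0.5, milestones, 100))     # 0.0001 (no milestone passed yet)
print(multistep_lr(1e-4, 0.5, milestones, 250000))  # 5e-05 (one milestone passed)
```

So an LR of 1.563e-06 at iteration 100 is consistent with the scheduler's internal step counter being far past all the milestones, not with the milestones themselves being wrong.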

Here is the log. I expected the learning rate to be near 1e-4, but it's 1.5e-6 throughout the training phase.

20-09-14 18:25:35.666 :   task: rrdb
  model: plain
  gpu_ids: [0]
  scale: 4
  n_channels: 3
  sigma: 0
  sigma_test: 0
  merge_bn: False
  merge_bn_startpoint: 400000
  path:[
    root: superresolution
    pretrained_netG: None
    task: superresolution/rrdb
    log: superresolution/rrdb
    options: superresolution/rrdb/options
    models: superresolution/rrdb/models
    images: superresolution/rrdb/images
  ]
  datasets:[
    train:[
      name: train_dataset
      dataset_type: sr
      dataroot_H: ./opensource_code/KAIR/trainsets/trainH
      dataroot_L: None
      H_size: 96
      dataloader_shuffle: True
      dataloader_num_workers: 8
      dataloader_batch_size: 16
      phase: train
      scale: 4
      n_channels: 3
    ]
    test:[
      name: test_dataset
      dataset_type: sr
      dataroot_H: ./opensource_code/KAIR/testsets/set5
      dataroot_L: None
      phase: test
      scale: 4
      n_channels: 3
    ]
  ]
  netG:[
    net_type: rrdb
    in_nc: 3
    out_nc: 3
    nc: 64
    nb: 23
    gc: 32
    ng: 2
    reduction: 16
    act_mode: R
    upsample_mode: upconv
    downsample_mode: strideconv
    init_type: orthogonal
    init_bn_type: uniform
    init_gain: 0.2
    scale: 4
  ]
  train:[
    G_lossfn_type: l1
    G_lossfn_weight: 1.0
    G_optimizer_type: adam
    G_optimizer_lr: 0.0001
    G_optimizer_clipgrad: None
    G_scheduler_type: MultiStepLR
    G_scheduler_milestones: [200000, 400000, 600000, 800000, 1000000, 2000000]
    G_scheduler_gamma: 0.5
    G_regularizer_orthstep: None
    G_regularizer_clipstep: None
    checkpoint_test: 500
    checkpoint_save: 1000
    checkpoint_print: 100
  ]
  opt_path: ./opensource_code/KAIR/options/train_rrdb_psnr.json
  is_train: True

20-09-14 18:25:35.666 : Random seed: 7237
20-09-14 18:25:35.731 : Number of train images: 800, iters: 50
20-09-14 18:25:40.496 : 
Networks name: RRDB
Params number: 16697987
Net structure:
RRDB(
  (model): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
----------
skip model and weight info here
----------

20-09-14 18:26:55.455 : <epoch:  1, iter:     100, lr:1.563e-06> G_loss: 4.142e-01 
20-09-14 18:28:11.919 : <epoch:  3, iter:     200, lr:1.563e-06> G_loss: 3.753e-01 
20-09-14 18:29:25.745 : <epoch:  5, iter:     300, lr:1.563e-06> G_loss: 1.823e-01 
20-09-14 18:30:39.381 : <epoch:  7, iter:     400, lr:1.563e-06> G_loss: 1.380e-01 
20-09-14 18:31:51.148 : <epoch:  9, iter:     500, lr:1.563e-06> G_loss: 1.618e-01 
20-09-14 18:31:51.498 : ---1-->   baby.bmp | 17.50dB
20-09-14 18:31:51.575 : ---2-->   bird.bmp | 15.07dB
20-09-14 18:31:51.772 : ---3--> butterfly.bmp | 12.41dB
20-09-14 18:31:51.880 : ---4-->   head.bmp | 20.18dB
20-09-14 18:31:51.991 : ---5-->  woman.bmp | 16.58dB
20-09-14 18:31:52.021 : <epoch:  9, iter:     500, Average PSNR : 16.35dB

20-09-14 18:33:03.716 : <epoch: 11, iter:     600, lr:1.563e-06> G_loss: 1.111e-01 
20-09-14 18:34:16.212 : <epoch: 13, iter:     700, lr:1.563e-06> G_loss: 1.332e-01 
20-09-14 18:35:28.055 : <epoch: 15, iter:     800, lr:1.563e-06> G_loss: 1.334e-01 
20-09-14 18:36:40.067 : <epoch: 17, iter:     900, lr:1.563e-06> G_loss: 1.078e-01 
20-09-14 18:37:51.909 : <epoch: 19, iter:   1,000, lr:1.563e-06> G_loss: 1.308e-01 
20-09-14 18:37:51.911 : Saving the model.
20-09-14 18:37:52.582 : ---1-->   baby.bmp | 17.79dB
20-09-14 18:37:52.670 : ---2-->   bird.bmp | 15.16dB
20-09-14 18:37:52.738 : ---3--> butterfly.bmp | 12.61dB
20-09-14 18:37:52.870 : ---4-->   head.bmp | 20.54dB
20-09-14 18:37:52.938 : ---5-->  woman.bmp | 16.97dB
20-09-14 18:37:52.969 : <epoch: 19, iter:   1,000, Average PSNR : 16.62dB

20-09-14 18:39:04.704 : <epoch: 21, iter:   1,100, lr:1.563e-06> G_loss: 1.053e-01 
20-09-14 18:40:18.259 : <epoch: 23, iter:   1,200, lr:1.563e-06> G_loss: 1.111e-01 
20-09-14 18:41:30.275 : <epoch: 25, iter:   1,300, lr:1.563e-06> G_loss: 1.112e-01 
20-09-14 18:42:43.876 : <epoch: 27, iter:   1,400, lr:1.563e-06> G_loss: 1.048e-01 
20-09-14 18:43:56.993 : <epoch: 29, iter:   1,500, lr:1.563e-06> G_loss: 1.095e-01 
20-09-14 18:43:57.203 : ---1-->   baby.bmp | 18.24dB
20-09-14 18:43:57.280 : ---2-->   bird.bmp | 15.36dB
20-09-14 18:43:57.349 : ---3--> butterfly.bmp | 13.02dB
20-09-14 18:43:57.416 : ---4-->   head.bmp | 20.86dB
20-09-14 18:43:57.493 : ---5-->  woman.bmp | 17.45dB
20-09-14 18:43:57.518 : <epoch: 29, iter:   1,500, Average PSNR : 16.99dB

jakc4103 avatar Sep 16 '20 13:09 jakc4103