Image-Super-Resolution-via-Iterative-Refinement
Extremely Long Training Time
Hi,
I’m training the SR3 model on my own dataset, which consists of 2000 training samples and 100 validation samples. I’m using 4 A100 GPUs, but it’s taking over 3 hours to complete just 100 iterations (roughly 100 seconds per iteration). Is this normal? Do you have any suggestions for improving the training speed?
Thanks
Here is my training config:

phase: train
gpu_ids: [0, 1, 2, 3]
path:[
  log: experiments/sr_ffhq_241113_184019/logs
  tb_logger: experiments/sr_ffhq_241113_184019/tb_logger
  results: experiments/sr_ffhq_241113_184019/results
  checkpoint: experiments/sr_ffhq_241113_184019/checkpoint
  resume_state: None
  experiments_root: experiments/sr_ffhq_241113_184019
]
datasets:[
  train:[
    name: FLAIR_SR_Train
    mode: LRHR
    dataroot: dataset/train_224_320
    datatype: img
    l_resolution: 224
    r_resolution: 320
    batch_size: 8
    num_workers: 8
    use_shuffle: True
    data_len: -1
  ]
  val:[
    name: FLAIR_SR_Val
    mode: LRHR
    dataroot: /dataset/val_224_320
    datatype: img
    l_resolution: 224
    r_resolution: 320
    data_len: 3
  ]
]
model:[
  which_model_G: sr3
  finetune_norm: False
  unet:[
    in_channel: 6
    out_channel: 3
    inner_channel: 64
    channel_multiplier: [1, 2, 4, 8, 8]
    attn_res: []
    res_blocks: 1
    dropout: 0.2
  ]
  beta_schedule:[
    train:[
      schedule: linear
      n_timestep: 2000
      linear_start: 1e-06
      linear_end: 0.01
    ]
    val:[
      schedule: linear
      n_timestep: 2000
      linear_start: 1e-06
      linear_end: 0.01
    ]
  ]
  diffusion:[
    image_size: 320
    channels: 3
    conditional: True
  ]
]
train:[
  n_iter: 1000000
  val_freq: 1000.0
  save_checkpoint_freq: 1000.0
  print_freq: 50
  optimizer:[
    type: adam
    lr: 3e-06
  ]
  ema_scheduler:[
    step_start_ema: 5000
    update_ema_every: 1
    ema_decay: 0.9999
  ]
]
wandb:[
  project: super_resolution_flair
]
distributed: True
log_wandb_ckpt: False
log_eval: False
enable_wandb: False
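To help figure out where the time goes, here is a minimal timing sketch I can run (not part of the repo) that separates data-loading wait time from GPU compute time on a single GPU. The FolderDataset class, the hr_320 folder layout, and the small stand-in conv network are placeholders I made up for illustration, not the repo's LRHR dataset or the SR3 U-Net, so the GPU number is only a rough lower bound.

```python
# Minimal timing sketch (assumptions: one CUDA GPU, HR images in a
# placeholder "dataset/train_224_320/hr_320" folder; adapt paths as needed).
import glob
import time

import torch
import torch.nn as nn
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms


class FolderDataset(Dataset):
    """Loads HR images from a folder; a stand-in for the repo's LRHR dataset."""

    def __init__(self, root):
        self.paths = sorted(glob.glob(f"{root}/hr_320/*"))  # placeholder layout
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return self.to_tensor(Image.open(self.paths[idx]).convert("RGB"))


def main():
    device = torch.device("cuda")
    loader = DataLoader(
        FolderDataset("dataset/train_224_320"),  # placeholder path from the config
        batch_size=8, num_workers=8, shuffle=True, pin_memory=True,
    )

    # Stand-in network: a few convs at 320x320, NOT the SR3 U-Net,
    # so this only gives a lower bound on the per-step compute time.
    net = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, 3, padding=1),
    ).to(device)
    opt = torch.optim.Adam(net.parameters(), lr=3e-6)

    data_time, step_time = 0.0, 0.0
    t0 = time.perf_counter()
    for i, hr in enumerate(loader):
        t1 = time.perf_counter()
        data_time += t1 - t0                 # time spent waiting on the loader

        hr = hr.to(device, non_blocking=True)
        loss = (net(hr) - hr).pow(2).mean()  # dummy L2 loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        torch.cuda.synchronize()             # make GPU time measurable

        t0 = time.perf_counter()
        step_time += t0 - t1                 # time spent on the training step
        if i == 49:
            break

    print(f"avg data wait: {data_time / 50:.3f}s, avg GPU step: {step_time / 50:.3f}s")


if __name__ == "__main__":
    main()
```

If the data-wait number dominates, the bottleneck is I/O/preprocessing rather than the model itself; if the GPU step dominates, it points at the 320x320 resolution and per-iteration cost of the network.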