SRGAN-tensorflow

Resume training not working?

Open nickhdfan opened this issue 7 years ago • 7 comments

So I set up an instance with 16 cores and 32GB RAM with a Tesla P100. Is there any way to increase training speed by editing the shell scripts for SRResNet and SRGAN? Right now the performance is the same as on my GTX 1060, which is unbearable. I've tried increasing batch_size to 48 and both queue capacities to 16384. Am I doing it wrong? Only 3GB of RAM is used. Update: I've reduced batch_size to 16 and increased the queue capacities to 32768, and the speed increased 1.5x, but that's still not what I expect from a Tesla P100.

nickhdfan avatar Feb 01 '18 12:02 nickhdfan

Hi, @nickhdfan

If you use batch_size 48 and get the same batches/sec rate as with batch_size 32, you are actually 1.5x faster, because your network processes (48 images / batch) * (x batches / sec) => 48x images / sec, which is faster than (32 images / batch) * (x batches / sec) => 32x images / sec. (Am I misunderstanding something?)

Actually, RAM is only used for the pre-processing. The detailed process I expect is as follows (see the sketch after this list):

  1. The CPU reads images from the dataset on disk and performs the preprocessing (random flip and crop, etc.).
  2. The pre-processed images are batched into training batches and cached in RAM by the CPU.
  3. The GPU reads the training batches and performs the forward and backward passes.
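
For concreteness, here is a minimal, hypothetical sketch of such a pipeline in the TF 1.x queue API this repo uses; the filenames and parameters below are placeholders, not the repo's actual data_loader code:

```python
import tensorflow as tf  # TF 1.x queue-based input pipeline

# Hypothetical sketch of the three steps above; paths and sizes are placeholders.
filenames = tf.train.match_filenames_once('./data/test_LR/*.png')
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)

# Step 1: the CPU reads an image from disk and preprocesses it (crop, flip).
reader = tf.WholeFileReader()
_, raw = reader.read(filename_queue)
image = tf.image.convert_image_dtype(tf.image.decode_png(raw, channels=3), tf.float32)
image = tf.random_crop(image, [24, 24, 3])
image = tf.image.random_flip_left_right(image)

# Step 2: CPU threads batch the preprocessed crops into an in-RAM queue.
batch = tf.train.shuffle_batch([image], batch_size=16, capacity=16384,
                               min_after_dequeue=8192, num_threads=32)

# Step 3: the training op consumes `batch`, and the GPU runs the forward
# and backward passes on it.
```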

In these simple three steps, there are actually many factors that affect your training speed. For example, even if you have a strong GPU and CPU, the process may be stuck on a slow hard disk (the CPU can only read a certain number of images per time step). If your hard disk, CPU, and GPU are all fast, the speed of the PCIe link between the CPU and GPU also matters.

Here is my recommendation: monitor the following things (a small queue-monitoring sketch follows the list):

  1. CPU usage (e.g. using htop)
  2. GPU usage (e.g. using nvidia-smi)
  3. The status of the queue (is the queue full all the time?)
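
For the queue status in particular, one option (a minimal stand-alone sketch, not code from this repo; the names are made up) is to log the queue's fill fraction to TensorBoard, since a chronically near-empty queue means the disk/CPU side is starving the GPU:

```python
import tensorflow as tf  # TF 1.x

# Hypothetical example: build a small producer/consumer queue and log how
# full it is. A queue that hovers near 0% full indicates the input pipeline
# is the bottleneck, not the GPU.
capacity = 16384
queue = tf.FIFOQueue(capacity=capacity, dtypes=[tf.float32], shapes=[[24, 24, 3]])

crop = tf.random_uniform([24, 24, 3])        # stands in for a real preprocessed crop
enqueue_op = queue.enqueue(crop)
tf.train.add_queue_runner(tf.train.QueueRunner(queue, [enqueue_op] * 4))

batch = queue.dequeue_many(16)               # what a training step would consume
fraction_full = tf.cast(queue.size(), tf.float32) / capacity
tf.summary.scalar('input_queue_fraction_full', fraction_full)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for _ in range(10):
        _, frac = sess.run([batch, fraction_full])
        print('queue is %.1f%% full' % (100 * frac))
    coord.request_stop()
    coord.join(threads)
```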

If none of the analysis above reveals the bottleneck, then the problem may also lie in the TensorFlow implementation.

Thanks!!

brade31919 avatar Feb 03 '18 04:02 brade31919

GPU usage is sometimes 90%, sometimes 40%, and most of the time 0%, and GPU VRAM usage is 8653MB. CPU usage is 40% with 10350MB RAM used. I don't know how to check the queue status, but these results make it look like neither the GPU nor the CPU is the bottleneck. I'm using an SSD, BTW, and my configuration is: batch_size=96, name_queue_capacity=65536, image_queue_capacity=65536.

Update: It seems that the queue capacity option is doing something undesirable, while increasing the batch_size to 160 does increase the VRAM usage and also the GPU usage, which is what I wanted. GPU usage spikes to 97% half of the time while VRAM usage is 9GB/16GB, and CPU usage is 95% all the time. What bothers me is that sometimes the GPU is at 0% usage. Does that mean it ran out of images to process, or has the CPU become the bottleneck?

Update 2: increasing the batch_size (rather than the image_queue_capacity) makes it 10x slower. I don't understand.

BTW, how long does training usually take? I've tried using the checkpoint method to resume training, but it simply restarts the training...

I don't know why, but the checkpoint thing does not work and my model starts training from 0. I've wasted a lot of time already.

nickhdfan avatar Feb 03 '18 07:02 nickhdfan

Here's my configuration:

#!/usr/bin/env bash
CUDA_VISIBLE_DEVICES=0 python main.py \
    --output_dir ./experiment_SRResnet/ \
    --summary_dir ./experiment_SRResnet/log/ \
    --mode train \
    --is_training True \
    --task SRResnet \
    --batch_size 48 \
    --flip True \
    --random_crop True \
    --crop_size 24 \
    --input_dir_LR ./data/test_LR/ \
    --input_dir_HR ./data/test_HR/ \
    --num_resblock 16 \
    --name_queue_capacity 16384 \
    --image_queue_capacity 16384 \
    --perceptual_mode MSE \
    --queue_thread 32 \
    --ratio 0.001 \
    --learning_rate 0.0001 \
    --decay_step 500000 \
    --decay_rate 0.1 \
    --stair False \
    --beta 0.9 \
    --max_iter 1000000 \
    --save_freq 20000 \
    --pre_trained_model False \
    --checkpoint ./experiment_SRResnet/model-620000

Why does it not resume?

nickhdfan avatar Feb 03 '18 10:02 nickhdfan

You need to make sure the path is correct and that you do have the corresponding checkpoints at that path. If it still happens, you can drop into pdb at the restoring process and check (lines 312~319 in main.py).
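
For reference, pdb is Python's built-in debugger. A tiny self-contained illustration of the suggestion (not the repo's actual code; the paths and variable names are made up) would be:

```python
import os
import pdb
import tensorflow as tf  # TF 1.x

# Toy illustration of "pdb at the restoring process": pause right before
# Saver.restore and inspect which checkpoint path is actually being used.
w = tf.get_variable('w', shape=[1], initializer=tf.zeros_initializer())
saver = tf.train.Saver()

if not os.path.exists('./tmp_ckpt'):
    os.makedirs('./tmp_ckpt')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, './tmp_ckpt/model', global_step=0)

    ckpt = tf.train.latest_checkpoint('./tmp_ckpt/')
    pdb.set_trace()            # at the (Pdb) prompt: `p ckpt`, then `c` to continue
    saver.restore(sess, ckpt)
```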

brade31919 avatar Feb 04 '18 15:02 brade31919

What is pdb? I've triple-checked that the path is correct, and it correctly identifies the checkpoint, but what it basically does is delete the model I referred to as the checkpoint and every model before it. E.g. if I pick model-360000 as my checkpoint, it will delete model-360000 and all the models before it, but it still trains from 0 to model-360000 first before deleting them. Basically a huge waste of time. And BTW, do I need those models 20000-360000 that were deleted by the program?

nickhdfan avatar Feb 08 '18 00:02 nickhdfan

There is a difference between resuming training and loading model weights.

Resume Training:

The correct syntax for your checkpoint file is not

--checkpoint ./experiment_SRResnet/model-620000

it is

--checkpoint ./experiment_SRResnet/

Also, make sure the following parameter is set accordingly.

--pre_trained_model False

This will find the checkpoint file in your folder and resume training.

Begin Training using existing weights:

When you want to load a fully trained model, like when you begin SRGAN training, use:

--pre_trained_model False --checkpoint ./experiment_SRResnet/model-620000
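
Putting the two cases side by side, here is a hedged sketch of what restore logic along these lines typically looks like in TF 1.x (the flag values mirror the shell scripts above, but the code itself is illustrative and not copied from main.py):

```python
import tensorflow as tf  # TF 1.x

# Illustrative sketch only: the flag values mirror the shell scripts, but the
# surrounding code is not the repo's.
checkpoint = './experiment_SRResnet/'   # a directory when resuming
pre_trained_model = False               # weights-only load when enabled

w = tf.get_variable('dummy', shape=[1])  # stand-in for the real network
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    if pre_trained_model:
        # Load one specific, fully trained checkpoint prefix (a file path),
        # e.g. when initializing SRGAN from a finished SRResnet model.
        saver.restore(sess, './experiment_SRResnet/model-620000')
    else:
        # Resume: point --checkpoint at the *directory* and let TensorFlow
        # pick the newest checkpoint recorded in its `checkpoint` index file.
        latest = tf.train.latest_checkpoint(checkpoint)
        if latest is not None:
            saver.restore(sess, latest)
```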

ryancom16 avatar Apr 19 '18 02:04 ryancom16

What does --pre_trained_model False mean? When I want to resume training with --checkpoint ./experiment_SRResnet/ and --pre_trained_model False, I get the error:

ValueError: The passed save_path is not a valid checkpoint: ./experiment_SRResnet/

When I remove --pre_trained_model and just do --checkpoint ./experiment_SRResnet/, the resume works.
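
One quick, hypothetical way to see what each --checkpoint value resolves to (the paths below are the ones from this thread and may not exist on your machine) is:

```python
import tensorflow as tf  # TF 1.x

# Directory form: returns the newest checkpoint prefix, or None if the folder
# has no `checkpoint` index file.
print(tf.train.latest_checkpoint('./experiment_SRResnet/'))

# File-prefix form: True only if the model-620000.index / .data files exist.
print(tf.train.checkpoint_exists('./experiment_SRResnet/model-620000'))
```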

Ianmcmill avatar Oct 30 '18 18:10 Ianmcmill