
Your hardware + GPU setup for training and average time to complete training

Open subzerofun opened this issue 7 years ago • 46 comments

First off, thanks for publishing the code! I've looked at a lot of SRGAN implementations on github but yours is the first with an understandable code structure for Tensorflow beginners.

I just wondered what your hardware setup looks like: CPU model, which GPU – and how many?

I'm using macOS 10.12.5 with an i7 6700k and two old crappy GPUs (for now, my GTX 1080 Ti arrives on Thursday! 🎉). Images get loaded from SSD (Read: ~600 MB/s).

| Model | GPU Memory | CUDA Cores | CUDA V. | GPU Max Clock | Mem Clock | Mem Bus Width |
|---|---|---|---|---|---|---|
| GTX 780 | 6144 MB | 2304 | 3.5 | 902 MHz | 3004 MHz | 384-bit |
| GTX 770 | 2048 MB | 1536 | 3.0 | 1110 MHz | 3505 MHz | 256-bit |
| GTX 1080 Ti (soon) | 11264 MB | 3584 | 6.1 | ~1700 MHz | 11010 MHz | 352-bit |

So for now I just have the two 700-series cards available. Since TensorFlow 1.2 doesn't offer GPU support for macOS, I had to revert to 1.1.

Is it normal that I have to wait around 10 minutes until the training actually starts? It takes 10 minutes from entering the command until I see the epochs counting up: `Epoch [60/2000] 199 time: 2.3310s, d_loss: 0.13401291 g_loss: 0.03180185 (mse: 0.015155 vgg: 0.014024 adv: 0.002623)`

I have completed the first 100 epochs of the SRGAN init and am now at epoch 60/2000 of the actual training. But it would take me ~11 days to reach epoch 2000!
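For reference, the ~11 days figure follows from simple arithmetic. The per-step time comes from the log line above; the steps-per-epoch value is an assumption (it depends on dataset size and batch size):

```python
# Back-of-the-envelope training-time estimate.
# sec_per_step is taken from the training log; steps_per_epoch is an
# illustrative assumption, not a value from the repo.
sec_per_step = 2.3310
steps_per_epoch = 200
epochs_left = 2000 - 60
days = epochs_left * steps_per_epoch * sec_per_step / 86400
print(f"~{days:.1f} days")  # ~10.5 days
```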

Do you think the 1080 Ti could cut the training time down to a bearable amount?

How long did it take for you to complete the 2000 Epochs?

And one more question (so I don't have to open a separate issue): when I'm done with training, can I input an image that's not from the training set to upscale it? Could you please explain what I would need to change/add in the code?

subzerofun avatar Jul 04 '17 19:07 subzerofun

Hi, I am using a Titan X Pascal with 32 GB RAM. I think the 1080 Ti should be OK ~~

It is not normal to wait 10 minutes for training to start. Maybe the compile time of SubpixelConv is very long on your machine? It may be caused by the CPU.

It takes me 2 days to complete the 2000 epochs when I use the DIV2K dataset.

After you train the model, you can input any image to upscale it; check the `evaluate()` function in `main.py`.
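The detail to get right when feeding your own image is matching the training normalization. A minimal numpy sketch, assuming (as in this repo's pipeline) that the generator has a tanh output and expects inputs scaled to [-1, 1] — the helper names are illustrative:

```python
import numpy as np

# Pre/post-processing sketch: scale uint8 images to [-1, 1] for the
# generator input, and map the tanh output back to uint8 pixels.
def to_model_range(img_uint8):
    return img_uint8.astype(np.float32) / 127.5 - 1.0

def to_uint8(model_out):
    return np.clip((model_out + 1.0) * 127.5, 0.0, 255.0).astype(np.uint8)

img = np.full((8, 8, 3), 255, dtype=np.uint8)  # stand-in for a loaded image
x = to_model_range(img)
print(x.max(), to_uint8(x).max())  # 1.0 255
```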

zsdonghao avatar Jul 04 '17 19:07 zsdonghao

Thanks for the quick answer! Can't figure out the reason for the long setup time...

The first steps go by pretty fast:

  [TL] FlattenLayer VGG19/flatten: 25088
  [TL] DenseLayer  VGG19/fc6: 4096 relu
  [TL] DenseLayer  VGG19/fc7: 4096 relu
  [TL] DenseLayer  VGG19/fc8: 1000 identity
build model finished: **0.553778s**
build model started
  [TL] FlattenLayer VGG19/flatten: 25088
  [TL] DenseLayer  VGG19/fc6: 4096 relu
  [TL] DenseLayer  VGG19/fc7: 4096 relu
  [TL] DenseLayer  VGG19/fc8: 1000 identity
build model finished: **0.136154s**

But then it freezes for a few minutes at

  [TL] SubpixelConv2d  SRGAN_g/pixelshufflerx2/2: scale: 2 n_out_channel: 64 act: relu
  [TL] Conv2dLayer SRGAN_g/out: shape:[1, 1, 64, 3] strides:[1, 1, 1, 1] pad:SAME act:tanh
  [*] geting variables with SRGAN_g

and then again for a few minutes at

got  31: SRGAN_d/res/bn3/gamma:0   (512,)
got  32: SRGAN_d/ho/dense/W:0   (18432, 1)
got  33: SRGAN_d/ho/dense/b:0   (1,)

Nevertheless, once it starts, the speed is OK (as good as it can be with my 700-series cards). I'm really excited to swap out the 770 on Thursday and see how well the 1080 Ti will do!

subzerofun avatar Jul 04 '17 22:07 subzerofun

Can you download the code?

Andreababy avatar Jul 08 '17 01:07 Andreababy

I cannot download it successfully. Do you know the reason? Thank you.

Andreababy avatar Jul 08 '17 01:07 Andreababy

What code do you mean specifically? The whole repository? Just open a terminal, `cd` into a folder where you store your projects and then execute: `git clone https://github.com/zsdonghao/SRGAN.git`

The output should be:

Cloning into 'SRGAN'...
remote: Counting objects: 228, done.
remote: Compressing objects: 100% (13/13), done.
Receiving objects: 100% (228/228), 117.82 MiB | 2.69 MiB/s, done.
Resolving deltas: 100% (111/111), done.

subzerofun avatar Jul 08 '17 14:07 subzerofun

@subzerofun I have the same concern as you. The model takes quite some time to actually get down to training. I suspect it is a CPU thing, since I'm running this on my home PC which packs significantly less CPU cores than a server. Did you find a way to speed things up?

rlayne avatar Jul 08 '17 17:07 rlayne

No, unfortunately not... The first time I tried, my CPU was reaching 3.8 GHz with Turbo Boost. I had forgotten to disable Speed Stepping in the BIOS, so I went back, set the CPU to run at 4.2 GHz at all times, turned Speed Stepping off, and now the training setup takes 2-3 minutes less.

I was already at epoch 570 of 2000 and then my SSD's disk space filled up... Argh. I lost all my progress because the .npz file with the weights is now corrupt...

So I have to start again from epoch 60... two days lost for nothing.

So make sure you back up your .npz files every now and then, and note the epoch you were at if you pause training, so you don't lose your progress like I did!
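A trivial backup snippet is enough for this; the checkpoint filename below is an assumption on my part — point it at wherever your weights are actually saved:

```python
import os
import shutil
import time

# Illustrative backup sketch; "checkpoint/g_srgan.npz" is an assumed path.
os.makedirs("checkpoint", exist_ok=True)
os.makedirs("backups", exist_ok=True)
open("checkpoint/g_srgan.npz", "ab").close()  # stand-in for the real weights file

# Copy the current weights to a timestamped file so a corrupted save
# never costs you more than the interval between backups.
ts = time.strftime("%Y%m%d_%H%M%S")
shutil.copy2("checkpoint/g_srgan.npz", f"backups/g_srgan_{ts}.npz")
print(sorted(os.listdir("backups")))
```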

I will try to compile TensorFlow from source, since I always get messages like:

  W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
  W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.

I don't know if that will speed things up, but I will report back if it does.

Another thing which may help is creating a RAM disk, copying all the repo files there and trying again.

In the CUDA documentation there is an option to set your cache folder, which you could also point to the RAM disk.

But since it's probably a CPU-related issue, I don't know how much this will affect things.
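If the delay really is CUDA's JIT compilation, redirecting the compute cache is one concrete thing to try. `CUDA_CACHE_PATH` and `CUDA_CACHE_MAXSIZE` are real CUDA environment variables; the directory below is just an example, and they must be set before CUDA initializes to take effect:

```python
import os
import tempfile

# Point the CUDA JIT compute cache at a fast location (e.g. a RAM disk).
# The path here is an example; substitute your own mount point.
cache_dir = os.path.join(tempfile.gettempdir(), "cuda_cache")
os.makedirs(cache_dir, exist_ok=True)

# Must be set before TensorFlow / CUDA initializes in this process.
os.environ["CUDA_CACHE_PATH"] = cache_dir
os.environ["CUDA_CACHE_MAXSIZE"] = str(2 * 1024**3)  # 2 GiB; default is much smaller
print(os.environ["CUDA_CACHE_PATH"])
```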

subzerofun avatar Jul 08 '17 19:07 subzerofun

SubpixelConv2d takes a long time to compile, but it is an important part of the original paper. However, if you like, you can use resize-convolution instead, i.e. use UpSamplingLayer and Conv2d in place of SubpixelConv2d, as `SRGAN_g2()` in `model.py` shows.

Check this link for resize-conv: http://distill.pub/2016/deconv-checkerboard/
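For anyone wondering what SubpixelConv2d actually computes: the rearrangement itself (often called pixel shuffle) is cheap at runtime; the cost here is compiling the op. A numpy sketch of the rearrangement, following the NHWC depth-to-space convention:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (N, H, W, C*r*r) -> (N, H*r, W*r, C), the core of SubpixelConv2d."""
    n, h, w, c = x.shape
    assert c % (r * r) == 0
    c_out = c // (r * r)
    x = x.reshape(n, h, w, r, r, c_out)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # interleave block rows/cols into H and W
    return x.reshape(n, h * r, w * r, c_out)

x = np.arange(16, dtype=np.float32).reshape(1, 2, 2, 4)
y = pixel_shuffle(x, 2)
print(y.shape)  # (1, 4, 4, 1)
```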

zsdonghao avatar Jul 08 '17 21:07 zsdonghao

@zsdonghao Will this affect the output quality? I'd rather wait a few minutes and have a better model, since the actual training takes a few days anyway, than use a shortcut and get worse results.

BTW, sorry to mention this in this thread, but could you please take a look at this issue? It won't take up much of your time! https://github.com/zsdonghao/Unsup-Im2Im/issues/9

subzerofun avatar Jul 08 '17 22:07 subzerofun

I cannot download the pretrained VGG19 model.

Andreababy avatar Jul 09 '17 03:07 Andreababy

Hmm. We probably need to figure out a way of caching that part of the program, if possible. I'm not advanced enough on these matters yet to know whether it is. On a 16-core machine perhaps it's less of an issue.

Good advice @subzerofun , cheers.

Yes, RAM disks are a standard approach to speeding things up if you have enough memory and the bottleneck is feeding the network. I'd just hold the data there; I don't expect putting the code itself there to help much (particularly on 'nix systems: after accessing the files at least once, I think the system will automagically cache them in memory anyway).

rlayne avatar Jul 09 '17 03:07 rlayne

@Andreababy If you can't download from mega.nz, try one of these solutions:

1. Check if a plugin/extension is affecting the site, or try another browser.
2. Use jDownloader, a download helper & organiser tool: http://jdownloader.org/jdownloader2. Install jDownloader, right-click the mega.nz link and copy it; as soon as it's in the clipboard, jDownloader will add it to your "LinkGrabber" tab. Right-click the link there (under the LinkGrabber tab) and select "Start Downloads". It should now be under the "Downloads" tab, and the file will end up in your standard OS downloads directory.
3. Download it here (only online until the 16th of July 2017!): https://we.tl/8vgsapn2qN

subzerofun avatar Jul 09 '17 12:07 subzerofun

Thank you very much!

Andreababy avatar Jul 09 '17 12:07 Andreababy

I've run into trouble again: I cannot download the "DIV2K - bicubic downscaling x4 competition" dataset. Can you give me a link? Thank you very much.

Andreababy avatar Jul 10 '17 02:07 Andreababy

@Andreababy I too had problems finding the dataset – there was no download link on the competition site...

Here you go:

- https://data.vision.ee.ethz.ch/cvl/DIV2K/validation_release/DIV2K_test_LR_bicubic_X4.zip
- https://data.vision.ee.ethz.ch/cvl/DIV2K/DIV2K_train_HR.zip
- https://data.vision.ee.ethz.ch/cvl/DIV2K/DIV2K_train_LR_bicubic_X4.zip
- https://data.vision.ee.ethz.ch/cvl/DIV2K/validation_release/DIV2K_valid_HR.zip
- https://data.vision.ee.ethz.ch/cvl/DIV2K/DIV2K_valid_LR_bicubic_X4.zip

What hardware do you plan to use for training?

subzerofun avatar Jul 10 '17 05:07 subzerofun

GPU GTX108

Andreababy avatar Jul 10 '17 12:07 Andreababy

Of the five links, which ones should I download? Which ones is the author using?

Andreababy avatar Jul 10 '17 12:07 Andreababy

@Andreababy You'll need the train and validation HR and LR links at minimum. You may as well get the test data too, but if I recall correctly, the code here only requires the training and validation sets. There is no HR set for the test set. If you need more data, simply scrape high-resolution images from the web and downsample them by a factor of 4 using bicubic resampling to create the corresponding "LR" set.
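The scrape-and-downsample step can be done with Pillow; a sketch (the random array stands in for a scraped image, and the sizes are illustrative):

```python
import numpy as np
from PIL import Image

# Build an HR/LR training pair by bicubic x4 downscaling, matching the
# DIV2K "bicubic x4" track. Replace the random array with a real image.
hr = Image.fromarray(np.random.randint(0, 256, (384, 384, 3), dtype=np.uint8))
lr = hr.resize((hr.width // 4, hr.height // 4), Image.BICUBIC)
print(hr.size, lr.size)  # (384, 384) (96, 96)
```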

rlayne avatar Jul 10 '17 12:07 rlayne

Thank you very much

Andreababy avatar Jul 10 '17 13:07 Andreababy

@subzerofun Can you tell me how to run this SRGAN on multi-GPU system?

42binwang avatar Jul 11 '17 09:07 42binwang

When you use the GTX 1080 Ti, how long does the whole training take? @subzerofun

Andreababy avatar Jul 12 '17 03:07 Andreababy

Follow the readme: first, download the whole code, then download the VGG19 model as the readme shows; second, download the dataset; finally, run the code. @42binwang

Andreababy avatar Jul 12 '17 03:07 Andreababy

@Andreababy This code seems to use only one GPU.

42binwang avatar Jul 12 '17 07:07 42binwang

@Andreababy My training is not finished yet; I'm at epoch 1700/2000. I will upload the weights once the training is complete. It should be ready on Friday.

subzerofun avatar Jul 13 '17 10:07 subzerofun

@42binwang I don't know about multi-GPU, but AFAIK TensorFlow automatically uses all available GPU devices. I'm currently running the training with two GPUs, but it doesn't seem much faster compared to my tests with one GPU...

subzerofun avatar Jul 13 '17 10:07 subzerofun

Waiting for your trained model. @subzerofun

Andreababy avatar Jul 13 '17 12:07 Andreababy

@subzerofun You need to allocate the work to each GPU explicitly. TensorFlow will only use the first GPU by default.

42binwang avatar Jul 13 '17 13:07 42binwang

When I use my own picture for testing, I encounter this problem. What should I do if I want to test my own picture? @subzerofun @zsdonghao

  ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,1916,2400,256]
  [[Node: SRGAN_g/n256s1/2/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](SRGAN_g/SRGAN_g/pixelshufflerx2/1/Relu, SRGAN_g/n256s1/2/W_conv2d/read)]]
  [[Node: SRGAN_g/out/Tanh/_349 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_835059_SRGAN_g/out/Tanh", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Andreababy avatar Jul 14 '17 12:07 Andreababy

@Andreababy OOM means "out of memory". You should try to free up more VRAM (close all other applications) or downscale the input image. How much memory does your card have?
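The OOM is unsurprising given the tensor shape in the error message; a quick calculation of what that single activation needs, assuming float32 (4 bytes per element):

```python
# Memory footprint of the tensor shape [1, 1916, 2400, 256] from the error.
n, h, w, c = 1, 1916, 2400, 256
bytes_needed = n * h * w * c * 4
print(f"{bytes_needed / 1024**3:.2f} GiB")  # 4.39 GiB -- for one intermediate tensor alone
```

And that is just one intermediate tensor; the full forward pass holds several of these at once, which is why large inputs have to be tiled or downscaled.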

subzerofun avatar Jul 14 '17 23:07 subzerofun

Can you tell me how to test my own picture?

Andreababy avatar Jul 14 '17 23:07 Andreababy