Your hardware + GPU setup for training and average time to complete training
First off, thanks for publishing the code! I've looked at a lot of SRGAN implementations on GitHub, but yours is the first with a code structure that's understandable for TensorFlow beginners.
I just wondered what your hardware setup looks like: CPU model, which GPU, and how many?
I'm using macOS 10.12.5 with an i7 6700K and two old, crappy GPUs (for now; my GTX 1080 Ti arrives on Thursday! 🎉). Images get loaded from an SSD (read: ~600 MB/s).
| Model | GPU Memory | CUDA Cores | Compute Capability | GPU Max Clock Rate | Memory Clock Rate | Memory Bus Width |
|---|---|---|---|---|---|---|
| GTX 780 | 6144 MB | 2304 | 3.5 | 902 MHz | 3004 MHz | 384-bit |
| GTX 770 | 2048 MB | 1536 | 3.0 | 1110 MHz | 3505 MHz | 256-bit |
| GTX 1080 Ti (soon!) | 11264 MB | 3584 | 6.1 | ~1700 MHz | 11010 MHz | 352-bit |
So for now I just have the two 700-series cards available. Since TensorFlow 1.2 doesn't offer GPU support for macOS, I had to revert to 1.1.
Is it normal that I have to wait around 10 minutes until the training actually starts? It takes 10 minutes from entering the command until I see the epochs ticking by:
```
Epoch [60/2000] 199 time: 2.3310s, d_loss: 0.13401291 g_loss: 0.03180185 (mse: 0.015155 vgg: 0.014024 adv: 0.002623)
```
I have completed the first 100 epochs of the SRGAN initialization and am now at epoch 60/2000 of the actual training. But at this rate it would take ~11 days to reach epoch 2000!
Do you think the 1080 Ti could cut the training time down to a bearable amount?
How long did it take you to complete the 2000 epochs?
And one more question (so I don't have to open a separate issue): when I'm done with training, can I input an image that's not from the training set to upscale it? Could you please explain what I would need to change/add in the code?
Hi, I am using a Titan X Pascal with 32 GB RAM. I think a 1080 Ti should be OK ~~
It is not normal to wait 10 minutes before training starts. Maybe the compile time of SubpixelConv is very long on your machine? It may be caused by the CPU.
It took me 2 days to complete 2000 epochs using the DIV2K dataset.
After you train the model, you can input any image to upscale it; check `evaluate()` in main.py.
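In rough form, that evaluation flow looks something like the sketch below (the checkpoint path `checkpoint/g_srgan.npz` and the use of scipy.misc are assumptions based on the repo's defaults; treat it as a sketch, not the exact script):

```python
import scipy.misc
import tensorflow as tf
import tensorlayer as tl
from model import SRGAN_g

# Hedged sketch: upscale an image that was never in the training set,
# assuming the repo's SRGAN_g() and a trained checkpoint/g_srgan.npz.
lr_img = scipy.misc.imread('my_photo.png', mode='RGB')
lr_img = (lr_img / 127.5) - 1  # the generator expects inputs in [-1, 1]

t_image = tf.placeholder(tf.float32, [1] + list(lr_img.shape), name='input_image')
net_g = SRGAN_g(t_image, is_train=False, reuse=False)

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
tl.layers.initialize_global_variables(sess)
tl.files.load_and_assign_npz(sess=sess, name='checkpoint/g_srgan.npz', network=net_g)

out = sess.run(net_g.outputs, {t_image: [lr_img]})
tl.vis.save_image(out[0], 'my_photo_4x.png')  # 4x upscaled output
```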
Thanks for the quick answer! I can't figure out the reason for the long setup time...
The first steps go by pretty fast:
```
[TL] FlattenLayer VGG19/flatten: 25088
[TL] DenseLayer VGG19/fc6: 4096 relu
[TL] DenseLayer VGG19/fc7: 4096 relu
[TL] DenseLayer VGG19/fc8: 1000 identity
build model finished: 0.553778s
build model started
[TL] FlattenLayer VGG19/flatten: 25088
[TL] DenseLayer VGG19/fc6: 4096 relu
[TL] DenseLayer VGG19/fc7: 4096 relu
[TL] DenseLayer VGG19/fc8: 1000 identity
build model finished: 0.136154s
```
But then it freezes for a few minutes at:
```
[TL] SubpixelConv2d SRGAN_g/pixelshufflerx2/2: scale: 2 n_out_channel: 64 act: relu
[TL] Conv2dLayer SRGAN_g/out: shape:[1, 1, 64, 3] strides:[1, 1, 1, 1] pad:SAME act:tanh
[*] geting variables with SRGAN_g
```
and then again for a few minutes at:
```
got 31: SRGAN_d/res/bn3/gamma:0 (512,)
got 32: SRGAN_d/ho/dense/W:0 (18432, 1)
got 33: SRGAN_d/ho/dense/b:0 (1,)
```
Nevertheless, once it starts, the speed is OK (as good as it can be with my 700-series cards). I'm really excited to swap out the 770 on Thursday and see how well the 1080 Ti does!
Can you download the code?
I cannot download it successfully. Do you know the reason? Thank you.
What code do you mean specifically? The whole repository? Just open a terminal, cd into a folder where you store your projects, and then execute:
```
git clone https://github.com/zsdonghao/SRGAN.git
```
The output should be:
```
Cloning into 'SRGAN'...
remote: Counting objects: 228, done.
remote: Compressing objects: 100% (13/13), done.
Receiving objects: 100% (228/228), 117.82 MiB | 2.69 MiB/s, done.
Resolving deltas: 100% (111/111), done.
```
@subzerofun I have the same concern as you. The model takes quite some time to actually get down to training. I suspect it is a CPU thing, since I'm running this on my home PC, which packs significantly fewer CPU cores than a server. Did you find a way to speed things up?
No, unfortunately not... The first time I tried, my CPU was reaching 3.8 GHz with Turbo Boost. I had forgotten to disable SpeedStep in the BIOS, so I went back, set the CPU to run at 4.2 GHz at all times, turned off SpeedStep, and now the training setup takes 2-3 minutes less.
I was already at epoch 570 of 2000 and then my SSD's disk space was full... Argh. I lost all my progress because the .npz file with the weights is now corrupt...
So I have to start again from epoch 60... two days lost for nothing.
So make sure you back up your .npz files every now and then, and note the epoch you were at if you pause training, so that you don't lose your progress like I did!
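If it helps anyone, a minimal backup helper might look like this (the path `checkpoint/g_srgan.npz` is an assumption based on the repo's defaults; adjust to your setup):

```python
import shutil

# Hypothetical helper: keep a per-epoch copy of the generator weights so a
# corrupt or half-written g_srgan.npz doesn't wipe out all progress.
def backup_checkpoint(epoch, src='checkpoint/g_srgan.npz'):
    dst = src.replace('.npz', '_epoch%04d.npz' % epoch)
    shutil.copy2(src, dst)  # copy2 preserves file timestamps

# e.g. call it in the training loop every 50 epochs:
# if epoch % 50 == 0:
#     backup_checkpoint(epoch)
```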
I will try to compile TensorFlow from source, since I always get messages like:
```
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
```
I don't know if that will speed things up, but I will report back if it does.
Another thing that may help is creating a RAM disk, copying all the repo files there, and trying again.
The CUDA documentation describes an option to set the compute-cache folder (CUDA_CACHE_PATH), which you could also point at the RAM disk.
But since it's probably a CPU-related issue, I don't know how much this will affect things.
`SubpixelConv2d` takes a long time to compile, but it is an important part of the original paper.
However, if you like, you can use resize-convolution, i.e. use `UpSampling2dLayer` and `Conv2d` instead of `SubpixelConv2d`, as `SRGAN_g2()` in model.py shows.
Check this link for resize-conv: http://distill.pub/2016/deconv-checkerboard/
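For anyone curious, one x2 resize-convolution block might look roughly like this in TensorLayer 1.x (the input shape and layer names here are just illustrative; `SRGAN_g2()` in model.py has the author's actual version):

```python
import tensorflow as tf
from tensorlayer.layers import InputLayer, UpSampling2dLayer, Conv2d

# Sketch of a single x2 resize-convolution block (assumes TensorLayer 1.x).
# Nearest-neighbour upsampling followed by a plain conv sidesteps the slow
# SubpixelConv2d graph build and avoids deconvolution checkerboard artifacts.
t_image = tf.placeholder(tf.float32, [None, 96, 96, 64])
net = InputLayer(t_image, name='in')
net = UpSampling2dLayer(net, size=[2, 2], is_scale=True, method=1, name='up2x')  # method=1: nearest neighbour
net = Conv2d(net, 64, (3, 3), (1, 1), act=tf.nn.relu, padding='SAME', name='conv_up2x')
```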
@zsdonghao Will this affect the output quality? I'd rather wait a few minutes and have a better model, given that the actual training takes a few days, than use a shortcut and get worse results.
BTW, sorry to mention this in this thread, but could you please take a look at this issue? It won't take up much of your time! https://github.com/zsdonghao/Unsup-Im2Im/issues/9
I cannot download the pretrained VGG19 model.
Hmm. We probably need to figure out a way of caching that part of the program then, if possible. I'm not advanced enough in these matters yet to know whether it's possible. On a 16-core machine it's perhaps less of an issue.
Good advice @subzerofun, cheers.
Yes, RAM disks are a standard approach to speeding things up if you have enough memory and the bottleneck is feeding the network. I'd just hold the data there; I don't expect putting the code itself there to help much (particularly on *nix systems, after the files have been accessed at least once, I think the system will automagically cache them in memory anyway).
@Andreababy If you can't download from mega.nz, try one of these solutions:
- Check if a plugin/extension is affecting the site, or try another browser.
- Use jDownloader, a download helper & organiser tool: http://jdownloader.org/jdownloader2. Install jDownloader, right-click the link and copy it; as soon as it's in the clipboard, jDownloader will add it to its "LinkGrabber" tab. Right-click on the mega.nz link (under the LinkGrabber tab) and select "Start Downloads". It should then appear under the "Downloads" tab, and the file will land in your OS's standard downloads directory.
- Download it here (only online until the 16th of July 2017!): https://we.tl/8vgsapn2qN
Thank you very much.
I've run into trouble again: I cannot download the "DIV2K - bicubic downscaling x4 competition" dataset. Can you give me a link? Thank you very much.
@Andreababy I too had problems finding the dataset – there was no download link on the competition site...
Here you go:
- https://data.vision.ee.ethz.ch/cvl/DIV2K/validation_release/DIV2K_test_LR_bicubic_X4.zip
- https://data.vision.ee.ethz.ch/cvl/DIV2K/DIV2K_train_HR.zip
- https://data.vision.ee.ethz.ch/cvl/DIV2K/DIV2K_train_LR_bicubic_X4.zip
- https://data.vision.ee.ethz.ch/cvl/DIV2K/validation_release/DIV2K_valid_HR.zip
- https://data.vision.ee.ethz.ch/cvl/DIV2K/DIV2K_valid_LR_bicubic_X4.zip
What hardware do you plan to use for training?
A GTX 1080 GPU.
Of the five links, which ones should I download? Which ones is the author using?
@Andreababy You'll need the train and validation HR and LR links at minimum. You may as well get the test data too, but if I recall correctly, the code here only requires the training and validation sets. (There is no HR set for the test set.) If you need more data, simply scrape high-resolution images from the web and downsample them by a factor of 4 using bicubic resampling to create the corresponding "LR" set.
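If it helps, that downsampling step is only a few lines with Pillow (directory names here are placeholders):

```python
import os
from PIL import Image

# Build the "LR" set by bicubic-downsampling each HR image by a factor of 4.
hr_dir, lr_dir = 'my_train_HR', 'my_train_LR_bicubic_X4'  # placeholder paths
os.makedirs(lr_dir, exist_ok=True)
for name in os.listdir(hr_dir):
    img = Image.open(os.path.join(hr_dir, name))
    w, h = img.size
    img.resize((w // 4, h // 4), Image.BICUBIC).save(os.path.join(lr_dir, name))
```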
Thank you very much
@subzerofun Can you tell me how to run this SRGAN on a multi-GPU system?
How long did the whole training take when you used the GTX 1080 Ti? @subzerofun
Follow the readme: first, download the whole codebase and the VGG19 model as the readme shows; second, download the dataset; finally, run the code. @42binwang
@Andreababy This code seems to use only one GPU.
@Andreababy My training is not finished yet; I'm at epoch 1700/2000. I will upload the weights once training is complete. They should be ready on Friday.
@42binwang I don't know about multi-GPU, but AFAIK TensorFlow automatically uses all available GPU devices. I'm currently running the training with two GPUs, but it doesn't seem much faster compared to my tests with one GPU...
Waiting for your trained model. @subzerofun
@subzerofun You need to allocate the work to each GPU explicitly. By default, TensorFlow only places ops on the first GPU.
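In TF 1.x that means explicit device scopes; a generic two-tower sketch (not code from this repo, and `build_tower` is a hypothetical stand-in for part of the real graph) looks roughly like this:

```python
import tensorflow as tf

def build_tower(x):
    # hypothetical stand-in for a slice of the real model graph
    return tf.layers.conv2d(x, 64, 3, padding='same', activation=tf.nn.relu)

inputs = tf.placeholder(tf.float32, [16, 96, 96, 3])
halves = tf.split(inputs, 2, axis=0)  # split the batch between two GPUs
outputs = []
for i, part in enumerate(halves):
    # share one set of weights across towers; pin each tower to its own card
    with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
        outputs.append(build_tower(part))
merged = tf.concat(outputs, axis=0)  # for training, gradients would also need averaging
```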
When I use my own picture for testing, I encounter this problem. What should I do if I want to test my own picture? @subzerofun @zsdonghao
```
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,1916,2400,256]
[[Node: SRGAN_g/n256s1/2/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](SRGAN_g/SRGAN_g/pixelshufflerx2/1/Relu, SRGAN_g/n256s1/2/W_conv2d/read)]]
[[Node: SRGAN_g/out/Tanh/_349 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_835059_SRGAN_g/out/Tanh", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
```
@Andreababy `OOM` means "out of memory". You should try to free up more VRAM (close all other applications) or downscale the image. How much memory does your card have?
Can you tell me how to test my own picture?