online-neural-doodle

Training a model fails

Open randomrandom opened this issue 9 years ago • 9 comments

Hi, I tried to run the command from the tutorial for model training, but it failed with the following error:

 CUDA_VISIBLE_DEVICES=0 th feedforward_neural_doodle.lua -model_name skip_noise_4 -masks_hdf5 data/starry/gen_doodles.hdf5 -batch_size 4 -num_mask_noise_times 0 -num_noise_channels 0 -learning_rate 1e-1 -half false
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/hdf5/group.lua:312: HDF5Group:read() - no such child 'style_img' for [HDF5Group 33554432 /]
stack traceback:
    [C]: in function 'error'
    /root/torch/install/share/lua/5.1/hdf5/group.lua:312: in function 'read'
    feedforward_neural_doodle.lua:49: in main chunk
    [C]: in function 'dofile'
    /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

any ideas why hdf5 might fail with such error?

randomrandom avatar Jul 15 '16 20:07 randomrandom
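For reference, one quick way to rule out an empty or truncated output file (e.g. if `generate.py` was interrupted) is to check for the fixed 8-byte signature that every valid HDF5 file begins with. This is a standard-library-only sketch; a file that passes this check can still lack the `style_img` child the training script expects, which a tool like h5py can confirm by listing the file's children.

```python
# Sanity-check sketch: a valid HDF5 file starts with the 8-byte signature
# \x89HDF\r\n\x1a\n. This only detects a missing/empty/truncated file; a
# file that passes can still be missing the 'style_img' group that
# feedforward_neural_doodle.lua reads.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"

def looks_like_hdf5(path):
    """Return True if the file exists and begins with the HDF5 signature."""
    try:
        with open(path, "rb") as f:
            return f.read(8) == HDF5_MAGIC
    except OSError:
        return False

# e.g. looks_like_hdf5("data/starry/gen_doodles.hdf5")
```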

did you generate hdf5 file first?

DmitryUlyanov avatar Jul 16 '16 08:07 DmitryUlyanov

yes, initially I thought that something in the generation didn't go well, since this script never completed:

 python generate.py --n_jobs 30 --n_colors 4 --style_image data/starry/style.png --style_mask data/starry/style_mask.png --out_hdf5 data/starry/gen_doodles.hdf5

even though a new hdf5 file was generated.

So I decided to try the sample command that you have put in the README - so it should use the sample hdf5 file from the repo, unfortunately it made no difference.

Is it possible that the two fail due to bad hdf5 setup?

randomrandom avatar Jul 16 '16 11:07 randomrandom

There's no sample hdf5 file, since it is too large. You should let the script run until it finishes.

DmitryUlyanov avatar Jul 16 '16 17:07 DmitryUlyanov

thanks, I'll try that! How much time does it take on your setup?

Do you advise increasing the number of jobs? I'm using a Tesla K10 setup.

randomrandom avatar Jul 16 '16 17:07 randomrandom

I managed to get it working; unfortunately it looks like the VRAM (3.5GB) is not enough. What's the best way to reduce the memory footprint?

p.s.: I'm familiar with Johnson's implementation and know what I can do there, but I still haven't read your blogpost and the code documentation :(

Edit 1: At first glance, it looks like reducing the batch_size and n_colors might do the trick? I had increased them to 8; maybe that's why it fails.

Edit 2: Is it even possible to squeeze the training into 3.5GB? I started going through the code and noticed that you are already doing a lot of memory optimizations (e.g. using cudnn and the ADAM optimizer).

randomrandom avatar Jul 16 '16 18:07 randomrandom

Try batch_size = 1 and do not change n_colors; you can also downsize the image, to 256x256 for example.

DmitryUlyanov avatar Jul 16 '16 21:07 DmitryUlyanov
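As a rough illustration of why those two knobs help: for convolutional feature maps, activation memory scales roughly linearly with batch size and with pixel count. The sketch below is an assumption-laden back-of-the-envelope estimate (the reference configuration of batch 4 at 512x512 is hypothetical, not taken from the repo), not a measurement.

```python
# Back-of-the-envelope sketch: treat activation memory as linear in batch
# size and in pixel count (side**2). The reference config (batch 4 at
# 512x512) is an illustrative assumption, not a measured baseline.
def relative_memory(batch_size, side, ref_batch=4, ref_side=512):
    """Activation memory relative to the reference configuration."""
    return (batch_size / ref_batch) * (side / ref_side) ** 2
```

Under these assumptions, dropping to batch_size=1 alone cuts activation memory to a quarter, and combining it with 256x256 inputs cuts it to one sixteenth, which is why those were the first things to try on a 3.5GB card.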

looks like batch_size=1 did the trick; I previously tried 2 and 3 with no success. Does this affect the quality or just the speed of the training?

randomrandom avatar Jul 17 '16 06:07 randomrandom

The quality will be ok, I used batch_size = 1, but at test time you need to experiment with model:evaluate() or model:training().

DmitryUlyanov avatar Jul 17 '16 08:07 DmitryUlyanov

BTW, do you recommend this repo for artistic neural transfer? To do it well, there should probably be some semantic analysis that determines the masks. Is there any other approach you can recommend?

randomrandom avatar Jul 18 '16 17:07 randomrandom