online-neural-doodle
Training a model fails
Hi, I tried to run the command from the tutorial for model training, but it failed with the following error:
CUDA_VISIBLE_DEVICES=0 th feedforward_neural_doodle.lua -model_name skip_noise_4 -masks_hdf5 data/starry/gen_doodles.hdf5 -batch_size 4 -num_mask_noise_times 0 -num_noise_channels 0 -learning_rate 1e-1 -half false
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/hdf5/group.lua:312: HDF5Group:read() - no such child 'style_img' for [HDF5Group 33554432 /]
stack traceback:
[C]: in function 'error'
/root/torch/install/share/lua/5.1/hdf5/group.lua:312: in function 'read'
feedforward_neural_doodle.lua:49: in main chunk
[C]: in function 'dofile'
/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
Any ideas why hdf5 might fail with such an error?
Did you generate the hdf5 file first?
Yes. Initially I thought something had gone wrong with the generation, since this script never completed:
python generate.py --n_jobs 30 --n_colors 4 --style_image data/starry/style.png --style_mask data/starry/style_mask.png --out_hdf5 data/starry/gen_doodles.hdf5
even though a new hdf5 file was generated.
So I decided to try the sample command from the README, which should use the sample hdf5 file from the repo; unfortunately, that made no difference.
Is it possible that both fail because of a bad hdf5 setup?
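For reference, here is a minimal sanity-check sketch (assuming h5py is available; the file path and the 'style_img' dataset name are taken from the commands and the error above, not from the repo's own tooling) to see whether the generated file actually contains what the trainer expects:

import h5py

# Hypothetical check, not part of the repo: list the top-level entries of
# the generated hdf5 file and verify that 'style_img' is present.
with h5py.File('data/starry/gen_doodles.hdf5', 'r') as f:
    print(list(f.keys()))
    if 'style_img' not in f:
        print("'style_img' is missing -- generation probably did not finish; "
              "rerun generate.py and let it run to completion.")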
There's no sample hdf5 file, since it is too large. You should let the generation script run until it finishes.
Thanks, I'll try that! How much time does it take on your setup?
Do you advise increasing the number of jobs? I'm using a Tesla K10 setup.
I managed to get it working; unfortunately, it looks like 3.5 GB of VRAM is not enough. What's the best way to reduce the memory footprint?
P.S.: I'm familiar with Johnson's implementation and know what I can do there, but I still haven't read your blog post and the code documentation :(
Edit 1: At first glance it looks like reducing batch_size and n_colors might do the trick? I had increased them to 8, maybe that's why it fails.
Edit 2: Is it even possible to squeeze the training into 3.5 GB? I started going through the code and noticed that you are already doing a lot of memory optimizations (e.g. using cudnn and the Adam optimizer).
Try batch_size = 1 and do not change n_colors; you can also downsize the image to 256x256, for example.
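A minimal sketch of that downsizing step (assuming Pillow is installed; the file names follow the generate.py command above, and the resized copies are written to new hypothetical paths such as style_256.png so the originals are kept):

from PIL import Image

size = (256, 256)

# Bilinear resampling is fine for the photo-like style image.
style = Image.open('data/starry/style.png').resize(size, Image.BILINEAR)
style.save('data/starry/style_256.png')

# Nearest-neighbour resampling keeps the mask's discrete label colours intact.
mask = Image.open('data/starry/style_mask.png').resize(size, Image.NEAREST)
mask.save('data/starry/style_mask_256.png')

Then point --style_image and --style_mask at the resized files when rerunning generate.py.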
Looks like batch_size = 1 did the trick; I had previously tried 2 and 3 with no success. Does this affect the quality or just the speed of training?
The quality will be OK, I used batch_size = 1, but at test time you need to experiment with model:evaluate() or model:training().
BTW, do you recommend this repo for artistic neural style transfer? To do it well, there should probably be some semantic analysis that determines the masks. Is there any other approach you can recommend?