big-discriminator-batch-spoofing-gan
Training not progressing at 1024x1024 on multi-GPU
I ran it for a few hours on 8 GPUs without any progress. Each sample is a pixel-wise copy of the others at every resolution.
[Attached sample grids at 4x4 and 64x64 resolutions]
setup
git clone git@github.com:akanimax/BBMSG-GAN.git
conda create -n bbmsg python==3.7
conda activate bbmsg
conda install pytorch torchvision cudatoolkit=10.0 cudnn scipy==1.2.0 tensorboard -c pytorch
pip install tensorboardX tqdm
training
# calc real fid stats before training
python ../BBMSG-GAN/sourcecode/train.py \
--images_dir="$IMGS" \
--sample_dir="$SAMPLES" \
--model_dir="$MODELS" \
--depth=9 \
--batch_size=24 \
--num_samples=36 \
--feedback_factor=5 \
--checkpoint_factor=1 \
--num_epochs=50000 \
--num_workers=90 \
--log_fid_values=True \
--fid_temp_folder=/tmp/fid_tmp \
--fid_real_stats="$FID" \
--fid_batch_size=64 \
--num_fid_images=5000
@tomasheiskanen, Thanks a lot for the detailed description of the issue. This seems to be an intricate bug. I'll have to reproduce the same problem in order to gauge what exactly is going wrong. Here are a few things you could try till then:
1.) The implementation was made using Python 3.5.6 and PyTorch 1.0.0 (I should mention that in the README).
2.) Could you try turning off the spoofing factor, i.e. spoofing_factor=1?
3.) Could you also try running on a single GPU and see if that works? Maybe this is a multi-GPU issue.
4.) Check whether two consecutive logged images have at least some difference. If they are exact copies of each other, then gradients are not being propagated properly. (A rough sketch covering points 2–4 follows after this list.)
I admit this is a lot of work :smile:, but I'll definitely also try to find out what went wrong. Unfortunately, I am getting less time these days to address problems like this. Please feel free to open a PR if you find the bug.
5.) If it is not too much: could you also try BMSG-GAN? It is run the same way as this repo, just with fewer options (no spoofing factor, for instance).
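To make points 2–4 concrete, here is a rough, untested sketch. It reuses the flag names from your command above, assumes the script trains on every GPU that CUDA exposes (so restricting CUDA_VISIBLE_DEVICES is one way to force a single-GPU run without touching the code), and the sample filenames at the end are placeholders for two consecutively logged grids from your --sample_dir:
# points 2 & 3: single-GPU run with batch spoofing turned off
# (batch_size=3 is just a guess for what fits on one GPU at depth 9)
CUDA_VISIBLE_DEVICES=0 python BBMSG-GAN/sourcecode/train.py \
--images_dir="$IMGS" \
--sample_dir="$SAMPLES" \
--model_dir="$MODELS" \
--depth=9 \
--batch_size=3 \
--spoofing_factor=1
# point 4: compare two consecutively logged grids (placeholder filenames below);
# identical hashes suggest gradients are not being propagated
md5sum "$SAMPLES"/earlier_sample.png "$SAMPLES"/later_sample.png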
Hope this helps. Thanks a lot :smile:!
cheers :beers:! @akanimax
@akanimax I've also been running this on one GPU with the following settings, and it seems to be training.
conda create -n bbmsg python=3.5.6 -y
source activate bbmsg
conda install pytorch=1.0.0 torchvision cuda100 cudatoolkit=10.0 -c pytorch -y
pip install tensorboardX tqdm scipy==1.2.0 tensorboard
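As a quick sanity check that the environment actually resolved to these versions (standard version attributes only, nothing specific to this repo):
python -c "import sys; print(sys.version)"
python -c "import torch, torchvision, scipy; print(torch.__version__, torchvision.__version__, scipy.__version__)"
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"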
python BBMSG-GAN/sourcecode/train.py \
--images_dir="$IMGS" \
--sample_dir="$SAMPLES" \
--model_dir="$MODELS" \
--depth=9 \
--batch_size=3 \
--spoofing_factor=2 \
--num_samples=16 \
--feedback_factor=3 \
--checkpoint_factor=10 \
--num_epochs=5000 \
--num_workers=6 \
--log_fid_values=True \
--fid_temp_folder=/tmp/fid_tmp \
--fid_real_stats="$FID" \
--fid_batch_size=8 \
--num_fid_images=5000
@tomasheiskanen, Alright, so both of my repositories are basically failing in multi-GPU settings. :laughing: :rofl: Damn! I need to buckle up.... bug-fixing lies ahead.
cheers :beers:! @akanimax
I actually tried it on my single GPU and it also didn't progress. The original BMSG-GAN gives good results with the same dataset and parameters.
@brunovianna,
It is a petty problem, actually: the default spoofing_factor is set to a very high value, and that causes the training to collapse. I really should fix it, right? But procrastination gets the better of me :laughing:!
Could you try lowering the spoofing_factor?
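For example (a sketch only: keep whatever other flags you were already using; the ones below are just copied from @tomasheiskanen's single-GPU run above, and spoofing_factor=1 disables spoofing entirely):
python BBMSG-GAN/sourcecode/train.py \
--images_dir="$IMGS" \
--sample_dir="$SAMPLES" \
--model_dir="$MODELS" \
--depth=9 \
--batch_size=3 \
--spoofing_factor=1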
cheers :beers:! @akanimax