
Training not progressing 1024x1024 on multigpu

Open · tomasheiskanen opened this issue on Oct 15 '19 · 5 comments

I ran for a few hours on 8 GPUs without any progress. Each sample is a pixel-wise copy of the others in all layers.

4x4 image

64x64 image
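
A quick way to verify the collapse (a hypothetical snippet, not code from the repo) is to compare the samples within a generated batch at one resolution:

# Hypothetical check: confirm that all samples in a generated batch are
# exact pixel-wise copies of each other.
import torch

def batch_is_collapsed(samples, tol=0.0):
    # samples: (N, C, H, W) tensor of generated images at one resolution
    diff = (samples - samples[0:1]).abs().max()  # largest deviation from the first sample
    return bool(diff <= tol)

For the batches shown above this returns True at every output resolution.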

setup

git clone git@github.com:akanimax/BBMSG-GAN.git
conda create -n bbmsg python==3.7
conda activate bbmsg
conda install pytorch torchvision cudatoolkit=10.0 cudnn scipy==1.2.0 tensorboard -c pytorch
pip install tensorboardX tqdm

training

# calc real fid stats before training
python ../BBMSG-GAN/sourcecode/train.py \
    --images_dir="$IMGS" \
    --sample_dir="$SAMPLES" \
    --model_dir="$MODELS" \
    --depth=9 \
    --batch_size=24 \
    --num_samples=36 \
    --feedback_factor=5 \
    --checkpoint_factor=1 \
    --num_epochs=50000 \
    --num_workers=90 \
    --log_fid_values=True \
    --fid_temp_folder=/tmp/fid_tmp \
    --fid_real_stats="$FID" \
    --fid_batch_size=64 \
    --num_fid_images=5000

tomasheiskanen · Oct 15 '19 22:10

@tomasheiskanen, Thanks a lot for the detailed description of the issue. This seems to be an intricate bug. I'll have to reproduce the same problem in order to gauge what exactly is going wrong. Here are a few things you could try until then:

1.) The implementation was made using Python 3.5.6 and PyTorch 1.0.0 (I should mention that in the readme).
2.) Could you try turning off the spoofing, i.e. setting spoofing_factor=1?
3.) Also try running on a single GPU and see if that works; maybe this is a multi-GPU issue.
4.) Check whether two consecutive logged images have at least some difference. If they are exact copies of each other, then gradients are not being propagated properly (see the gradient-check sketch after this list). I admit this is a lot of work :smile:. But I'll definitely also find out what went wrong; I am unfortunately getting less time these days to address problems like this. Please feel free to open a PR if you find the bug.
5.) If it is not too much: could you also try BMSG-GAN? The way of running it is the same as this repo, just with fewer options such as the spoofing factor.
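
For point 4, a minimal generic PyTorch sketch (not code from this repo; the model and attribute names are placeholders) to check whether gradients actually reach the generator after a backward pass:

# Hypothetical debugging helper: print the gradient norm of every generator
# parameter right after the generator loss has been backpropagated.
# All-zero (or None) gradients mean the updates are not being propagated.
def report_gradients(model, name="gen"):
    for pname, param in model.named_parameters():
        grad = param.grad
        norm = 0.0 if grad is None else grad.norm().item()
        print("{}.{}: grad_norm = {:.3e}".format(name, pname, norm))

# usage (placeholder attribute name): report_gradients(msg_gan.gen) right after gen_loss.backward()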

Hope this helps. Thanks a lot :smile:!

cheers :beers:! @akanimax

akanimax · Oct 16 '19 07:10

@akanimax I've been running this on one GPU as well with the following settings, and it seems to be training.

conda create -n bbmsg python=3.5.6 -y
source activate bbmsg
conda install pytorch=1.0.0 torchvision cuda100 cudatoolkit=10.0 -c pytorch -y
pip install tensorboardX tqdm scipy==1.2.0 tensorboard
python BBMSG-GAN/sourcecode/train.py \
    --images_dir="$IMGS" \
    --sample_dir="$SAMPLES" \
    --model_dir="$MODELS" \
    --depth=9 \
    --batch_size=3 \
    --spoofing_factor=2 \
    --num_samples=16 \
    --feedback_factor=3 \
    --checkpoint_factor=10 \
    --num_epochs=5000 \
    --num_workers=6 \
    --log_fid_values=True \
    --fid_temp_folder=/tmp/fid_tmp \
    --fid_real_stats="$FID" \
    --fid_batch_size=8 \
    --num_fid_images=5000

tomasheiskanen · Oct 23 '19 16:10

@tomasheiskanen, Alright, so both of my repositories are basically failing in multi-GPU settings. :laughing: :rofl: Damn! I need to buckle up... bug-fixing lies ahead.

cheers :beers:! @akanimax

akanimax · Oct 24 '19 11:10

I actually tried it on my single GPU and it also didn't progress. The original BMSG-GAN gives good results with the same dataset and parameters.

brunovianna · Feb 13 '20 17:02

@brunovianna, It is actually a petty problem. The default spoofing_factor is set to a very high value, and that causes the training to collapse. I really should fix it, right? But procrastination gets the better of me :laughing:! Could you try lowering the spoofing_factor?
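
If you want to experiment, a hypothetical sweep like this (the image path and values are placeholders; the flags are the same ones used in the commands above) might help find a spoofing_factor that stays stable:

# Hypothetical sweep over small spoofing_factor values; the image path is a placeholder.
import subprocess

for sf in (1, 2, 4):
    subprocess.run([
        "python", "BBMSG-GAN/sourcecode/train.py",
        "--images_dir=/path/to/images",
        "--sample_dir=samples_sf{}".format(sf),
        "--model_dir=models_sf{}".format(sf),
        "--depth=9",
        "--batch_size=3",
        "--spoofing_factor={}".format(sf),
        "--num_epochs=5000",
    ], check=True)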

cheers :beers:! @akanimax

akanimax · Feb 13 '20 17:02