vq-vae-2-pytorch
[train_vqvae] multiple GPUs don't seem to work as expected
Thank you for sharing this great code.
I have used InfoVAE as a substitute for beta-VAE and the traditional VAE (beta=1). However, I think your VQ-VAE-2 reconstructs images better.
Unfortunately, when I used multiple GPUs
#SBATCH --gres=gpu:4
python /people/kimd999/script/python/cryoEM/vq-vae-2-pytorch/train_vqvae.py /people/kimd999/MARScryo/dn/data/full/PDX/coexp/input --size 256 --n_gpu 4
it reconstructed images poorly (blank images) and didn't reduce the MSE much (MSE 0.01311 after 32 epochs).
Meanwhile, using a single GPU
#SBATCH --gres=gpu:1
python /people/kimd999/script/python/cryoEM/vq-vae-2-pytorch/train_vqvae.py /people/kimd999/MARScryo/dn/data/full/PDX/coexp/input --size 256
reconstructed images better (almost identical to the input images) and reduced the MSE further (MSE 0.00583 after 12 epochs).
Consequently, 1 GPU effectively "trains faster" in terms of quality per wall-clock time, even though each epoch takes longer (4 hr/epoch vs. 1.4 hr/epoch with 4 GPUs).
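One possible explanation (just a guess on my part, not confirmed from the repo's code) is that the vector quantizer's EMA codebook statistics are updated from each replica's local batch only; if those usage counts aren't all-reduced across processes (as `torch.distributed.all_reduce` would do), each replica's codebook drifts apart and reconstructions degrade. A plain-Python toy sketch of the idea, with hypothetical numbers:

```python
# Toy sketch: why per-replica EMA codebook updates can diverge when the
# code-usage counts are not summed across GPUs first.
# All names and numbers here are illustrative, not the repo's actual code.

DECAY = 0.99

def ema_update(old, new, decay=DECAY):
    """One exponential-moving-average step for a codebook statistic."""
    return decay * old + (1 - decay) * new

# Each GPU sees different data, so local code-usage counts differ.
local_counts = {"gpu0": [10.0, 0.0], "gpu1": [0.0, 10.0]}

# Wrong: each replica updates its own codebook from local counts only.
drifted = {
    gpu: [ema_update(1.0, c) for c in counts]
    for gpu, counts in local_counts.items()
}

# Right: sum counts across replicas first (what an all_reduce provides),
# so every replica applies the identical update.
summed = [sum(c) for c in zip(*local_counts.values())]
synced = [ema_update(1.0, c) for c in summed]

print(drifted)  # replicas disagree per code
print(synced)   # identical on every replica
```

If this is the cause, the fix would be to all-reduce the cluster-size and embedding-sum buffers inside the quantizer before the EMA step when `torch.distributed` is initialized.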
I wonder whether you have experienced anything like this as well?
I haven't seen that kind of problem. Both distributed and single-GPU training give similar results, I think.