Training on ImageNet 64x64

Open gulperii opened this issue 4 years ago • 1 comment

Hello,

I am using ImageNet 64x64 and running the code with the following command:

python BigGAN-PyTorch/train.py --dataset I64_hdf5 --parallel --shuffle --num_workers 8 --batch_size 128 --num_G_accumulations 1 --num_D_accumulations 1 --num_D_steps 1 --G_lr 1e-4 --D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 --G_attn 32 --D_attn 32 --G_nl relu --D_nl relu --SN_eps 1e-8 --BN_eps 1e-5 --adam_eps 1e-8 --G_ortho 0.0 --G_init xavier --D_init xavier --G_eval_mode --G_ch 32 --D_ch 32 --ema --use_ema --ema_start 2000 --test_every 5000 --save_every 1000 --num_best_copies 5 --num_save_copies 2 --seed 0 --which_best FID --num_iters 200000 --num_epochs 1000 --embedding inceptionv3 --density_measure gaussian --retention_ratio 50

and getting this error:

File "train.py", line 229, in <module>
    main()
File "train.py", line 226, in main
    run(config)
File "train.py", line 184, in run
    metrics = train(x, y)
File "/BigGAN-PyTorch/train_fns.py", line 42, in train
    split_D=config['split_D'])
File "/miniconda3/envs/biggan2-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
File "/miniconda3/envs/biggan2-env/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 140, in forward
    return self.module(*inputs, **kwargs)
File "/miniconda3/envs/biggan2-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
File "/BigGAN-PyTorch/BigGAN.py", line 443, in forward
    D_out = self.D(D_input, D_class)
File "/miniconda3/envs/biggan2-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
File "/BigGAN-PyTorch/BigGAN.py", line 403, in forward
    out = out + torch.sum(self.embed(y) * h, 1, keepdim=True)
RuntimeError: CUDA error: device-side assert triggered

Interestingly, when I create a "mini dataset" by randomly selecting 500 images per label from the original ImageNet dataset, the code runs fine. What could be the problem, and how can I solve it?

May 21 '21 18:05 gulperii

This is quite strange; I haven't seen this behaviour before. Is it possible that self.embed(y) is receiving label values outside the valid range, i.e. greater than or equal to the number of classes in the dataset (or negative)? That is a particularly common failure case that produces this error.
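One way to check this before training is to scan the labels for out-of-range values. The snippet below is only a minimal sketch: find_out_of_range_labels is a hypothetical helper (not part of BigGAN-PyTorch), and it assumes the labels can be iterated as plain integers, for example after reading them from the HDF5 file. An embedding table with n_classes rows only accepts indices from 0 to n_classes - 1.

```python
def find_out_of_range_labels(labels, n_classes):
    """Return (position, value) pairs for labels outside [0, n_classes - 1].

    Hypothetical helper for illustration; not part of BigGAN-PyTorch.
    """
    bad = []
    for i, y in enumerate(labels):
        y = int(y)  # also accepts NumPy or tensor scalars
        if y < 0 or y >= n_classes:
            bad.append((i, y))
    return bad


# Made-up example: 1000 classes, but one label is 1000 (off by one).
labels = [0, 17, 999, 1000, 3]
print(find_out_of_range_labels(labels, n_classes=1000))  # [(3, 1000)]
```

If this returns anything for the full dataset but nothing for the mini dataset, the labels in the full HDF5 file are the likely culprit.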

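Otherwise, you could try running with the environment variable CUDA_LAUNCH_BLOCKING=1 set (if you haven't already) for a more informative stack trace. It makes CUDA kernel launches synchronous, so the trace points at the operation that actually failed.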
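CUDA_LAUNCH_BLOCKING is normally set in the shell by prefixing the command (CUDA_LAUNCH_BLOCKING=1 python ...). As an alternative sketch, it can be set from Python, assuming the lines are placed at the very top of the entry script so they run before CUDA is initialised:

```python
import os

# Set this before importing torch (or anything else that initialises CUDA),
# otherwise the setting may not take effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Sanity check that the variable is visible to this process.
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Training will be slower with this setting, so it is best removed once the failing operation has been identified.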

May 22 '21 11:05 TDeVries