CLD-SGM

inf Loss

qsh-zh opened this issue 2 years ago · 6 comments

Thanks for open-sourcing the awesome repo. However, I am running experiments and consistently get an inf loss no matter how I change the hyperparameters (even setting the learning rate to zero or reducing the model size).

To reproduce

# clone repo and setup python and deps
python main.py -cc configs/default_cifar10.txt -sc configs/specific_cifar10.txt --root $(pwd) --mode train --workdir logs/debug --n_gpus_per_node 1 --training_batch_size 64 --testing_batch_size 64 --sampling_batch_size 64 --log_freq 1
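In case it helps with debugging, here is the kind of check that could be dropped into the training step to catch the first non-finite loss. This is a generic PyTorch sketch, not code from this repo; model, batch, and compute_loss are placeholder names.

import torch

def assert_loss_finite(loss, step):
    # Stop immediately when the training loss becomes non-finite.
    if not torch.isfinite(loss).all():
        raise FloatingPointError(f"Non-finite loss {loss.item()} at step {step}")

# Usage inside a generic training step (placeholder names):
# loss = compute_loss(model, batch)
# assert_loss_finite(loss, step)
# loss.backward()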

Env

PyTorch version: 1.8.1+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 8.4.0-3ubuntu2) 8.4.0
Clang version: Could not collect
CMake version: version 3.16.3

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
GPU 3: NVIDIA RTX A6000

Nvidia driver version: 510.60.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] pytorch-fid==0.2.1
[pip3] torch==1.8.1+cu111
[pip3] torchdiffeq==0.2.3
[pip3] torchvision==0.9.1+cu111
[conda] torch                     1.8.1+cu111              pypi_0    pypi
[conda] torchdiffeq               0.2.3                    pypi_0    pypi
[conda] torchvision               0.9.1+cu111              pypi_0    pypi

Output

WARNING - module_wrapper.py - 2022-05-04 07:51:58,366 - From /home/qzhang419/anaconda3/envs/cld/lib/python3.8/site-packages/tensorflow_gan/python/estimator/tpu_gan_estimator.py:42: The name tf.estimator.tpu.TPUEstimator is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimator instead.

INFO - run_lib.py - 2022-05-04 07:51:58,368 - Namespace(attention_type='ddpm', attn_resolutions='16', autocast_eval=True, autocast_train=True, beta0=4.0, beta1=0.0, beta_type='linear', cc='configs/default_cifar10.txt', center_image=True, ch_mult='1,2,2,2', checkpoint=None, ckpt_file=None, cld_objective='hsm', cont_nbr=None, conv_size=3, data_dim=None, data_location=None, dataset='cifar10', denoising=True, device=device(type='cuda', index=0), distributed=True, dropout=0.1, ema_rate=0.9999, embedding_type='fourier', eval_density=False, eval_density_npts=101, eval_fid=False, eval_fid_samples=50000, eval_folder=None, eval_freq=20000, eval_hist_samples=100000, eval_iw_likelihood=False, eval_jacobian_norm=False, eval_likelihood=False, eval_loss=False, eval_loss_variance=False, eval_loss_variance_images=1, eval_sample=False, eval_sample_hist=False, eval_sample_samples=1, eval_seed=0, eval_threshold=1, fid_freq=50000, fid_samples_training=20000, fid_threshold=100000, fir_kernel='1,3,3,1', fourier_scale=16, gamma=0.04, global_rank=0, global_size=1, grad_clip=1.0, image_channels=3, image_size=32, init_scale=0.0, is_image=True, learning_rate=0.0002, likelihood_atol=1e-05, likelihood_eps=1e-05, likelihood_freq=50000, likelihood_hutchinson_type='rademacher', likelihood_rtol=1e-05, likelihood_solver='scipy_solver', likelihood_solver_options={'solver': 'RK45'}, likelihood_threshold=2000000, local_rank=0, log_freq=1, loss_eps=1e-05, m_inv=4.0, master_address='127.0.0.1', master_port=6020, mixed_score=True, mode='train', n_channels=128, n_discrete_steps=None, n_eval_batches=1, n_gpus_per_node=1, n_likelihood_batches=1, n_nodes=1, n_resblocks=8, n_train_iters=800000, n_warmup_iters=100000, name='ncsnpp', node_rank=0, nonlinearity='swish', normalization='GroupNorm', numerical_eps=1e-09, optimizer='Adam', overwrite=False, progressive='none', progressive_combine='sum', progressive_input='residual', resamp_with_conv=True, resblock_type='biggan', root='/tmp/CLD-SGM', sampling_atol=1e-05, sampling_batch_size=64, sampling_eps=0.001, sampling_method='ode', sampling_rtol=1e-05, sampling_solver='scipy_solver', sampling_solver_options={'solver': 'RK45'}, save_freq=50000, save_threshold=300000, sc='configs/specific_cifar10.txt', sde='cld', seed=0, skip_rescale=True, snapshot_freq=10000, snapshot_threshold=1, sscs_num_stab=0.0, striding='linear', testing_batch_size=64, training_batch_size=64, use_fir=True, weight_decay=0.0, weighting='reweightedv2', workdir='logs/debug')
INFO - run_lib.py - 2022-05-04 07:52:02,146 - Number of trainable parameters in model: 107593859
INFO - run_lib.py - 2022-05-04 07:52:03,269 - Number of total iterations: 800000
INFO - resolver.py - 2022-05-04 07:52:03,379 - Using /tmp/tfhub_modules to cache modules.
INFO - run_lib.py - 2022-05-04 07:52:10,876 - Starting training at step 0
INFO - run_lib.py - 2022-05-04 07:52:13,258 - Iter 1/800000 Loss: inf Time: 1.815
INFO - distributed.py - 2022-05-04 07:52:13,280 - Reducer buckets have been rebuilt in this iteration.
INFO - run_lib.py - 2022-05-04 07:52:13,710 - Iter 2/800000 Loss: inf Time: 0.451
INFO - run_lib.py - 2022-05-04 07:52:14,115 - Iter 3/800000 Loss: inf Time: 0.403
INFO - run_lib.py - 2022-05-04 07:52:14,510 - Iter 4/800000 Loss: inf Time: 0.394
INFO - run_lib.py - 2022-05-04 07:52:14,908 - Iter 5/800000 Loss: inf Time: 0.397
INFO - run_lib.py - 2022-05-04 07:52:15,304 - Iter 6/800000 Loss: inf Time: 0.395
INFO - run_lib.py - 2022-05-04 07:52:15,722 - Iter 7/800000 Loss: inf Time: 0.417
INFO - run_lib.py - 2022-05-04 07:52:16,128 - Iter 8/800000 Loss: inf Time: 0.405
INFO - run_lib.py - 2022-05-04 07:52:16,527 - Iter 9/800000 Loss: inf Time: 0.398
INFO - run_lib.py - 2022-05-04 07:52:16,925 - Iter 10/800000 Loss: inf Time: 0.396
INFO - run_lib.py - 2022-05-04 07:52:17,324 - Iter 11/800000 Loss: inf Time: 0.399
INFO - run_lib.py - 2022-05-04 07:52:17,723 - Iter 12/800000 Loss: inf Time: 0.397
INFO - run_lib.py - 2022-05-04 07:52:18,118 - Iter 13/800000 Loss: inf Time: 0.394
INFO - run_lib.py - 2022-05-04 07:52:18,513 - Iter 14/800000 Loss: inf Time: 0.394
INFO - run_lib.py - 2022-05-04 07:52:18,925 - Iter 15/800000 Loss: inf Time: 0.411
INFO - run_lib.py - 2022-05-04 07:52:19,328 - Iter 16/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:19,728 - Iter 17/800000 Loss: inf Time: 0.399
INFO - run_lib.py - 2022-05-04 07:52:20,132 - Iter 18/800000 Loss: inf Time: 0.403
INFO - run_lib.py - 2022-05-04 07:52:20,532 - Iter 19/800000 Loss: inf Time: 0.399
INFO - run_lib.py - 2022-05-04 07:52:20,929 - Iter 20/800000 Loss: inf Time: 0.396
INFO - run_lib.py - 2022-05-04 07:52:21,330 - Iter 21/800000 Loss: inf Time: 0.400
INFO - run_lib.py - 2022-05-04 07:52:21,730 - Iter 22/800000 Loss: inf Time: 0.400
INFO - run_lib.py - 2022-05-04 07:52:22,132 - Iter 23/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:22,532 - Iter 24/800000 Loss: inf Time: 0.399
INFO - run_lib.py - 2022-05-04 07:52:22,935 - Iter 25/800000 Loss: inf Time: 0.402
INFO - run_lib.py - 2022-05-04 07:52:23,336 - Iter 26/800000 Loss: inf Time: 0.400
INFO - run_lib.py - 2022-05-04 07:52:23,736 - Iter 27/800000 Loss: inf Time: 0.399
INFO - run_lib.py - 2022-05-04 07:52:24,136 - Iter 28/800000 Loss: inf Time: 0.398
INFO - run_lib.py - 2022-05-04 07:52:24,543 - Iter 29/800000 Loss: inf Time: 0.406
INFO - run_lib.py - 2022-05-04 07:52:24,945 - Iter 30/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:25,347 - Iter 31/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:25,749 - Iter 32/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:26,147 - Iter 33/800000 Loss: inf Time: 0.397
INFO - run_lib.py - 2022-05-04 07:52:26,548 - Iter 34/800000 Loss: inf Time: 0.400
INFO - run_lib.py - 2022-05-04 07:52:26,950 - Iter 35/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:27,351 - Iter 36/800000 Loss: inf Time: 0.400

qsh-zh commented May 04 '22 07:05

Hi Qinsheng,

Thanks for reaching out. That's actually super weird. I have the same library versions and it works fine for me.

I don't have any great suggestions, but maybe you can try turning off autocast and see if that makes a difference. You can do so by setting

autocast_train = false
autocast_eval = false

in the config file (configs/default_cifar10.txt).
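To illustrate why autocast can matter here: under torch.cuda.amp.autocast, matmuls and convolutions run in fp16, whose maximum representable value is about 65504, so intermediate results that are perfectly fine in fp32 can overflow to inf. A small, self-contained illustration (generic PyTorch, not code from this repo):

import torch

x = torch.full((64, 64), 300.0, device="cuda")

with torch.cuda.amp.autocast():
    y_amp = x @ x   # matmul runs in fp16 under autocast; the result exceeds the fp16 range
y_fp32 = x @ x      # the same matmul in fp32 stays finite

print(torch.isfinite(y_amp).all().item())   # False (overflowed to inf)
print(torch.isfinite(y_fp32).all().item())  # True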

timudk commented May 09 '22 19:05

Hi Qinsheng,

Have you solved the inf loss problem? Could you share the solution? I have run into the same problem.

ShiZiqiang commented Jun 23 '22 08:06

@ShiZiqiang Unfortunately no, I could not fix the issue. I gave up on PyTorch and used JAX instead. Maybe you are interested in this repo.

qsh-zh commented Jun 23 '22 14:06

> @ShiZiqiang Unfortunately no, I could not fix the issue. I gave up on PyTorch and used JAX instead. Maybe you are interested in this repo.

Hi, Qinsheng,

Thank you so much. I will try your awesome gDDIM and DEIS.

ShiZiqiang commented Jun 24 '22 00:06

> @ShiZiqiang Unfortunately no, I could not fix the issue. I gave up on PyTorch and used JAX instead. Maybe you are interested in this repo.

Hi, Qinsheng, I cannot wait to see your gDDIM repo. Please release it soon! O(∩_∩)O

ShiZiqiang commented Jun 24 '22 01:06

I get the inf loss on a 3090, but not on a 3070 Ti or a V100.
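If the inf only shows up on some cards, it may be worth comparing one forward pass with autocast on and off on the affected GPU. A minimal sketch of such a comparison (model and batch are placeholders, not code from this repo):

import torch

def loss_is_finite(model, batch, use_amp):
    # Run the same batch with autocast enabled or disabled and report
    # whether the resulting loss is finite.
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = model(batch).mean()   # stand-in for the actual loss computation
    return torch.isfinite(loss).item()

# print("amp:", loss_is_finite(model, batch, True))
# print("fp32:", loss_is_finite(model, batch, False))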

yanjingke commented Oct 30 '22 08:10