
ValueError: cannot convert float NaN to integer

Open • Lucieno opened this issue 5 years ago • 4 comments

I am trying to run the VGG16LP example on the CIFAR10 dataset, but the following error occurs: `ValueError: cannot convert float NaN to integer`. Any help would be greatly appreciated. The details of the run, the error, and the environment are shown below.

(swalp_cuda9) ➜  SWALP git:(master) ✗ seed=100                                      # Specify experiment seed.
bash exp/block_vgg_swa.sh CIFAR10 ${seed}     # SWALP training on VGG16 with Small-block BFP in CIFAR10

Checkpoint directory ./checkpoint/block-CIFAR10-VGG16LP/seed100
Tensorboard loggint at runs/block-CIFAR10-VGG16LP/seed100_08_10_14_16
Prepare data loaders:
Loading dataset CIFAR10 from .
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Prepare quantizers:
Block rounding, W:8, A:8, G:8, E:8, Acc:8
lr init: 0.05
swa start: 200.0 swa lr: 0.01
Model: VGG16LP
Prepare SWA training
Traceback (most recent call last):
  File "train.py", line 189, in <module>
    quantize_momentum=args.quantize_momentum)
  File "/home/user/git/SWALP/utils.py", line 48, in train_batch
    output = model(input_var)
  File "/home/user/anaconda2/envs/swalp_cuda9/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/git/SWALP/models/vgg_low.py", line 69, in forward
    x = self.features(x)
  File "/home/user/anaconda2/envs/swalp_cuda9/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/anaconda2/envs/swalp_cuda9/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/user/anaconda2/envs/swalp_cuda9/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/git/SWALP/models/quantizer.py", line 103, in forward
    self.small_block, self.block_dim)
  File "/home/user/git/SWALP/models/quantizer.py", line 76, in forward
    return block_quantize(x, forward_bits, self.mode, small_block=self.small_block, block_dim=self.block_dim)
  File "/home/user/git/SWALP/models/quantizer.py", line 42, in block_quantize
    max_exponent = math.floor(math.log2(max_entry))
ValueError: cannot convert float NaN to integer
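
For context, the exception itself is just `math.floor` receiving NaN: `math.log2` passes the NaN through, and `math.floor` then raises. Two lines reproduce it:

```python
import math

math.floor(math.log2(float("nan")))  # ValueError: cannot convert float NaN to integer
```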

I tried printing max_entry right before it computes max_exponent:

max_entry: 2.3815226554870605
max_entry: 2.215369701385498
max_entry: 1.9265875816345215
max_entry: 1.8378633260726929
max_entry: 1.3576314449310303
max_entry: 1.2682085037231445
...
max_entry: 1.4677644968032837
max_entry: 1.256148099899292
max_entry: 1.4361257553100586
max_entry: 1.4105850458145142
max_entry: 0.8756170272827148
max_entry: 0.8310933113098145
max_entry: nan
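
So the activations already contain NaN before quantization. As a sketch (my own code, not from the repo; the real `block_quantize` internals may differ), a guard like this would at least fail with a clearer message than the `math.floor` call:

```python
import math

import torch

def checked_max_exponent(x: torch.Tensor) -> int:
    # Shared-exponent computation with an explicit NaN/zero guard.
    max_entry = x.abs().max().item()
    if math.isnan(max_entry):
        raise RuntimeError("NaN reached block_quantize; training likely diverged upstream")
    if max_entry == 0:
        return 0  # arbitrary but safe exponent for an all-zero block
    return math.floor(math.log2(max_entry))
```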

Environment:

(swalp_cuda9) ➜  SWALP git:(master) conda list
# packages in environment at /home/user/anaconda2/envs/swalp_cuda9:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
blas                      1.0                         mkl
ca-certificates           2019.5.15                     1
certifi                   2019.6.16                py36_1
cffi                      1.12.3           py36h2e261b9_0
cuda90                    1.0                  h6433d27_0    pytorch
cudatoolkit               10.0.130                      0
freetype                  2.9.1                h8a8886c_1
intel-openmp              2019.4                      243
jpeg                      9b                   h024ee3a_2
libedit                   3.1.20181209         hc058e9b_0
libffi                    3.2.1                hd88cf55_4
libgcc-ng                 9.1.0                hdf63c60_0
libgfortran-ng            7.3.0                hdf63c60_0
libpng                    1.6.37               hbc83047_0
libstdcxx-ng              9.1.0                hdf63c60_0
libtiff                   4.0.10               h2733197_2
mkl                       2019.4                      243
mkl_fft                   1.0.12           py36ha843d7b_0
mkl_random                1.0.2            py36hd81dba3_0
ncurses                   6.1                  he6710b0_1
ninja                     1.9.0            py36hfd86e86_0
numpy                     1.16.4           py36h7e9f1db_0
numpy-base                1.16.4           py36hde5b4d6_0
olefile                   0.46                     py36_0
openssl                   1.1.1c               h7b6447c_1
pillow                    6.1.0            py36h34e0f95_0
pip                       19.1.1                   py36_0
protobuf                  3.9.1                    pypi_0    pypi
pycparser                 2.19                     py36_0
python                    3.6.9                h265db76_0
pytorch                   1.2.0           py3.6_cuda10.0.130_cudnn7.6.2_0    pytorch
readline                  7.0                  h7b6447c_5
setuptools                41.0.1                   py36_0
six                       1.12.0                   pypi_0    pypi
sqlite                    3.29.0               h7b6447c_0
tabulate                  0.8.3                    pypi_0    pypi
tensorboardx              1.8                      pypi_0    pypi
tk                        8.6.8                hbc83047_0
torchvision               0.4.0                py36_cu100    pytorch
wheel                     0.33.4                   py36_0
xz                        5.2.4                h14c3975_4
zlib                      1.2.11               h7b6447c_3
zstd                      1.3.7                h0b5b093_0

OS: Ubuntu 18.04. GPU: GeForce GTX 1060 3GB.

Lucieno avatar Aug 10 '19 06:08 Lucieno

@Lucieno @stevenygd I ran into the same problem after ten epochs. Does this mean the training blew up, or is it an overflow?

Checkpoint directory ./checkpoints/block-CIFAR10-PreResNet164LP/seed200
Tensorboard loggint at checkpoints/block-CIFAR10-PreResNet164LP/seed200_10_15_18_42
Prepare data loaders:
Loading dataset CIFAR10 from ../data/
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Prepare quantizers:
Block rounding, W:8, A:8, G:8, E:8, Acc:8
lr init: 0.1
swa start: 150.0 swa lr: 0.01
Model: PreResNet164LP
Prepare SWA training
----  --------  ---------  --------  ---------  --------  ------------  --------
  ep        lr    tr_loss    tr_acc    te_loss    te_acc  swa_te_acc        time
----  --------  ---------  --------  ---------  --------  ------------  --------
   1    0.1000     1.6607   37.8340     1.3099   52.2000                440.3035
   2    0.1000     1.1418   58.9820     1.0026   64.3400                438.9215
   3    0.1000     0.9461   66.0900     0.9486   67.0200                442.0348
   4    0.1000     0.8427   70.4760     0.7507   74.1500                443.9436
   5    0.1000     0.7321   74.4340     0.7454   74.2900                461.8002
   6    0.1000     0.6698   76.7520     0.6293   78.5400                508.3268
   7    0.1000     0.6212   78.5380     0.5772   80.2200                507.5976
   8    0.1000     0.5911   79.5060     0.5712   80.1700                509.4609
   9    0.1000     0.5713   80.1780     0.6028   79.0300                508.7512
  10    0.1000     0.5492   81.0900     0.5643   80.5200                508.5259
  11    0.1000     0.5317   81.5480     0.5733   80.5200                504.3482
Traceback (most recent call last):
  File "swa_cifar.py", line 189, in <module>
    quantize_momentum=args.quantize_momentum)
  File "/home/jmlu/Worksapce/ConvNet_Fxp_2.0/utils.py", line 48, in train_batch
    output = model(input_var)
  File "/home/jmlu/anaconda3/envs/ML/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jmlu/Worksapce/ConvNet_Fxp_2.0/models/preresnet_low.py", line 152, in forward
    x = self.layer3(x)  # 8x8
  File "/home/jmlu/anaconda3/envs/ML/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jmlu/anaconda3/envs/ML/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/jmlu/anaconda3/envs/ML/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jmlu/Worksapce/ConvNet_Fxp_2.0/models/preresnet_low.py", line 91, in forward
    out = self.quant(out)
  File "/home/jmlu/anaconda3/envs/ML/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jmlu/Worksapce/ConvNet_Fxp_2.0/models/quantizer.py", line 103, in forward
    self.small_block, self.block_dim)
  File "/home/jmlu/Worksapce/ConvNet_Fxp_2.0/models/quantizer.py", line 76, in forward
    return block_quantize(x, forward_bits, self.mode, small_block=self.small_block, block_dim=self.block_dim)
  File "/home/jmlu/Worksapce/ConvNet_Fxp_2.0/models/quantizer.py", line 42, in block_quantize
    max_exponent = math.floor(math.log2(max_entry + 1e-32))
ValueError: cannot convert float NaN to integer
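
Note that the `+ 1e-32` in my copy of line 42 only guards against `log2(0)`; a NaN in the tensor still comes out of `max()` (Lucieno's printout above shows `max_entry: nan`) and survives the addition, so the NaN must already be present in the activations, i.e. the run really is diverging upstream. A rough way to locate the first module whose output turns NaN (a sketch, not code from this repo) is to register forward hooks before the failing epoch:

```python
import torch

def attach_nan_hooks(model: torch.nn.Module) -> None:
    # Raise as soon as any module produces an output containing NaN.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                raise RuntimeError(f"NaN first appeared in the output of: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))
```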

jmluu avatar Oct 15 '19 12:10 jmluu

+1 to this problem. I'm having the same issue when trying to run the code.

vmelement avatar Feb 13 '20 15:02 vmelement

Install the correct dependencies; they are stated in the README. MAKE SURE to install PyTorch 1.0.1 and not 1.0.0.
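
For reference, pinning it explicitly should look something like `conda install pytorch==1.0.1 torchvision==0.2.2 cudatoolkit=9.0 -c pytorch` (the exact torchvision/CUDA build depends on your setup, so double-check against the README). Note that the `conda list` output above shows `pytorch 1.2.0`, which does not match.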

Nader-Merai avatar Apr 16 '20 14:04 Nader-Merai

Hello, so how did you solve this problem? Is there any way to work around it?

smsskil avatar Dec 20 '20 10:12 smsskil