SWALP
ValueError: cannot convert float NaN to integer
I am trying to run the example VGG16LP with dataset CIFAR10.
However, the following error occurs:
ValueError: cannot convert float NaN to integer
It would be greatly appreciated if anyone can help.
The details of the execution, the error, and the environment are shown below.
(swalp_cuda9) ➜ SWALP git:(master) ✗ seed=100 # Specify experiment seed.
bash exp/block_vgg_swa.sh CIFAR10 ${seed} # SWALP training on VGG16 with Small-block BFP in CIFAR10
Checkpoint directory ./checkpoint/block-CIFAR10-VGG16LP/seed100
Tensorboard logging at runs/block-CIFAR10-VGG16LP/seed100_08_10_14_16
Prepare data loaders:
Loading dataset CIFAR10 from .
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Prepare quantizers:
Block rounding, W:8, A:8, G:8, E:8, Acc:8
lr init: 0.05
swa start: 200.0 swa lr: 0.01
Model: VGG16LP
Prepare SWA training
Traceback (most recent call last):
File "train.py", line 189, in <module>
quantize_momentum=args.quantize_momentum)
File "/home/user/git/SWALP/utils.py", line 48, in train_batch
output = model(input_var)
File "/home/user/anaconda2/envs/swalp_cuda9/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/git/SWALP/models/vgg_low.py", line 69, in forward
x = self.features(x)
File "/home/user/anaconda2/envs/swalp_cuda9/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/anaconda2/envs/swalp_cuda9/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/user/anaconda2/envs/swalp_cuda9/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/git/SWALP/models/quantizer.py", line 103, in forward
self.small_block, self.block_dim)
File "/home/user/git/SWALP/models/quantizer.py", line 76, in forward
return block_quantize(x, forward_bits, self.mode, small_block=self.small_block, block_dim=self.block_dim)
File "/home/user/git/SWALP/models/quantizer.py", line 42, in block_quantize
max_exponent = math.floor(math.log2(max_entry))
ValueError: cannot convert float NaN to integer
I tried printing max_entry right before it computes max_exponent:
max_entry: 2.3815226554870605
max_entry: 2.215369701385498
max_entry: 1.9265875816345215
max_entry: 1.8378633260726929
max_entry: 1.3576314449310303
max_entry: 1.2682085037231445
...
max_entry: 1.4677644968032837
max_entry: 1.256148099899292
max_entry: 1.4361257553100586
max_entry: 1.4105850458145142
max_entry: 0.8756170272827148
max_entry: 0.8310933113098145
max_entry: nan
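For reference, here is a minimal sketch of a guard around the failing line (safe_max_exponent is a hypothetical helper, not part of the repo); it fails fast with a clearer message once the tensor goes non-finite, instead of letting math.floor() raise on NaN:

```python
import math
import torch

def safe_max_exponent(x, eps=1e-32):
    # Hypothetical guard around the failing line in block_quantize:
    # raise a descriptive error as soon as the activations are no
    # longer finite, instead of letting math.floor() choke on NaN.
    max_entry = x.abs().max().item()
    if not math.isfinite(max_entry):
        raise RuntimeError(
            "non-finite max_entry (%r): the activations likely "
            "diverged earlier in training" % max_entry)
    return math.floor(math.log2(max_entry + eps))
```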
Environment:
(swalp_cuda9) ➜ SWALP git:(master) conda list
# packages in environment at /home/user/anaconda2/envs/swalp_cuda9:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
blas 1.0 mkl
ca-certificates 2019.5.15 1
certifi 2019.6.16 py36_1
cffi 1.12.3 py36h2e261b9_0
cuda90 1.0 h6433d27_0 pytorch
cudatoolkit 10.0.130 0
freetype 2.9.1 h8a8886c_1
intel-openmp 2019.4 243
jpeg 9b h024ee3a_2
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.0.10 h2733197_2
mkl 2019.4 243
mkl_fft 1.0.12 py36ha843d7b_0
mkl_random 1.0.2 py36hd81dba3_0
ncurses 6.1 he6710b0_1
ninja 1.9.0 py36hfd86e86_0
numpy 1.16.4 py36h7e9f1db_0
numpy-base 1.16.4 py36hde5b4d6_0
olefile 0.46 py36_0
openssl 1.1.1c h7b6447c_1
pillow 6.1.0 py36h34e0f95_0
pip 19.1.1 py36_0
protobuf 3.9.1 pypi_0 pypi
pycparser 2.19 py36_0
python 3.6.9 h265db76_0
pytorch 1.2.0 py3.6_cuda10.0.130_cudnn7.6.2_0 pytorch
readline 7.0 h7b6447c_5
setuptools 41.0.1 py36_0
six 1.12.0 pypi_0 pypi
sqlite 3.29.0 h7b6447c_0
tabulate 0.8.3 pypi_0 pypi
tensorboardx 1.8 pypi_0 pypi
tk 8.6.8 hbc83047_0
torchvision 0.4.0 py36_cu100 pytorch
wheel 0.33.4 py36_0
xz 5.2.4 h14c3975_4
zlib 1.2.11 h7b6447c_3
zstd 1.3.7 h0b5b093_0
OS: Ubuntu 18.04
GPU: GeForce GTX 1060 3GB
@Lucieno @stevenygd I ran into the same problem after ten epochs. Does this indicate a blowup or an overflow? My log follows:
Checkpoint directory ./checkpoints/block-CIFAR10-PreResNet164LP/seed200
Tensorboard logging at checkpoints/block-CIFAR10-PreResNet164LP/seed200_10_15_18_42
Prepare data loaders:
Loading dataset CIFAR10 from ../data/
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Prepare quantizers:
Block rounding, W:8, A:8, G:8, E:8, Acc:8
lr init: 0.1
swa start: 150.0 swa lr: 0.01
Model: PreResNet164LP
Prepare SWA training
---- -------- --------- -------- --------- -------- ------------ --------
ep lr tr_loss tr_acc te_loss te_acc swa_te_acc time
---- -------- --------- -------- --------- -------- ------------ --------
1 0.1000 1.6607 37.8340 1.3099 52.2000 440.3035
2 0.1000 1.1418 58.9820 1.0026 64.3400 438.9215
3 0.1000 0.9461 66.0900 0.9486 67.0200 442.0348
4 0.1000 0.8427 70.4760 0.7507 74.1500 443.9436
5 0.1000 0.7321 74.4340 0.7454 74.2900 461.8002
6 0.1000 0.6698 76.7520 0.6293 78.5400 508.3268
7 0.1000 0.6212 78.5380 0.5772 80.2200 507.5976
8 0.1000 0.5911 79.5060 0.5712 80.1700 509.4609
9 0.1000 0.5713 80.1780 0.6028 79.0300 508.7512
10 0.1000 0.5492 81.0900 0.5643 80.5200 508.5259
11 0.1000 0.5317 81.5480 0.5733 80.5200 504.3482
Traceback (most recent call last):
File "swa_cifar.py", line 189, in <module>
quantize_momentum=args.quantize_momentum)
File "/home/jmlu/Worksapce/ConvNet_Fxp_2.0/utils.py", line 48, in train_batch
output = model(input_var)
File "/home/jmlu/anaconda3/envs/ML/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/jmlu/Worksapce/ConvNet_Fxp_2.0/models/preresnet_low.py", line 152, in forward
x = self.layer3(x) # 8x8
File "/home/jmlu/anaconda3/envs/ML/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/jmlu/anaconda3/envs/ML/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/jmlu/anaconda3/envs/ML/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/jmlu/Worksapce/ConvNet_Fxp_2.0/models/preresnet_low.py", line 91, in forward
out = self.quant(out)
File "/home/jmlu/anaconda3/envs/ML/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/jmlu/Worksapce/ConvNet_Fxp_2.0/models/quantizer.py", line 103, in forward
self.small_block, self.block_dim)
File "/home/jmlu/Worksapce/ConvNet_Fxp_2.0/models/quantizer.py", line 76, in forward
return block_quantize(x, forward_bits, self.mode, small_block=self.small_block, block_dim=self.block_dim)
File "/home/jmlu/Worksapce/ConvNet_Fxp_2.0/models/quantizer.py", line 42, in block_quantize
max_exponent = math.floor(math.log2(max_entry + 1e-32))
ValueError: cannot convert float NaN to integer
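To confirm whether the values blow up during the forward pass, and to see which layer produces the first NaN, one option is to register forward hooks on every submodule (add_nan_hooks is an illustrative debugging helper, not part of the repo):

```python
import torch

def add_nan_hooks(model):
    # Illustrative debugging helper: stop at the first layer whose
    # output contains NaN, instead of failing later in block_quantize.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                raise RuntimeError("NaN first appeared in output of " + name)
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))
```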
+1 to this problem. I'm having the same issue when trying to run the code.
Install the correct dependencies; they are stated in the README. Make sure to install PyTorch 1.0.1, not 1.0.0.
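A quick way to verify that the installed version actually matches (assuming the README pins PyTorch 1.0.1):

```python
import torch
# Sanity check: the environment listed above shows pytorch 1.2.0,
# which does not match the version the README asks for.
assert torch.__version__.startswith("1.0.1"), torch.__version__
```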
Hello, so how did you solve this problem? Is there any way to work around it?
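Pending a confirmed fix, a common mitigation for this kind of divergence is gradient clipping in the training loop. A minimal sketch (clipped_step is a hypothetical helper, not the repo's actual train_batch):

```python
import torch

def clipped_step(model, optimizer, loss, max_norm=5.0):
    # Hypothetical mitigation, not from the SWALP repo: clip gradients
    # so one bad batch cannot push the weights, and hence the next
    # forward pass, into the non-finite range seen above.
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```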