pytorch-best-practice

Floating-point overflow error

Open: foolishflyfox opened this issue 6 years ago • 1 comment

A numeric overflow occurred during training. Here is the output from the run:

python main.py train --train-data-root=/home/linux_fhb/data/cat_vs_dog/train --use-gpu --env=classifier
user config:
env classifier
model ResNet34
train_data_root /home/linux_fhb/data/cat_vs_dog/train
test_data_root ./data/test1
load_model_path None
batch_size 32
use_gpu True
num_workers 4
print_freq 20
debug_file /tmp/debug
result_file result.csv
max_epoch 10
lr 0.1
lr_decay 0.95
weight_decay 0.0001
parse <bound method parse of <config.DefaultConfig object at 0x7f3e4a85b400>>
/home/linux_fhb/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py:188: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
  "please use transforms.Resize instead.")
/home/linux_fhb/anaconda3/lib/python3.6/site-packages/torchvision/transforms/transforms.py:563: UserWarning: The use of the transforms.RandomSizedCrop transform is deprecated, please use transforms.RandomResizedCrop instead.
  "please use transforms.RandomResizedCrop instead.")
  0%|                                                 | 0/17500 [00:00<?, ?it/s]main.py:99: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  loss_meter.add(loss.data[0])
  3%|█▏                                   | 547/17500 [02:09<1:05:07,  4.34it/s]
main.py:138: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  val_input = Variable(input, volatile=True)
main.py:139: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  val_label = Variable(label.type(t.LongTensor), volatile=True)
Traceback (most recent call last):
  File "main.py", line 171, in <module>
    fire.Fire()
  File "/home/linux_fhb/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/linux_fhb/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/linux_fhb/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "main.py", line 121, in train
    if loss_meter.value()[0] > previous_loss:          
RuntimeError: value cannot be converted to type float without overflow: 10000000000000000159028911097599180468360808563945281389781327557747838772170381060813469985856815104.000000

The environment versions are:

Python 3.6.5 :: Anaconda, Inc.
fire                               0.1.3    
numpy                              1.14.3   
numpydoc                           0.8.0    
torch                              0.4.1    
torchfile                          0.1.0    
torchnet                           0.0.4    
torchvision                        0.2.1    
visdom                             0.1.8.5  

GPU: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1), 11 GB of video memory.

Has anyone run into the same problem? How did you solve it?

foolishflyfox, Dec 07 '18 11:12

Change this value: previous_loss

lijie2160, Jan 30 '19 04:01
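
For anyone landing here later, a minimal sketch of what that reply seems to point at. The number in the RuntimeError looks like 1e100 printed in full, so presumably `previous_loss` starts at 1e100 and is then compared against a float32 tensor. This is inferred from the log; main.py itself is not shown in this thread. 1e100 is far above the float32 maximum of about 3.4e38, so a sentinel such as `float('inf')` or 1e10 should avoid the conversion error. `fits_in_float32` and `update_lr` are hypothetical helper names, not part of the repo, and the standard library's `struct` module is used only as a stand-in for the tensor conversion.

```python
import struct


def fits_in_float32(x):
    """Return True if x can be stored as a 32-bit float without overflow.

    Stand-in for the float32 conversion that fails in the traceback;
    uses struct instead of torch so it runs anywhere.
    """
    try:
        struct.pack("<f", x)
    except OverflowError:
        return False
    return True


def update_lr(epoch_loss, previous_loss, lr, lr_decay):
    """Hypothetical version of the check at main.py line 121.

    Decays the learning rate when the epoch loss got worse and returns
    the (possibly decayed) lr plus the new previous_loss.
    Both losses are assumed to be plain Python floats here.
    """
    if epoch_loss > previous_loss:
        lr = lr * lr_decay
    return lr, epoch_loss


if __name__ == "__main__":
    # Assumed original sentinel (read off the error message) vs. safer ones.
    for sentinel in (1e100, 1e10, float("inf")):
        print(sentinel, fits_in_float32(sentinel))

    # Start from a sentinel that fits in float32.
    lr, previous_loss = update_lr(0.7, float("inf"), lr=0.1, lr_decay=0.95)
    print(lr, previous_loss)
```

With the values from the config above (lr 0.1, lr_decay 0.95), starting from `previous_loss = float('inf')` leaves the learning rate untouched after the first epoch. Converting the loss with `loss.item()`, as the warning at main.py:99 suggests, should also keep the comparison in plain Python floats.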