Get nan in the losses during training

Open avivples opened this issue 3 years ago • 6 comments

During training, the printed losses become nan partway through each epoch.

Does anyone know what can cause this?

avivples, Sep 14 '21 10:09

Same issue here; I have not found a solution.

yyyue502, Sep 16 '21 04:09

Do you use PyTorch 1.9? Please use the suggested environment.

WongKinYiu, Sep 16 '21 04:09

No, I am using PyTorch 1.5.0 with apex 0.1. Is there any solution for this environment?

yyyue502, Sep 16 '21 04:09


By the way, I have not tested the code on GPUs that do not support mixed-precision training.

WongKinYiu, Sep 16 '21 04:09
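
For context on why mixed precision is relevant to nan losses, here is a minimal sketch in plain Python (it is not this repository's code, and `to_half` is a hypothetical helper written for illustration). It rounds values to IEEE half precision with the standard `struct` module and shows two effects that are commonly involved: large values overflow to inf, after which an operation such as inf - inf yields nan, and very small gradients underflow to zero unless they are scaled up first.

```python
import math
import struct


def to_half(x):
    """Round a Python float to IEEE 754 half precision (fp16).

    Hypothetical helper for illustration only. struct raises
    OverflowError for values too large for fp16; we map that case
    to +/-inf, which is what fp16 hardware arithmetic would produce.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


# The largest finite fp16 value is 65504; anything far beyond it becomes inf.
big = to_half(1e5)
print(big)

# Once an inf appears, ordinary arithmetic can turn it into nan.
print(big - big)

# Tiny gradients vanish in fp16 ...
print(to_half(1e-8))

# ... unless they are multiplied by a loss scale before the cast.
# The factor 1024 is an arbitrary example value.
print(to_half(1e-8 * 1024))
```

Loss scaling in mixed-precision training addresses the underflow case; the overflow case usually has to be handled by keeping the affected computation in fp32 or by lowering the scale.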

Is there a way to replace PyTorch's native AMP (torch.cuda.amp) with apex? My environment is restricted to PyTorch 1.5.0 with apex 0.1.

yyyue502, Sep 16 '21 04:09

I am getting nan, too.

I am trying to train a custom model that predicts 6 classes. I am working with satellite images, so my input is not limited to RGB: I am using 12 channels.

Here are the packages in my environment:

Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 1.0.0 pyhd8ed1ab_0 conda-forge
aiohttp 3.8.1 py38h0a891b7_1 conda-forge
aiosignal 1.2.0 pyhd8ed1ab_0 conda-forge
async-timeout 4.0.2 pyhd8ed1ab_0 conda-forge
attrs 21.4.0 pyhd8ed1ab_0 conda-forge
blas 1.0 mkl
blinker 1.4 py_1 conda-forge
brotlipy 0.7.0 py38h0a891b7_1004 conda-forge
c-ares 1.18.1 h7f98852_0 conda-forge
ca-certificates 2022.4.26 h06a4308_0 anaconda
cachetools 5.0.0 pyhd8ed1ab_0 conda-forge
certifi 2021.10.8 py38h06a4308_2 anaconda
cffi 1.15.0 py38hd667e15_1
charset-normalizer 2.0.12 pyhd8ed1ab_0 conda-forge
click 8.1.3 py38h578d9bd_0 conda-forge
colorama 0.4.4 pyh9f0ad1d_0 conda-forge
cryptography 37.0.2 py38h2b5fc30_0 conda-forge
cudatoolkit 9.2 0
cycler 0.11.0 pyhd8ed1ab_0 conda-forge
cython 0.29.30 py38hfa26641_0 conda-forge
freetype 2.11.0 h70c0345_0
frozenlist 1.3.0 py38h0a891b7_1 conda-forge
giflib 5.2.1 h7b6447c_0
google-auth 2.6.6 pyh6c4a22f_0 conda-forge
google-auth-oauthlib 0.4.6 pyhd8ed1ab_0 conda-forge
grpcio 1.42.0 py38hce63b2e_0
idna 3.3 pyhd8ed1ab_0 conda-forge
importlib-metadata 4.11.4 py38h578d9bd_0 conda-forge
intel-openmp 2021.4.0 h06a4308_3561
jpeg 9e h7f8727e_0
kiwisolver 1.4.2 py38h43d8883_1 conda-forge
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
libffi 3.3 he6710b0_2
libgcc-ng 11.2.0 h1234567_0
libgfortran-ng 7.5.0 ha8ba4b0_17 anaconda
libgfortran4 7.5.0 ha8ba4b0_17 anaconda
libgomp 11.2.0 h1234567_0
libpng 1.6.37 hbc83047_0
libprotobuf 3.15.8 h780b84a_1 conda-forge
libstdcxx-ng 11.2.0 h1234567_0
libtiff 4.2.0 h2818925_1
libuv 1.40.0 h7b6447c_0
libwebp 1.2.2 h55f646e_0
libwebp-base 1.2.2 h7f8727e_0
lz4-c 1.9.3 h295c915_1
markdown 3.3.7 pyhd8ed1ab_0 conda-forge
matplotlib-base 3.4.3 py38hf4fb855_1 conda-forge
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py38h7f8727e_0
mkl_fft 1.3.1 py38hd3c417c_0
mkl_random 1.2.2 py38h51133e4_0
multidict 6.0.2 py38h0a891b7_1 conda-forge
ncurses 6.3 h7f8727e_2
ninja 1.10.2 h06a4308_5
ninja-base 1.10.2 hd09550d_5
numpy 1.22.3 py38he7a7128_0
numpy-base 1.22.3 py38hf524024_0
oauthlib 3.2.0 pyhd8ed1ab_0 conda-forge
opencv-python pypi_0 pypi
openssl 1.1.1o h7f8727e_0
pillow 9.0.1 py38h22f2fdc_0
pip 21.2.4 py38h06a4308_0
protobuf 3.15.8 py38h709712a_0 conda-forge
pyasn1 0.4.8 py_0 conda-forge
pyasn1-modules 0.2.7 py_0 conda-forge
pycocotools 2.0.4 py38h6c62de6_0 conda-forge
pycparser 2.21 pyhd8ed1ab_0 conda-forge
pyjwt 2.4.0 pyhd8ed1ab_0 conda-forge
pyopenssl 22.0.0 pyhd8ed1ab_0 conda-forge
pyparsing 3.0.9 pyhd8ed1ab_0 conda-forge
pysocks 1.7.1 py38h578d9bd_5 conda-forge
python 3.8.13 h12debd9_0
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python_abi 3.8 2_cp38 conda-forge
pytorch 1.7.0 py3.8_cuda9.2.148_cudnn7.6.3_0 pytorch
pyu2f 0.1.5 pyhd8ed1ab_0 conda-forge
pyyaml 6.0 pypi_0 pypi
readline 8.1.2 h7f8727e_1
requests 2.27.1 pyhd8ed1ab_0 conda-forge
requests-oauthlib 1.3.1 pyhd8ed1ab_0 conda-forge
rsa 4.8 pyhd8ed1ab_0 conda-forge
scipy 1.7.3 py38hc147768_0 anaconda
setuptools 61.2.0 py38h06a4308_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.38.3 hc218d9a_0
tensorboard 2.9.0 pyhd8ed1ab_0 conda-forge
tensorboard-data-server 0.6.0 py38h2b5fc30_2 conda-forge
tensorboard-plugin-wit 1.8.1 pyhd8ed1ab_0 conda-forge
tk 8.6.11 h1ccaba5_1
torchaudio 0.7.0 py38 pytorch
torchvision 0.8.0 py38_cu92 pytorch
tornado 6.1 py38h0a891b7_3 conda-forge
tqdm 4.64.0 pyhd8ed1ab_0 conda-forge
typing-extensions 4.1.1 hd3eb1b0_0
typing_extensions 4.1.1 pyh06a4308_0
urllib3 1.26.9 pyhd8ed1ab_0 conda-forge
werkzeug 2.1.2 pyhd8ed1ab_1 conda-forge
wheel 0.37.1 pyhd3eb1b0_0
xz 5.2.5 h7f8727e_1
yaml 0.2.5 h7b6447c_0 anaconda
yarl 1.7.2 py38h0a891b7_2 conda-forge
zipp 3.8.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.12 h7f8727e_2
zstd 1.5.2 ha4553b6_0

MuhammetAkcann, May 24 '22 12:05
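
Since several reports in this thread only say that the losses turn nan "after a while", one way to narrow things down is to log each loss component per step and find the first value that is not finite. The sketch below assumes the losses have already been converted to Python floats (for example with `.item()` on a tensor); the function name `first_non_finite` and the component names `box`, `obj`, and `cls` are hypothetical and are not taken from this repository.

```python
import math


def first_non_finite(history):
    """Return (step, name) for the first nan/inf loss component, else None.

    `history` is a list with one dict per training step, mapping a
    loss component name to its float value. Illustrative helper only.
    """
    for step, components in enumerate(history):
        for name, value in components.items():
            if not math.isfinite(value):
                return step, name
    return None


# Made-up values for illustration: one component blows up to inf at
# step 1, and everything is nan from step 2 onward.
history = [
    {"box": 0.08, "obj": 0.03, "cls": 0.02},
    {"box": 0.07, "obj": float("inf"), "cls": 0.02},
    {"box": float("nan"), "obj": float("nan"), "cls": float("nan")},
]

print(first_non_finite(history))  # (1, 'obj')
```

Knowing which component goes non-finite first, and at which step, makes it easier to tell an overflow in one loss term apart from bad input data, which matters when the inputs are unusual (such as 12-channel imagery).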