
ValueError: "non finite loss" while running make_kaggle_solution.sh

Open chintak opened this issue 9 years ago • 17 comments

While running python train_nn.py --cnf configs/c_128_5x5_32.py, I got the ValueError. The full error log is attached below. Even after installing lasagne and nolearn at the specified commit IDs, I'm still getting the deprecation warnings. Could this error be related to them?

Error log

chintak avatar Jan 13 '16 14:01 chintak

If you reduce the learning rate a bit, say to 'schedule': {0: 0.002, 150: 0.0002, 201: 'stop'}, do you still get a non-finite loss?
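
For reference, the schedule maps an epoch number to a learning rate, with 'stop' ending training at that epoch. A sketch of how the entry might sit in a config file such as configs/c_128_5x5_32.py (the surrounding keys are assumptions, not necessarily the repo's exact layout):

cnf = {
    # epoch -> learning rate; the 'stop' sentinel ends training at that epoch
    'schedule': {
        0: 0.002,
        150: 0.0002,
        201: 'stop',
    },
}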

sveitser avatar Jan 13 '16 14:01 sveitser

Yes, still the same error. I tried 'schedule': {0: 0.0005, 150: 0.00005, 201: 'stop'} as well.

chintak avatar Jan 13 '16 17:01 chintak

Also, while running the earlier convert.py commands, quite a few files produced "box could not be found" or "box too small" messages. In such cases, what output image is written? I'm wondering whether a corrupt input image is causing this. Alternatively, are there any other intermediate values I can print to debug?

chintak avatar Jan 13 '16 17:01 chintak

That is expected as some images are almost totally black. It just falls back to cropping the center square in those cases.
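
The fallback is roughly equivalent to this sketch (not necessarily the repo's exact code), which crops the largest centered square out of a PIL image:

import numpy as np
from PIL import Image

def center_square(img):
    # crop the largest centered square from a PIL image
    w, h = img.size
    side = min(w, h)
    left = (w - side) // 2
    top = (h - side) // 2
    return img.crop((left, top, left + side, top + side))

# example usage (the filename is hypothetical):
# img = center_square(Image.open('10_left.jpeg'))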

During the competition I noticed that the computations sometimes slowed down, especially when switching between different versions of theano, and I would sometimes have to clear the theano cache and/or reboot my computer. So I would try clearing the theano cache by running

theano-cache clear

or

rm -r ~/.theano

and then try again.

If the problem persists, could you also post the output of pip list and pip freeze here?

sveitser avatar Jan 14 '16 04:01 sveitser

/home/ubuntu/dataset/kaggle_diabetic/solution/src/lasagne-master/lasagne/init.py:86: 
UserWarning: The uniform initializer no longer uses Glorot et al.'s approach
to determine the bounds, but defaults to the range (-0.01, 0.01) instead. 
Please use the new GlorotUniform initializer to get the old behavior. 
GlorotUniform is now the default for all layers.

Could this warning be of consequence? Perhaps we are getting a non-finite loss due to improper parameter initialization.

chintak avatar Jan 14 '16 06:01 chintak

That should be fine. I get these warnings too, but we are using orthogonal initialization: https://github.com/sveitser/kaggle_diabetic/blob/master/layers.py#L36-L37
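
If you want to rule out the initialization entirely, here is a minimal check (a sketch using the stock Lasagne API, not the repo's code) that a conv layer initialized with lasagne.init.Orthogonal starts with finite weights:

import numpy as np
import lasagne

l_in = lasagne.layers.InputLayer((None, 3, 128, 128))
l_conv = lasagne.layers.Conv2DLayer(
    l_in, num_filters=32, filter_size=(5, 5),
    W=lasagne.init.Orthogonal(gain=1.0))

W = l_conv.W.get_value()
print(W.shape, np.isfinite(W).all())  # expect True on a healthy install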

sveitser avatar Jan 14 '16 08:01 sveitser

Ok.

Output for pip list:

click (3.3)
decorator (4.0.6)
funcsigs (0.4)
ghalton (0.6)
joblib (0.9.3)
Lasagne (0.1.dev0, /home/ubuntu/dataset/kaggle_diabetic/solution/src/lasagne-master)
matplotlib (1.4.3)
mock (1.3.0)
networkx (1.10)
nolearn (0.6a0.dev0, /home/ubuntu/dataset/kaggle_diabetic/solution/src/nolearn-master)
nose (1.3.7)
numpy (1.9.2)
pandas (0.16.0)
pbr (1.8.1)
Pillow (2.7.0)
pip (7.1.2)
pyparsing (2.0.7)
python-dateutil (2.4.2)
pytz (2015.7)
PyYAML (3.11)
scikit-image (0.11.3)
scikit-learn (0.16.1)
scipy (0.15.1)
setuptools (18.2)
SharedArray (0.3)
six (1.10.0)
tabulate (0.7.5)
Theano (0.7.0, /home/ubuntu/dataset/kaggle_diabetic/solution/src/theano)
wheel (0.24.0)

Output for pip freeze:

click==3.3
decorator==4.0.6
funcsigs==0.4
ghalton==0.6
joblib==0.9.3
-e git+https://github.com/benanne/Lasagne.git@9f591a5f3a192028df9947ba1e4903b3b46e8fe0#egg=Lasagne-dev
matplotlib==1.4.3
mock==1.3.0
networkx==1.10
-e git+https://github.com/dnouri/nolearn.git@0a225bc5ad60c76cdc6cccbe866f9b2e39502d10#egg=nolearn-dev
nose==1.3.7
numpy==1.9.2
pandas==0.16.0
pbr==1.8.1
Pillow==2.7.0
pyparsing==2.0.7
python-dateutil==2.4.2
pytz==2015.7
PyYAML==3.11
scikit-image==0.11.3
scikit-learn==0.16.1
scipy==0.15.1
SharedArray==0.3
six==1.10.0
tabulate==0.7.5
-e git+https://github.com/Theano/Theano.git@71a3700fcefd8589728b2b91931debad14c38a3f#egg=Theano-master
wheel==0.24.0

chintak avatar Jan 14 '16 08:01 chintak

Any other changes I can try?

chintak avatar Jan 14 '16 08:01 chintak

I don't have any good ideas right now. What version of cuDNN are you using?

You could try this theano commit instead.

pip install --upgrade -e git+https://github.com/Theano/Theano.git@dfb2730348d05f6aadd116ce492e836a4c0ba6d6#egg=Theano-master

I think it's the one I was using when I was working on the project. It's probably best to delete the theano cache again before retrying with another theano version.
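
To confirm which cuDNN Theano actually picks up, something like this should work on the old GPU backend (hedged: these helpers live in theano.sandbox.cuda.dnn in Theano around 0.7 and may not exist in other commits):

import theano
print(theano.__version__)

from theano.sandbox.cuda import dnn
print(dnn.dnn_available())  # True if Theano can use cuDNN
print(dnn.version())        # reported cuDNN version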

sveitser avatar Jan 14 '16 08:01 sveitser

An AWS G2 instance with a GRID K520 GPU, CUDA 7.0, and cuDNN v3.0. And no, the problem still persists.

chintak avatar Jan 14 '16 09:01 chintak

You could insert

print(batch_train_loss[0])

right before this line https://github.com/sveitser/kaggle_diabetic/blob/master/nn.py#L248 to check whether the loss is non-finite from the very first batch or whether it starts out finite and then diverges.
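
It can also help to check the batch data itself before blaming the network. A small helper along these lines (the Xb/yb names below are placeholders, not the repo's variables):

import numpy as np

def assert_finite(arr, name):
    # fail fast if the array contains NaN or inf
    if not np.isfinite(arr).all():
        raise ValueError('non-finite values in %s' % name)

# inside the training loop, before the update step:
# assert_finite(Xb, 'inputs')
# assert_finite(yb, 'labels')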

sveitser avatar Jan 14 '16 09:01 sveitser

Yep, I had tried that in the beginning. It's "nan" for the first batch itself.

chintak avatar Jan 14 '16 09:01 chintak

Have you tried using any other configurations?

sveitser avatar Jan 15 '16 02:01 sveitser

Yes. I did a fresh install, preprocessed the images again, and then ran train_nn.py with each of the given config files; I get "non finite loss" in the very first epoch. I also tried a batch size of 1, and even then the loss is "nan". Something seems to be fundamentally wrong. Are there any unit tests for checking lasagne or nolearn? I'm more of a caffe person.

chintak avatar Jan 16 '16 07:01 chintak

Yes, both have tests, and theano does as well.

For theano (assuming you installed theano with pip previously)

git clone https://github.com/Theano/Theano
cd Theano
theano-nose

For lasagne,

git clone https://github.com/Lasagne/Lasagne
cd Lasagne
pip install -r requirements-dev.txt  # comment out the first line to avoid installing another theano commit
py.test

For nolearn,

git clone https://github.com/dnouri/nolearn
cd nolearn
py.test

For more info, see https://lasagne.readthedocs.org/en/latest/user/development.html#how-to-contribute and http://deeplearning.net/software/theano/extending/unittest.html.
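
Before running the full suites, a quick smoke test can already tell you whether the GPU setup itself produces NaNs. This sketch is independent of the repo (just stock theano):

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
f = theano.function([x], T.nnet.softmax(x))
out = f(np.random.randn(4, 5).astype(theano.config.floatX))
print(np.isfinite(out).all())  # expect True on a healthy install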

sveitser avatar Jan 16 '16 08:01 sveitser

@chintak Just out of curiosity. Did you manage to get things to work or find out what is going wrong?

sveitser avatar Jan 21 '16 04:01 sveitser

Nope. In a few days I'll try testing it on another system.

chintak avatar Jan 21 '16 05:01 chintak