ValueError: "non finite loss" while running make_kaggle_solution.sh
While running `python train_nn.py --cnf configs/c_128_5x5_32.py`, I got the ValueError. The full error log is attached below. Even after installing lasagne and nolearn at the given commit ids, I'm still getting the deprecation warnings. Could this error be related to them?
If you reduce the learning rate a bit, say to `'schedule': {0: 0.002, 150: 0.0002, 201: 'stop'}`, do you still get a non-finite loss?
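For reference, here is a minimal sketch (illustrative only, not the repo's actual update code) of how an epoch-keyed schedule like this is interpreted: each key is an epoch number, the value is the learning rate to switch to at that epoch, and `'stop'` ends training.

```python
# Illustrative sketch: interpreting an epoch-keyed learning rate schedule.
schedule = {0: 0.002, 150: 0.0002, 201: 'stop'}

def apply_schedule(epoch, current_lr):
    """Return the learning rate for this epoch, or None to stop training."""
    value = schedule.get(epoch)
    if value == 'stop':
        return None        # the training loop should stop at this epoch
    if value is not None:
        return value       # switch to the new learning rate
    return current_lr      # no entry for this epoch, keep the current rate
```

So with the values above, epochs 0-149 run at 0.002, epochs 150-200 at 0.0002, and training stops at epoch 201.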
Yes, still the same error. Tried with `'schedule': {0: 0.0005, 150: 0.00005, 201: 'stop'}` as well.
Also, while running the earlier `convert.py` commands, there were quite a few files for which "box could not be found" or "box too small" was printed. In such cases, what output image is written? I'm wondering if a corrupt input image is causing this. Alternatively, are there any other intermediate values I can print out to debug?
That is expected as some images are almost totally black. It just falls back to cropping the center square in those cases.
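If you still suspect a corrupt input, you could scan the converted images for obviously broken files. A rough sketch (the helper and the glob pattern are hypothetical; point it at whatever directory your conversion wrote to):

```python
# Hypothetical helper: scan converted images for unreadable files, NaN/inf
# pixels, or zero variance (completely flat images). Adjust the pattern to
# match the output directory and extension of your convert.py run.
import glob
import numpy as np
from PIL import Image

def scan_images(pattern):
    bad = []
    for path in sorted(glob.glob(pattern)):
        try:
            img = np.asarray(Image.open(path), dtype=np.float32)
        except Exception as exc:
            bad.append((path, 'unreadable: %s' % exc))
            continue
        if not np.isfinite(img).all() or img.std() == 0:
            bad.append((path, 'min=%.1f max=%.1f std=%.1f'
                        % (img.min(), img.max(), img.std())))
    return bad

print(scan_images('data/train_res/*.tiff'))
```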
During the competition I noticed that the computations sometimes slowed down, especially when switching between different versions of theano, and I would sometimes have to clear the theano cache and/or reboot my machine. So I would try clearing the theano cache by running `theano-cache clear` or `rm -r ~/.theano` and then try again.
If the problem persists, could you also post the output of `pip list` and `pip freeze` here?
/home/ubuntu/dataset/kaggle_diabetic/solution/src/lasagne-master/lasagne/init.py:86:
UserWarning: The uniform initializer no longer uses Glorot et al.'s approach
to determine the bounds, but defaults to the range (-0.01, 0.01) instead.
Please use the new GlorotUniform initializer to get the old behavior.
GlorotUniform is now the default for all layers.
Could this warning be of consequence? Perhaps we are getting a non-finite loss due to improper parameter initialization.
That should be fine. I get these warnings too, but we are using orthogonal initialization: https://github.com/sveitser/kaggle_diabetic/blob/master/layers.py#L36-L37
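For illustration, passing `W=` explicitly overrides the module default, so the Uniform warning does not affect those weights. This is just a sketch with the plain `Conv2DLayer`, not the exact layer class used in `layers.py`:

```python
# Sketch only: an explicit orthogonal initializer overrides the default,
# so the UserWarning about the Uniform initializer does not apply here.
import lasagne
from lasagne.layers import InputLayer, Conv2DLayer

l_in = InputLayer(shape=(None, 3, 128, 128))
l_conv = Conv2DLayer(
    l_in,
    num_filters=32,
    filter_size=(5, 5),
    W=lasagne.init.Orthogonal(gain=1.0),  # explicit init, not the module default
    b=lasagne.init.Constant(0.05),
)
```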
Ok.
Output of `pip list`:
click (3.3)
decorator (4.0.6)
funcsigs (0.4)
ghalton (0.6)
joblib (0.9.3)
Lasagne (0.1.dev0, /home/ubuntu/dataset/kaggle_diabetic/solution/src/lasagne-master)
matplotlib (1.4.3)
mock (1.3.0)
networkx (1.10)
nolearn (0.6a0.dev0, /home/ubuntu/dataset/kaggle_diabetic/solution/src/nolearn-master)
nose (1.3.7)
numpy (1.9.2)
pandas (0.16.0)
pbr (1.8.1)
Pillow (2.7.0)
pip (7.1.2)
pyparsing (2.0.7)
python-dateutil (2.4.2)
pytz (2015.7)
PyYAML (3.11)
scikit-image (0.11.3)
scikit-learn (0.16.1)
scipy (0.15.1)
setuptools (18.2)
SharedArray (0.3)
six (1.10.0)
tabulate (0.7.5)
Theano (0.7.0, /home/ubuntu/dataset/kaggle_diabetic/solution/src/theano)
wheel (0.24.0)
Output of `pip freeze`:
click==3.3
decorator==4.0.6
funcsigs==0.4
ghalton==0.6
joblib==0.9.3
-e git+https://github.com/benanne/Lasagne.git@9f591a5f3a192028df9947ba1e4903b3b46e8fe0#egg=Lasagne-dev
matplotlib==1.4.3
mock==1.3.0
networkx==1.10
-e git+https://github.com/dnouri/nolearn.git@0a225bc5ad60c76cdc6cccbe866f9b2e39502d10#egg=nolearn-dev
nose==1.3.7
numpy==1.9.2
pandas==0.16.0
pbr==1.8.1
Pillow==2.7.0
pyparsing==2.0.7
python-dateutil==2.4.2
pytz==2015.7
PyYAML==3.11
scikit-image==0.11.3
scikit-learn==0.16.1
scipy==0.15.1
SharedArray==0.3
six==1.10.0
tabulate==0.7.5
-e git+https://github.com/Theano/Theano.git@71a3700fcefd8589728b2b91931debad14c38a3f#egg=Theano-master
wheel==0.24.0
Any other changes I can try?
I don't have any good ideas right now. What version of cuDNN are you using?
You could try this theano commit instead:
pip install --upgrade -e git+https://github.com/Theano/Theano.git@dfb2730348d05f6aadd116ce492e836a4c0ba6d6#egg=Theano-master
I think it's the one I was using while working on the project. It's probably best to delete the theano cache again before retrying with another theano version.
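To rule out a broken cuDNN setup, you could also quickly check whether Theano can see cuDNN at all. A sketch using the old CUDA backend that Theano 0.7 ships with:

```python
# Sketch: check that Theano's old CUDA backend can see cuDNN.
# Run with something like: THEANO_FLAGS=device=gpu,floatX=float32 python check_dnn.py
from theano.sandbox.cuda import dnn
print('cuDNN available:', dnn.dnn_available())
```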
I'm on an AWS G2 instance, which has a GRID K520 GPU, with CUDA 7.0 and cuDNN v3.0. Nope, the problem still persists.
You could insert `print(batch_train_loss[0])` right before this line, https://github.com/sveitser/kaggle_diabetic/blob/master/nn.py#L248, to check whether the loss is non-finite from the very first batch or whether it is finite at first and then diverges.
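For example, a small helper (hypothetical; `Xb`/`yb` stand for whatever the batch iterator yields at that point in `nn.py`) that also dumps the batch statistics when the loss is not finite:

```python
# Hypothetical debugging helper to call right after the batch loss is computed.
import numpy as np

def check_batch(loss, Xb, yb):
    print('batch loss:', loss)
    if not np.isfinite(loss):
        print('X finite:', np.isfinite(Xb).all(),
              'min/max/mean:', Xb.min(), Xb.max(), Xb.mean())
        print('y values:', np.unique(yb))
```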
Yep, I had already tried that in the beginning. The loss is `nan` from the very first batch.
Have you tried using any other configurations?
Yes. I did a fresh install, preprocessed the images again, and then ran `train_nn.py` with all the given config files; I get "non finite loss" in the very first epoch. I also tried using a batch size of 1, and even then the loss is `nan`. Something seems to be fundamentally wrong. Are there any unit tests for checking lasagne or nolearn? I'm more of a caffe person.
Yes, both have tests, and theano does as well.
For theano (assuming you previously installed theano with pip):
git clone https://github.com/Theano/Theano
cd Theano
theano-nose
For lasagne:
git clone https://github.com/Lasagne/Lasagne
cd Lasagne
pip install -r requirements-dev.txt # comment out the first line to avoid installing another theano commit
py.test
For nolearn:
git clone https://github.com/dnouri/nolearn
cd nolearn
py.test
For more info, see https://lasagne.readthedocs.org/en/latest/user/development.html#how-to-contribute and http://deeplearning.net/software/theano/extending/unittest.html.
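As a quicker first check than the full test suites, you could also push a single dense computation through Theano on the configured device and verify that the result is finite. A minimal sketch:

```python
# Minimal sanity sketch: one dense computation through Theano on the configured
# device. Run with e.g. THEANO_FLAGS=device=gpu,floatX=float32 python sanity.py
import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
w = theano.shared(np.random.randn(256, 256).astype(theano.config.floatX), name='w')
f = theano.function([x], T.nnet.sigmoid(T.dot(x, w)).mean())

out = f(np.random.randn(64, 256).astype(theano.config.floatX))
print('result:', out, 'finite:', bool(np.isfinite(out)))
```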
@chintak Just out of curiosity, did you manage to get things to work or figure out what was going wrong?
Nope. I'll try testing it on another system in a few days.