multi-class-text-classification-cnn-rnn

Training fails

Open Tomas0413 opened this issue 8 years ago • 11 comments

Hi, I'm having this issue when I run training:

python3 train.py ./data/train.csv.zip ./training_config.json

CRITICAL:root:Accuracy on test set: 0.9971641706053186
Traceback (most recent call last):
  File "train.py", line 161, in <module>
    train_cnn_rnn()
  File "train.py", line 151, in train_cnn_rnn
    os.rename(path, trained_dir + 'best_model.ckpt')
FileNotFoundError: [Errno 2] No such file or directory: './checkpoints_1486165230/model-2700' -> './trained_results_1486165230/best_model.ckpt'

I'll spend a bit of time tomorrow to see how to fix this problem.

Tomas0413 avatar Feb 03 '17 23:02 Tomas0413

Did you check the saved model directory? Looks like model-2700 doesn't exist.

os.rename(path, trained_dir + 'best_model.ckpt')
FileNotFoundError: [Errno 2] No such file or directory: './checkpoints_1486165230/model-2700' -> './trained_results_1486165230/best_model.ckpt'

jiegzhan avatar Feb 04 '17 01:02 jiegzhan

@jiegzhan Yes, the model-2700 files do exist, but there is no file named exactly model-2700, nor is there a directory with that name:

ls -lrt ./checkpoints_1486165230/
total 71404
-rw-r--r-- 1 root root     1433 Feb 3 23:41 model-1600.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:41 model-1600.data-00000-of-00001
-rw-r--r-- 1 root root  1543734 Feb 3 23:41 model-1600.meta
-rw-r--r-- 1 root root     1433 Feb 3 23:41 model-1700.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:41 model-1700.data-00000-of-00001
-rw-r--r-- 1 root root  1543734 Feb 3 23:41 model-1700.meta
-rw-r--r-- 1 root root     1433 Feb 3 23:42 model-2200.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2200.data-00000-of-00001
-rw-r--r-- 1 root root  1543734 Feb 3 23:42 model-2200.meta
-rw-r--r-- 1 root root     1433 Feb 3 23:42 model-2400.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2400.data-00000-of-00001
-rw-r--r-- 1 root root  1543734 Feb 3 23:42 model-2400.meta
-rw-r--r-- 1 root root     1433 Feb 3 23:42 model-2700.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2700.data-00000-of-00001
-rw-r--r-- 1 root root      241 Feb 3 23:42 checkpoint
-rw-r--r-- 1 root root  1543734 Feb 3 23:42 model-2700.meta

I'll check whether train.py failed to write something correctly or whether the os.rename call is incorrect.

Tomas0413 avatar Feb 04 '17 10:02 Tomas0413
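The listing above suggests that model-2700 is a filename prefix shared by several files rather than a single file, which would explain why os.rename on the bare prefix fails. A minimal sketch of moving every file that shares a prefix, using only the standard library, could look like the following. The helper name move_checkpoint is hypothetical and is not part of train.py, and whether TensorFlow restores cleanly from the renamed prefix should be verified separately (the checkpoint state file, for example, would still list the old names).

```python
import glob
import os
import shutil


def move_checkpoint(prefix, dest_prefix):
    """Move every file belonging to a checkpoint prefix (hypothetical helper).

    Assumes the checkpoint is stored either as a single file named `prefix`
    or as several files named `prefix.<suffix>`, such as `.index`, `.meta`
    and `.data-00000-of-00001`.
    """
    escaped = glob.escape(prefix)
    # Match the bare prefix (single-file layout) and prefix.* (multi-file layout).
    candidates = glob.glob(escaped) + glob.glob(escaped + ".*")

    moved = []
    for src in sorted(candidates):
        suffix = src[len(prefix):]  # '' or e.g. '.index'
        dst = dest_prefix + suffix
        os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
        shutil.move(src, dst)
        moved.append(dst)

    if not moved:
        raise FileNotFoundError("no checkpoint files found for prefix: " + prefix)
    return moved
```

With the paths from the traceback, the call would be move_checkpoint('./checkpoints_1486165230/model-2700', './trained_results_1486165230/best_model.ckpt'), which leaves the other checkpoints (model-2400 and so on) untouched.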

python3 -c 'import tensorflow as tf; print(tf.__version__)'
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
0.12.1

Tomas0413 avatar Feb 04 '17 10:02 Tomas0413

My tensorflow version is 0.9; it only produces two training files.

-rw-r--r-- 1 root root     1433 Feb 3 23:42 model-2700.index
-rw-r--r-- 1 root root  1543734 Feb 3 23:42 model-2700.meta
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2700.data-00000-of-00001

The newer version has three training files, instead of two.

jiegzhan avatar Feb 04 '17 18:02 jiegzhan

Do you get more files created in the checkpoints directory? I see *.meta, *.index, *.data-* and checkpoint.

Tomas0413 avatar Feb 04 '17 18:02 Tomas0413

My tensorflow version is 0.9, it only produces two training files.

jiegzhan avatar Feb 04 '17 18:02 jiegzhan

Found this: https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md

New checkpoint format becomes the default in tf.train.Saver. Old V1 checkpoints continue to be readable; controlled by the write_version argument, tf.train.Saver now by default writes out in the new V2 format. It significantly reduces the peak memory required and latency incurred during restore.

Tomas0413 avatar Feb 04 '17 18:02 Tomas0413
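To tell which layout a given run actually wrote, the files on disk can be inspected without loading TensorFlow. The sketch below is an assumption-laden heuristic, not a TensorFlow API: it assumes that a V2 checkpoint consists of prefix.index plus one or more prefix.data-* shards (as in the listing earlier in this thread), and that an unsharded V1 checkpoint is a single file named exactly by the prefix. The function name checkpoint_format is hypothetical.

```python
import glob
import os


def checkpoint_format(prefix):
    """Guess the on-disk layout for a checkpoint prefix such as 'model-2700'.

    Heuristic only (assumed layouts, not an official TensorFlow check):
      - "V2": prefix.index and at least one prefix.data-* shard exist
      - "V1": a regular file named exactly `prefix` exists
      - None: neither layout was found
    """
    has_index = os.path.isfile(prefix + ".index")
    has_shards = bool(glob.glob(glob.escape(prefix) + ".data-*"))
    if has_index and has_shards:
        return "V2"
    if os.path.isfile(prefix):
        return "V1"
    return None
```

For the directory listed earlier, checkpoint_format('./checkpoints_1486165230/model-2700') would report the V2 layout, which is consistent with os.rename failing on the bare prefix.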

Set the write_version argument if you are in a hurry.

I will try to upgrade TensorFlow and make the changes soon.

Thanks for pointing this out.

jiegzhan avatar Feb 04 '17 18:02 jiegzhan

Yep, testing it with V1 now.

Tomas0413 avatar Feb 04 '17 18:02 Tomas0413

Yep, it works fine with:

saver = tf.train.Saver(tf.all_variables(), write_version=tf.train.SaverDef.V1)

This is what the warning message looks like:

WARNING:tensorflow:*******************************************************
WARNING:tensorflow:*******************************************************
WARNING:tensorflow:TensorFlow's V1 checkpoint format has been deprecated.
WARNING:tensorflow:TensorFlow's V1 checkpoint format has been deprecated.
WARNING:tensorflow:Consider switching to the more efficient V2 format:
WARNING:tensorflow:Consider switching to the more efficient V2 format:
WARNING:tensorflow: tf.train.Saver(write_version=tf.train.SaverDef.V2)
WARNING:tensorflow: tf.train.Saver(write_version=tf.train.SaverDef.V2)
WARNING:tensorflow:now on by default.
WARNING:tensorflow:now on by default.
WARNING:tensorflow:*******************************************************
WARNING:tensorflow:*******************************************************

Thanks

Tomas0413 avatar Feb 04 '17 18:02 Tomas0413

Hi guys, is there any solution for the above issue? If so, please reply.

ghost avatar Aug 30 '17 04:08 ghost