multi-class-text-classification-cnn-rnn Training fails

Hi, I'm having this issue when I run training:

python3 train.py ./data/train.csv.zip ./training_config.json

CRITICAL:root:Accuracy on test set: 0.9971641706053186
Traceback (most recent call last):
  File "train.py", line 161, in <module>
    train_cnn_rnn()
  File "train.py", line 151, in train_cnn_rnn
    os.rename(path, trained_dir + 'best_model.ckpt')
FileNotFoundError: [Errno 2] No such file or directory: './checkpoints_1486165230/model-2700' -> './trained_results_1486165230/best_model.ckpt'

I'll spend a bit of time tomorrow to see how to fix this problem.

Feb 03 '17 23:02 Tomas0413

Did you check the saved model directory? Looks like model-2700 doesn't exist.

os.rename(path, trained_dir + 'best_model.ckpt')
FileNotFoundError: [Errno 2] No such file or directory: './checkpoints_1486165230/model-2700' -> './trained_results_1486165230/best_model.ckpt'

Feb 04 '17 01:02 jiegzhan

@jiegzhan Yes, the model-2700 files do exist, but there is no file named exactly model-2700, nor is it a directory:

ls -lrt ./checkpoints_1486165230/
total 71404
-rw-r--r-- 1 root root     1433 Feb 3 23:41 model-1600.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:41 model-1600.data-00000-of-00001
-rw-r--r-- 1 root root  1543734 Feb 3 23:41 model-1600.meta
-rw-r--r-- 1 root root     1433 Feb 3 23:41 model-1700.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:41 model-1700.data-00000-of-00001
-rw-r--r-- 1 root root  1543734 Feb 3 23:41 model-1700.meta
-rw-r--r-- 1 root root     1433 Feb 3 23:42 model-2200.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2200.data-00000-of-00001
-rw-r--r-- 1 root root  1543734 Feb 3 23:42 model-2200.meta
-rw-r--r-- 1 root root     1433 Feb 3 23:42 model-2400.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2400.data-00000-of-00001
-rw-r--r-- 1 root root  1543734 Feb 3 23:42 model-2400.meta
-rw-r--r-- 1 root root     1433 Feb 3 23:42 model-2700.index
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2700.data-00000-of-00001
-rw-r--r-- 1 root root      241 Feb 3 23:42 checkpoint
-rw-r--r-- 1 root root  1543734 Feb 3 23:42 model-2700.meta

I'll check whether train.py failed to write something correctly or whether the os.rename call is incorrect.

Feb 04 '17 10:02 Tomas0413

python3 -c 'import tensorflow as tf; print(tf.__version__)'
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
0.12.1

Feb 04 '17 10:02 Tomas0413

My tensorflow version is 0.9, and it only produces two training files.

-rw-r--r-- 1 root root     1433 Feb 3 23:42 model-2700.index
-rw-r--r-- 1 root root  1543734 Feb 3 23:42 model-2700.meta
-rw-r--r-- 1 root root 13073080 Feb 3 23:42 model-2700.data-00000-of-00001

The newer version has three training files, instead of two.

Feb 04 '17 18:02 jiegzhan

Do you get more files created in the checkpoints directory? I see *.meta, *.index, *.data-* and checkpoint.

Feb 04 '17 18:02 Tomas0413

My tensorflow version is 0.9, it only produces two training files.

Feb 04 '17 18:02 jiegzhan

Found this: https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md

New checkpoint format becomes the default in tf.train.Saver. Old V1 checkpoints continue to be readable; controlled by the write_version argument, tf.train.Saver now by default writes out in the new V2 format. It significantly reduces the peak memory required and latency incurred during restore.

Feb 04 '17 18:02 Tomas0413
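Given the V2 format described in the release notes, the os.rename call fails because 'model-2700' is only a filename prefix: the files on disk are model-2700.index, model-2700.meta and model-2700.data-00000-of-00001. As an alternative to switching back to V1, here is a minimal sketch that moves every file sharing the prefix. The helper name move_checkpoint is hypothetical and is not part of train.py or TensorFlow; it only uses the standard library.

```python
import glob
import os
import shutil


def move_checkpoint(src_prefix, dst_prefix):
    """Move all files of one checkpoint and return the new paths.

    Hypothetical helper, not part of train.py or TensorFlow.
    """
    # V2 checkpoints are stored as <prefix>.index, <prefix>.meta and
    # <prefix>.data-XXXXX-of-YYYYY, so match "<prefix>.*" only.
    # This avoids picking up e.g. model-27000.* when asked for model-2700.
    candidates = glob.glob(glob.escape(src_prefix) + '.*')
    # V1 checkpoints also have a data file named exactly like the prefix.
    if os.path.isfile(src_prefix):
        candidates.append(src_prefix)
    if not candidates:
        raise FileNotFoundError('no checkpoint files match ' + src_prefix)

    base = os.path.basename(src_prefix)
    moved = []
    for src in sorted(candidates):
        # Keep whatever follows the prefix ('.index', '.meta', ...).
        suffix = os.path.basename(src)[len(base):]
        dst = dst_prefix + suffix
        shutil.move(src, dst)
        moved.append(dst)
    return moved
```

In train.py this would stand in for the failing line, roughly as move_checkpoint(path, trained_dir + 'best_model.ckpt'), assuming path holds the checkpoint prefix as in the traceback. The new prefix (without any suffix) is then what you would pass when restoring. Note that the separate 'checkpoint' file still lists the old paths, so restore by explicit prefix rather than relying on it.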

Set the write_version argument if you are in a hurry.

I will try to upgrade TensorFlow and make the changes soon.

Thanks for pointing this out.

Feb 04 '17 18:02 jiegzhan

Yep, testing it with V1 now.

Feb 04 '17 18:02 Tomas0413

Yep, works fine with:

saver = tf.train.Saver(tf.all_variables(), write_version=tf.train.SaverDef.V1)

This is what the warning message looks like:

WARNING:tensorflow:*******************************************************
WARNING:tensorflow:*******************************************************
WARNING:tensorflow:TensorFlow's V1 checkpoint format has been deprecated.
WARNING:tensorflow:TensorFlow's V1 checkpoint format has been deprecated.
WARNING:tensorflow:Consider switching to the more efficient V2 format:
WARNING:tensorflow:Consider switching to the more efficient V2 format:
WARNING:tensorflow: tf.train.Saver(write_version=tf.train.SaverDef.V2)
WARNING:tensorflow: tf.train.Saver(write_version=tf.train.SaverDef.V2)
WARNING:tensorflow:now on by default.
WARNING:tensorflow:now on by default.
WARNING:tensorflow:*******************************************************
WARNING:tensorflow:*******************************************************

Thanks

Feb 04 '17 18:02 Tomas0413

Hi guys, is there any solution for the above issue? If yes, please reply.

Aug 30 '17 04:08 ghost
