
Epoch limit reached

Open rakibhasan48 opened this issue 7 years ago • 4 comments

I am running the following:

```
python train.py --slices 55 --width 12 --stride 1 --Bwidth 350 --vocabulary_size 29 \
  --height 25 --train_data_pattern ./tf-data/handwritten-test-{}.tfrecords --train_dir models-feds \
  --test_data_pattern ./tf-data/handwritten-test-{}.tfrecords --max_steps 20 --batch_size 20 --beam_size 1 \
  --input_chanels 1 --start_new_model --rnn_cell LSTM --model LSTMCTCModel --num_epochs 6000
```

Output:

```
FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
INFO:tensorflow:/job:master/task:0: Tensorflow version: 1.1.0.
(8750, '', 25, 350, 1)
[20, 25, 350, 1]
0
[20, 25, None, 1]
INFO:tensorflow:/job:master/task:0: Removing existing train directory.
INFO:tensorflow:/job:master/task:0: Flag 'start_new_model' is set. Building a new model.
INFO:tensorflow:Using batch size of 20 for training.
tf-data/handwritten-test-{}.tfrecords
INFO:tensorflow:Number of training files: 3.
(8750, '', 25, 350, 1)
(8750, '', 25, 350, 1)
INFO:tensorflow:Using batch size of 20 for testing.
tf-data/handwritten-test-{}.tfrecords
INFO:tensorflow:Number of testing files: 3.
(8750, '', 25, 350, 1)
(8750, '********************', 25, 350, 1)
Tensor("Reshape:0", shape=(20, 25, 350, 1), dtype=float32)
[20, 25, 350, 1]
0
[20, 25, None, 1]
INFO:tensorflow:/job:master/task:0: Starting managed session.
2018-03-19 18:25:13.630111: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630151: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630159: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630174: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630180: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:16.371931: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-19 18:25:16.372284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.17GiB
Free memory: 11.10GiB
2018-03-19 18:25:16.372318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2018-03-19 18:25:16.372332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2018-03-19 18:25:16.372347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
INFO:tensorflow:/job:master/task:0: Entering training loop.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:models-feds/model.ckpt-0 is not in all_model_checkpoint_paths. Manually adding it.
2018-03-19 18:25:19.788531: W tensorflow/core/kernels/queue_base.cc:302] _3_test_input/shuffle_batch_join/random_shuffle_queue: Skipping cancelled dequeue attempt with queue not closed
2018-03-19 18:25:19.789465: W tensorflow/core/kernels/queue_base.cc:302] _3_test_input/shuffle_batch_join/random_shuffle_queue: Skipping cancelled dequeue attempt with queue not closed
INFO:tensorflow:/job:master/task:0: Done training -- epoch limit reached.
INFO:tensorflow:/job:master/task:0: Exited training loop.
```

rakibhasan48 avatar Mar 19 '18 18:03 rakibhasan48

@rakibhasan48 What is your OS, and how many GB of RAM do you have?

ghost avatar Apr 05 '18 08:04 ghost

I am also getting the same error. What is the solution?

JiteshPshah avatar Apr 17 '18 11:04 JiteshPshah

I tested on AWS with a Titan XP, so specs shouldn't be the problem.

rakibhasan48 avatar Apr 17 '18 11:04 rakibhasan48

Check the `files = [data_pattern.format(j) for j in range(3)] if nameT=='train' else [data_pattern.format(j) for j in range(3,6)]` line in train.py to specify the number of training and testing tfrecords, and verify that each tfrecord actually contains images and labels.

johnsmithm avatar May 05 '18 10:05 johnsmithm
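To make the advice above concrete, here is a small stand-alone sketch (plain Python, no TensorFlow; `expected_files` and `count_tfrecord_records` are illustrative names, not functions from this repo). The first function reproduces the quoted selection line — the pattern is formatted with indices 0–2 for training and 3–5 for testing, so files with those exact names must exist on disk. The second counts the records in a tfrecord file by walking the documented TFRecord framing (8-byte little-endian payload length, 4-byte length CRC, payload, 4-byte payload CRC); the CRCs are skipped, not validated.

```python
import os
import struct

# Illustrative reimplementation of the selection line quoted from
# train.py: indices 0-2 select training files, 3-5 testing files.
def expected_files(data_pattern, nameT):
    return ([data_pattern.format(j) for j in range(3)]
            if nameT == 'train'
            else [data_pattern.format(j) for j in range(3, 6)])

# Count records in a TFRecord file by walking its on-disk framing:
#   uint64 little-endian payload length
#   uint32 masked CRC of the length (skipped here)
#   payload bytes
#   uint32 masked CRC of the payload (skipped here)
def count_tfrecord_records(path):
    n = 0
    with open(path, 'rb') as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break  # end of file (or truncated record)
            (length,) = struct.unpack('<Q', header)
            f.seek(4 + length + 4, os.SEEK_CUR)  # skip CRCs and payload
            n += 1
    return n

if __name__ == '__main__':
    pattern = './tf-data/handwritten-test-{}.tfrecords'
    for path in expected_files(pattern, 'train'):
        if os.path.exists(path):
            print(path, count_tfrecord_records(path), 'records')
        else:
            print(path, 'is missing')
```

A zero record count (or a missing file) for any of the training paths could explain the input queue draining and the "epoch limit reached" message appearing within seconds of startup. Checking that each record actually holds image and label features additionally requires parsing the payload as a `tf.train.Example`, which is omitted here.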