Chapter 3: the following problem occurs when running on the GPU
[julyedu_455140@julyedu-gpu slim]$ python train_image_classifier.py \
    --train_dir=satellite/train_dir \
    --dataset_name=satellite \
    --dataset_split_name=train \
    --dataset_dir=satellite/data \
    --model_name=inception_v3 \
    --checkpoint_path=satellite/pretrained/inception_v3.ckpt \
    --checkpoint_exclude_scopes=InceptionV3/Logits,InceptionV3/AuxLogits \
    --max_number_of_steps=100000 \
    --batch_size=32 \
    --learning_rate=0.001 \
    --learning_rate_decay_type=fixed \
    --save_interval_secs=300 \
    --save_summaries_secs=10 \
    --log_every_n_steps=1 \
    --optimizer=rmsprop \
    --weight_decay=0.00004
WARNING:tensorflow:From train_image_classifier.py:398: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
INFO:tensorflow:Fine-tuning from satellite/pretrained/inception_v3.ckpt
WARNING:tensorflow:From /usr/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py:737: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-11-05 23:27:03.944383: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-05 23:27:04.173351: E tensorflow/core/common_runtime/direct_session.cc:158] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 17071734784
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InternalError'>, Failed to create session.
Traceback (most recent call last):
  File "train_image_classifier.py", line 573, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train_image_classifier.py", line 569, in main
    sync_optimizer=optimizer if FLAGS.sync_replicas else None)
  File "/usr/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 748, in train
    master, start_standard_services=False, config=session_config) as sess:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 1005, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 833, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 994, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 731, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 281, in prepare_session
    config=config)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 184, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1563, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 633, in __init__
    self._session = tf_session.TF_NewSession(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
[julyedu_455140@julyedu-gpu slim]$
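The key line in the output above is the one from direct_session.cc: the session fails while the CUDA context is being created (cuDevicePrimaryCtxRetain), before any training step runs. As a rough sketch, the snippet below pulls the device ordinal, the CUDA error name and the reported total memory out of that line and converts the byte count to GiB. The helper name parse_cuda_init_error is made up for illustration; the error text is copied from the log. One common reason for an out-of-memory error at this stage is that another process already holds the GPU, but nothing in this thread confirms that this was the cause here.

```python
import re

# Error line copied from the log above.
ERROR_LINE = (
    "Internal: failed initializing StreamExecutor for CUDA device ordinal 0: "
    "Internal: failed call to cuDevicePrimaryCtxRetain: "
    "CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 17071734784"
)


def parse_cuda_init_error(line):
    """Return (device_ordinal, cuda_error_name, total_gib), or None if no match.

    Hypothetical helper, not part of TensorFlow.
    """
    match = re.search(
        r"CUDA device ordinal (\d+):.*?(CUDA_ERROR_\w+); "
        r"total memory reported: (\d+)",
        line,
    )
    if match is None:
        return None
    ordinal = int(match.group(1))
    # Convert bytes to GiB (2**30 bytes).
    total_gib = int(match.group(3)) / float(2 ** 30)
    return ordinal, match.group(2), total_gib


ordinal, error_name, total_gib = parse_cuda_init_error(ERROR_LINE)
print("device %d: %s, total memory %.2f GiB" % (ordinal, error_name, total_gib))
```

The reported total works out to a little under 16 GiB, so the card itself is not small; the error is about memory available when the context is created, not about the model size.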
Same problem here. How can it be solved?
How do you solve this?
Solved it:
config = tf.ConfigProto(allow_soft_placement=True)
slim.learning.train(
    train_tensor,
    logdir=FLAGS.train_dir,
    master=FLAGS.master,
    is_chief=(FLAGS.task == 0),
    init_fn=_get_init_fn(),
    summary_op=summary_op,
    number_of_steps=FLAGS.max_number_of_steps,
    log_every_n_steps=FLAGS.log_every_n_steps,
    save_summaries_secs=FLAGS.save_summaries_secs,
    save_interval_secs=FLAGS.save_interval_secs,
    sync_optimizer=optimizer if FLAGS.sync_replicas else None,
    session_config=config)
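For anyone who wants to keep the session settings in one place, here is a minimal sketch of a helper that builds the config. The function name make_session_config and the allow_growth switch are my own additions and do not appear in the fix posted in this thread: allow_soft_placement=True lets TensorFlow place an op on another device when the requested one cannot be used, and gpu_options.allow_growth=True asks TensorFlow to allocate GPU memory on demand instead of reserving most of it at session creation. Whether allow_growth is needed for this particular error is an assumption; it cannot help if another process already occupies the GPU.

```python
def make_session_config(allow_growth=False):
    """Build a session config with soft placement enabled.

    Hypothetical helper. allow_growth is an optional extra that is NOT part
    of the fix reported in this thread.
    """
    # Imported lazily so that defining the helper does not require TensorFlow.
    import tensorflow as tf

    # TF 1.x exposes tf.ConfigProto; newer releases keep it under tf.compat.v1.
    config_proto = getattr(tf, "ConfigProto", None) or tf.compat.v1.ConfigProto

    config = config_proto(allow_soft_placement=True)
    if allow_growth:
        # Allocate GPU memory as needed instead of all at once.
        config.gpu_options.allow_growth = True
    return config
```

In train_image_classifier.py the result would then be passed as session_config=make_session_config() in the slim.learning.train() call.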
Solved the training error on the GPU:
INFO:tensorflow:Recording summary at step 191.
INFO:tensorflow:Recording summary at step 198.
INFO:tensorflow:global step 200: loss = 1.9391 (0.905 sec/step)
INFO:tensorflow:Recording summary at step 206.
INFO:tensorflow:global step 210: loss = 1.8334 (0.227 sec/step)
INFO:tensorflow:Recording summary at step 214.
INFO:tensorflow:global step 220: loss = 1.7586 (0.221 sec/step)
INFO:tensorflow:Recording summary at step 222.
INFO:tensorflow:global step 230: loss = 1.6027 (0.227 sec/step)
INFO:tensorflow:Recording summary at step 230.
INFO:tensorflow:Recording summary at step 237.
INFO:tensorflow:global step 240: loss = 1.7777 (0.283 sec/step)
INFO:tensorflow:Recording summary at step 245.
INFO:tensorflow:global step 250: loss = 1.9851 (0.223 sec/step)
INFO:tensorflow:Recording summary at step 253.
INFO:tensorflow:global step 260: loss = 1.5877 (0.228 sec/step)
INFO:tensorflow:Recording summary at step 260.
INFO:tensorflow:Recording summary at step 268.
INFO:tensorflow:global step 270: loss = 1.5395 (1.007 sec/step)
INFO:tensorflow:Recording summary at step 276.
INFO:tensorflow:global step 280: loss = 1.5963 (0.224 sec/step)
INFO:tensorflow:Recording summary at step 283.
INFO:tensorflow:global step 290: loss = 1.5380 (0.223 sec/step)
INFO:tensorflow:Recording summary at step 291.
INFO:tensorflow:Recording summary at step 298.
INFO:tensorflow:global step 300: loss = 1.8641 (1.101 sec/step)
INFO:tensorflow:Recording summary at step 306.
INFO:tensorflow:global step 310: loss = 1.9054 (0.227 sec/step)
INFO:tensorflow:Recording summary at step 312.

###########################
# Kicks off the training. #
###########################
In this part of the script, change two things:
1. Add one new line of code: config = tf.ConfigProto(allow_soft_placement=True)
2. Add one new argument to slim.learning.train(): session_config=config
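To check from a log like the one above that training really is running at GPU speed, the step lines can be parsed and averaged. This is only a sketch: the function name parse_steps is made up, and the sample lines are the first few entries copied from the log in this post.

```python
import re

# A few lines copied from the training log above.
LOG_LINES = [
    "INFO:tensorflow:Recording summary at step 198.",
    "INFO:tensorflow:global step 200: loss = 1.9391 (0.905 sec/step)",
    "INFO:tensorflow:global step 210: loss = 1.8334 (0.227 sec/step)",
    "INFO:tensorflow:global step 220: loss = 1.7586 (0.221 sec/step)",
]

STEP_RE = re.compile(r"global step (\d+): loss = ([\d.]+) \(([\d.]+) sec/step\)")


def parse_steps(lines):
    """Return a list of (step, loss, sec_per_step) for every 'global step' line."""
    records = []
    for line in lines:
        match = STEP_RE.search(line)
        if match:
            records.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return records


records = parse_steps(LOG_LINES)
mean_sec = sum(r[2] for r in records) / len(records)
print("parsed %d steps, mean %.3f sec/step" % (len(records), mean_sec))
```

Lines that only record a summary are skipped, so the average covers the logged training steps only.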