DCGAN-tensorflow

Is the training loss normal?

LinZhineng opened this issue · 10 comments

Hi, I am new to TensorFlow and I have some confusion about the training loss. At the beginning, d_loss is always much larger than g_loss, as shown below, which differs from the results of dcgan.torch. Is this normal?

Epoch: [ 0] [ 0/3165] time: 2.6742, d_loss: 7.06172514, g_loss: 0.00106246
Epoch: [ 0] [ 1/3165] time: 4.7950, d_loss: 6.95885229, g_loss: 0.00125823
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 2487 get requests, put_count=2449 evicted_count=1000 eviction_rate=0.40833 and unsatisfied allocation rate=0.457579
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
Epoch: [ 0] [ 2/3165] time: 6.1750, d_loss: 7.25680256, g_loss: 0.00104984
Epoch: [ 0] [ 3/3165] time: 7.5547, d_loss: 6.89718437, g_loss: 0.00364288
Epoch: [ 0] [ 4/3165] time: 8.9374, d_loss: 5.45913506, g_loss: 0.02250301
Epoch: [ 0] [ 5/3165] time: 10.3308, d_loss: 7.75127983, g_loss: 0.00146924
Epoch: [ 0] [ 6/3165] time: 11.7197, d_loss: 4.75904989, g_loss: 0.04762752
Epoch: [ 0] [ 7/3165] time: 13.1084, d_loss: 5.15711403, g_loss: 0.03134135
Epoch: [ 0] [ 8/3165] time: 14.4916, d_loss: 5.35569286, g_loss: 0.04407354
Epoch: [ 0] [ 9/3165] time: 15.8697, d_loss: 4.73206615, g_loss: 0.05500766
Epoch: [ 0] [ 10/3165] time: 17.2533, d_loss: 3.20903492, g_loss: 0.34848747
Epoch: [ 0] [ 11/3165] time: 18.6306, d_loss: 8.54726505, g_loss: 0.00069389
Epoch: [ 0] [ 12/3165] time: 20.0077, d_loss: 1.97646499, g_loss: 2.43928814
Epoch: [ 0] [ 13/3165] time: 21.3865, d_loss: 8.04584980, g_loss: 0.00098451
Epoch: [ 0] [ 14/3165] time: 22.7670, d_loss: 1.93407261, g_loss: 2.32639980
Epoch: [ 0] [ 15/3165] time: 24.1502, d_loss: 8.04065609, g_loss: 0.00085019
Epoch: [ 0] [ 16/3165] time: 25.5354, d_loss: 2.01121569, g_loss: 4.33200264
Epoch: [ 0] [ 17/3165] time: 26.9249, d_loss: 5.53398705, g_loss: 0.01694447
Epoch: [ 0] [ 18/3165] time: 28.3111, d_loss: 1.31883585, g_loss: 5.00018692
Epoch: [ 0] [ 19/3165] time: 29.6976, d_loss: 5.65370369, g_loss: 0.01064641
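For reference, here is a back-of-the-envelope sketch (plain NumPy, not this repo's code) of why a very large d_loss and a tiny g_loss tend to show up together under the standard non-saturating GAN losses: g_loss ≈ 0.001 means the discriminator assigns probability ≈ 0.999 that the generated samples are real, and that same probability makes the fake half of d_loss blow up to around 7. The probabilities below are assumed values chosen to roughly match the first log lines.

```python
# Back-of-the-envelope sketch of the standard (non-saturating) GAN losses.
# Plain NumPy, not the repo's code; d_on_fake / d_on_real are assumed values.
import numpy as np

d_on_fake = 0.999   # hypothetical: D believes generated images are real
d_on_real = 0.5     # hypothetical: D is undecided about real images

g_loss      = -np.log(d_on_fake)         # ~0.001  (G already "fools" D)
d_loss_fake = -np.log(1.0 - d_on_fake)   # ~6.9    (D is badly wrong on fakes)
d_loss_real = -np.log(d_on_real)         # ~0.69
d_loss      = d_loss_real + d_loss_fake  # ~7.6, same order as the log above

print("g_loss ~= %.5f, d_loss ~= %.2f" % (g_loss, d_loss))
```

So, assuming these standard losses, a d_loss around 7 next to a g_loss around 0.001 in the first iterations is just the two losses being coupled through the discriminator's output, rather than a bug by itself.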

In addition, as long as d_loss stays larger than g_loss and remains stable, training keeps going. However, a NaN error suddenly appears later in the run.

Epoch: [ 0] [2344/3165] time: 3322.8894, d_loss: 1.39191413, g_loss: 0.75624585
Epoch: [ 0] [2345/3165] time: 3324.2772, d_loss: 1.60122275, g_loss: 0.36871552
Epoch: [ 0] [2346/3165] time: 3325.6905, d_loss: 1.57876384, g_loss: 0.70225191
Epoch: [ 0] [2347/3165] time: 3327.0963, d_loss: 1.39167571, g_loss: 0.59910935
Epoch: [ 0] [2348/3165] time: 3328.4929, d_loss: 1.43457556, g_loss: 0.60285681
Epoch: [ 0] [2349/3165] time: 3329.8979, d_loss: 1.47647548, g_loss: 0.66025651
Traceback (most recent call last):
  File "main.py", line 59, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "main.py", line 43, in main
    dcgan.train(FLAGS)
  File "/data/project/DCGAN-tensorflow-master/model.py", line 204, in train
    feed_dict={ self.z: batch_z })
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 340, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 564, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 637, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 659, in _do_call
    e.code)
tensorflow.python.framework.errors.InvalidArgumentError: Nan in summary histogram for: HistogramSummary_2
  [[Node: HistogramSummary_2 = HistogramSummary[T=DT_FLOAT, device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary_2/tag, Sigmoid_1/126)]]
Caused by op u'HistogramSummary_2', defined at:
  File "main.py", line 59, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "main.py", line 40, in main
    dataset_name=FLAGS.dataset, is_crop=FLAGS.is_crop, checkpoint_dir=FLAGS.checkpoint_dir)
  File "/data/project/DCGAN-tensorflow-master/model.py", line 69, in __init__
    self.build_model()
  File "/data/project/DCGAN-tensorflow-master/model.py", line 99, in build_model
    self.d__sum = tf.histogram_summary("d", self.D)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/logging_ops.py", line 113, in histogram_summary
    tag=tag, values=values, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 55, in _histogram_summary
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2154, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1154, in __init__
    self._traceback = _extract_stack()
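The crash message itself says that the sigmoid output of the discriminator (the tensor passed to tf.histogram_summary("d", self.D)) contained NaN, so the summary op is only where the NaN is detected, not where it is produced. One way to surface it closer to its source is to wrap the suspect tensor in tf.check_numerics before summarizing it. Below is a minimal, self-contained sketch written against the TF 1.x API (the 0.x code in this thread uses tf.histogram_summary instead of tf.summary.histogram); it is an illustration, not the repo's code.

```python
# Hypothetical debugging sketch (TF 1.x graph mode), not the repo's code:
# tf.check_numerics raises an error naming the tagged tensor as soon as it
# contains NaN/Inf, instead of failing later inside the histogram summary.
import tensorflow as tf

logits = tf.placeholder(tf.float32, shape=[None], name="d_logits")
probs = tf.sigmoid(logits)                                 # analogous to self.D
checked = tf.check_numerics(probs, "NaN/Inf in discriminator output")
d_sum = tf.summary.histogram("d", checked)

with tf.Session() as sess:
    try:
        # Feeding a NaN logit triggers the check before the summary runs.
        sess.run(d_sum, feed_dict={logits: [0.0, float("nan")]})
    except tf.errors.InvalidArgumentError as e:
        print("caught: %s" % e.message)
```

In this repo the analogous change would be to wrap self.D (and possibly the losses) before the histogram summaries in build_model; if the check fires, the NaN originates in the discriminator or its losses blowing up, not in the summary op.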

Any advice? Thanks and best regards!

LinZhineng · Aug 17 '16

Hi. I don't experience this error. Which dataset are you running your experiments on?

sahiliitm · Aug 27 '16

Hi @sahiliitm, thanks for your reply. I work on the celebA aligned image dataset and I run into the problem described above every time. Could it be a difference in platform versions? Which version of TensorFlow do you use?

LinZhineng · Aug 29 '16

Hi @sahiliitm, I also ran some experiments on MNIST, but encountered the same problem.

LinZhineng · Aug 29 '16

I used celebA as well. I use the TensorFlow 0.10 GPU version, installed via pip.

sahiliitm · Aug 30 '16

Hi @sahiliitm, I installed the TensorFlow 0.8, 0.9, and 0.10 GPU versions via pip; unfortunately, the same error occurs with all of them.

LinZhineng · Sep 03 '16

Hi, I have encountered the same problem. Can you tell me how you solved it?

myw8 · Nov 28 '16

Hi, I have encountered the same problem. Can you tell me how you solved it?

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
{'batch_size': 64, 'beta1': 0.5, 'c_dim': 3, 'checkpoint_dir': 'checkpoint', 'dataset': 'test', 'epoch': 25, 'input_fname_pattern': '*.jpg', 'input_height': 108, 'input_width': None, 'is_crop': False, 'is_train': False, 'learning_rate': 0.0002, 'output_height': 64, 'output_width': None, 'sample_dir': 'samples', 'train_size': inf, 'visualize': False}
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX 980
major: 5 minor: 2 memoryClockRate (GHz) 1.2155
pciBusID 0000:05:00.0
Total memory: 4.00GiB
Free memory: 1.36GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x350d230
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 1 with properties:
name: Quadro K2000
major: 3 minor: 0 memoryClockRate (GHz) 0.954
pciBusID 0000:04:00.0
Total memory: 2.00GiB
Free memory: 347.90MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:59] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_init.cc:59] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y N
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 1: N Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:826] Ignoring gpu device (device: 1, name: Quadro K2000, pci bus id: 0000:04:00.0) with Cuda multiprocessor count: 2. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
Traceback (most recent call last):
  File "main.py", line 96, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "main.py", line 76, in main
    sample_dir=FLAGS.sample_dir)
  File "/home/qyq/q2017work/Image_Reconstruction/DCGAN-tensorflow-master/model.py", line 75, in __init__
    self.build_model()
  File "/home/qyq/q2017work/Image_Reconstruction/DCGAN-tensorflow-master/model.py", line 111, in build_model
    self.D_, self.D_logits_ = self.discriminator(self.G, reuse=True)
  File "/home/qyq/q2017work/Image_Reconstruction/DCGAN-tensorflow-master/model.py", line 327, in discriminator
    h4 = linear(tf.reshape(h3, [self.batch_size, -1]), 1, 'd_h3_lin')
  File "/home/qyq/q2017work/Image_Reconstruction/DCGAN-tensorflow-master/ops.py", line 98, in linear
    tf.random_normal_initializer(stddev=stddev))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 873, in get_variable
    custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 700, in get_variable
    custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 217, in get_variable
    validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 202, in _true_getter
    caching_device=caching_device, validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 499, in _get_single_variable
    found_var.get_shape()))
ValueError: Trying to share variable discriminator/d_h3_lin/Matrix, but specified shape (8192, 1) and found shape (25088, 1).
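This traceback is actually a different failure from the NaN above: the discriminator is first built on the real input images and then reused (reuse=True) on the generator output, and the two calls disagree about the flattened size feeding d_h3_lin. Assuming the standard DCGAN discriminator (four stride-2, SAME-padded convolutions, the last with 512 channels, followed by a linear layer), both shapes in the error can be reproduced from the flags shown in the log above (input_height=108, is_crop=False, output_height=64); the sketch below is plain Python for illustration, not the repo's code.

```python
# Hypothetical sketch of where the shapes 25088 and 8192 likely come from,
# assuming four stride-2 convolutions with SAME padding and 512 channels
# before the d_h3_lin linear layer. Not the repo's code.
def flat_dim(height, width, channels=512, stride2_convs=4):
    """Flattened size fed to the discriminator's final linear layer."""
    for _ in range(stride2_convs):
        height = (height + 1) // 2   # SAME padding, stride 2
        width = (width + 1) // 2
    return height * width * channels

print(flat_dim(108, 108))  # 7 * 7 * 512 = 25088  <- uncropped 108x108 real images
print(flat_dim(64, 64))    # 4 * 4 * 512 = 8192   <- 64x64 generator output
```

If that arithmetic holds, the mismatch comes from feeding uncropped 108x108 images into a discriminator that is then reused on 64x64 generated images; making the real inputs and the generator output the same size (for example by enabling cropping, or matching input and output heights) should make both flattened sizes agree. This is an educated guess from the flags in the log, not a confirmed fix.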

yeqingQian · Mar 02 '17

How did you solve the problem?

Ivan-Zhao · Apr 06 '17

I am also having the same problem. Please let me know if you figured out how to solve it.

maz369 · Nov 07 '19

Has anyone solved this problem?

huangfeng95 · Nov 25 '20