DCGAN-tensorflow
Is the training loss normal?
Hi, I am a newbie to TensorFlow and I have some confusion about the training loss. At the beginning, d_loss is always much bigger than g_loss, as shown below, which is different from the result in dcgan.torch. Is this normal?

Epoch: [ 0] [ 0/3165] time: 2.6742, d_loss: 7.06172514, g_loss: 0.00106246
Epoch: [ 0] [ 1/3165] time: 4.7950, d_loss: 6.95885229, g_loss: 0.00125823
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 2487 get requests, put_count=2449 evicted_count=1000 eviction_rate=0.40833 and unsatisfied allocation rate=0.457579
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
Epoch: [ 0] [ 2/3165] time: 6.1750, d_loss: 7.25680256, g_loss: 0.00104984
Epoch: [ 0] [ 3/3165] time: 7.5547, d_loss: 6.89718437, g_loss: 0.00364288
Epoch: [ 0] [ 4/3165] time: 8.9374, d_loss: 5.45913506, g_loss: 0.02250301
Epoch: [ 0] [ 5/3165] time: 10.3308, d_loss: 7.75127983, g_loss: 0.00146924
Epoch: [ 0] [ 6/3165] time: 11.7197, d_loss: 4.75904989, g_loss: 0.04762752
Epoch: [ 0] [ 7/3165] time: 13.1084, d_loss: 5.15711403, g_loss: 0.03134135
Epoch: [ 0] [ 8/3165] time: 14.4916, d_loss: 5.35569286, g_loss: 0.04407354
Epoch: [ 0] [ 9/3165] time: 15.8697, d_loss: 4.73206615, g_loss: 0.05500766
Epoch: [ 0] [ 10/3165] time: 17.2533, d_loss: 3.20903492, g_loss: 0.34848747
Epoch: [ 0] [ 11/3165] time: 18.6306, d_loss: 8.54726505, g_loss: 0.00069389
Epoch: [ 0] [ 12/3165] time: 20.0077, d_loss: 1.97646499, g_loss: 2.43928814
Epoch: [ 0] [ 13/3165] time: 21.3865, d_loss: 8.04584980, g_loss: 0.00098451
Epoch: [ 0] [ 14/3165] time: 22.7670, d_loss: 1.93407261, g_loss: 2.32639980
Epoch: [ 0] [ 15/3165] time: 24.1502, d_loss: 8.04065609, g_loss: 0.00085019
Epoch: [ 0] [ 16/3165] time: 25.5354, d_loss: 2.01121569, g_loss: 4.33200264
Epoch: [ 0] [ 17/3165] time: 26.9249, d_loss: 5.53398705, g_loss: 0.01694447
Epoch: [ 0] [ 18/3165] time: 28.3111, d_loss: 1.31883585, g_loss: 5.00018692
Epoch: [ 0] [ 19/3165] time: 29.6976, d_loss: 5.65370369, g_loss: 0.01064641
In addition, training only keeps going while d_loss stays larger than g_loss and remains stable; at some point a NaN error suddenly appears during the training process.
Epoch: [ 0] [2344/3165] time: 3322.8894, d_loss: 1.39191413, g_loss: 0.75624585
Epoch: [ 0] [2345/3165] time: 3324.2772, d_loss: 1.60122275, g_loss: 0.36871552
Epoch: [ 0] [2346/3165] time: 3325.6905, d_loss: 1.57876384, g_loss: 0.70225191
Epoch: [ 0] [2347/3165] time: 3327.0963, d_loss: 1.39167571, g_loss: 0.59910935
Epoch: [ 0] [2348/3165] time: 3328.4929, d_loss: 1.43457556, g_loss: 0.60285681
Epoch: [ 0] [2349/3165] time: 3329.8979, d_loss: 1.47647548, g_loss: 0.66025651
Traceback (most recent call last):
File "main.py", line 59, in
Any advice? Thanks and best regards!
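For what it's worth, a NaN like this usually comes from a loss that takes log() of a sigmoid output that has saturated to exactly 0 or 1. Building both losses from the raw (pre-sigmoid) logits with tf.nn.sigmoid_cross_entropy_with_logits avoids that, because the sigmoid and the log are fused into one numerically stable op. Below is a minimal sketch of that formulation, not necessarily what this repo does: the tensor names D_logits_real / D_logits_fake are placeholders, and the labels=/logits= keywords are the TF 1.x signature (older 0.x releases take the arguments positionally as (logits, targets)).

```python
import tensorflow as tf

# Illustrative pre-sigmoid discriminator outputs; in a real model these come
# from the discriminator network, not from feed_dict. Names are placeholders.
D_logits_real = tf.placeholder(tf.float32, [None, 1])
D_logits_fake = tf.placeholder(tf.float32, [None, 1])

# sigmoid_cross_entropy_with_logits fuses sigmoid and log, so it stays
# finite even when the logits saturate (no explicit log(0)).
d_loss_real = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(D_logits_real), logits=D_logits_real))
d_loss_fake = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.zeros_like(D_logits_fake), logits=D_logits_fake))
d_loss = d_loss_real + d_loss_fake

# Non-saturating generator loss: train G so that D classifies fakes as real.
g_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(D_logits_fake), logits=D_logits_fake))
```

If the losses are already built this way, the NaN more likely originates elsewhere (for example in batch normalization or from the learning rate), and checking tensors directly can help localize it (see the sketch at the end of the thread).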
Hi. I don't experience this error. Which dataset are you running your experiments on?
Hi, @sahiliitm, thanks for your reply. I work on the celebA aligned image dataset and I hit the problem described above every time. Could there be a difference in platform versions? Which version of TensorFlow do you use?
Hi, @sahiliitm, I also ran some experiments on MNIST, but encountered the same problem.
I used celebA as well. I use the TensorFlow 0.10 GPU version, installed via pip.
Hi, @sahiliitm, I installed the TensorFlow 0.8, 0.9, and 0.10 GPU versions via pip; unfortunately, the same error occurs with all of them.
Hi, I have encountered the same problem. Can you tell me how you solved it? Here is my output:
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
{'batch_size': 64,
'beta1': 0.5,
'c_dim': 3,
'checkpoint_dir': 'checkpoint',
'dataset': 'test',
'epoch': 25,
'input_fname_pattern': '*.jpg',
'input_height': 108,
'input_width': None,
'is_crop': False,
'is_train': False,
'learning_rate': 0.0002,
'output_height': 64,
'output_width': None,
'sample_dir': 'samples',
'train_size': inf,
'visualize': False}
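For context, the dictionary above is just a pretty-print of the parsed command-line flags. Below is a rough sketch of how such a flag table is typically defined with tf.app.flags on these older TensorFlow versions; the exact definitions, defaults, and help strings in the repo's main.py may differ, and some flags from the dump are omitted.

```python
import pprint
import tensorflow as tf

# Illustrative flag definitions matching part of the dump above;
# the real main.py may declare them differently.
flags = tf.app.flags
flags.DEFINE_integer("epoch", 25, "Number of training epochs")
flags.DEFINE_float("learning_rate", 0.0002, "Learning rate for Adam")
flags.DEFINE_float("beta1", 0.5, "Momentum term of Adam")
flags.DEFINE_integer("batch_size", 64, "Number of images per batch")
flags.DEFINE_integer("input_height", 108, "Height of the input images")
flags.DEFINE_string("dataset", "test", "Name of the dataset folder")
flags.DEFINE_string("input_fname_pattern", "*.jpg", "Glob pattern for input images")
flags.DEFINE_string("checkpoint_dir", "checkpoint", "Directory to save checkpoints")
flags.DEFINE_string("sample_dir", "samples", "Directory to save generated samples")
flags.DEFINE_boolean("is_train", False, "True for training, False for testing")
FLAGS = flags.FLAGS


def main(_):
    # On these older TF versions, FLAGS.__flags is a plain dict of
    # flag name -> value, so pretty-printing it yields a dump like the one above.
    pprint.pprint(FLAGS.__flags)


if __name__ == "__main__":
    tf.app.run()
```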
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX 980
major: 5 minor: 2 memoryClockRate (GHz) 1.2155
pciBusID 0000:05:00.0
Total memory: 4.00GiB
Free memory: 1.36GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: 0x350d230
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 1 with properties:
name: Quadro K2000
major: 3 minor: 0 memoryClockRate (GHz) 0.954
pciBusID 0000:04:00.0
Total memory: 2.00GiB
Free memory: 347.90MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:59] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_init.cc:59] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y N
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 1: N Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:826] Ignoring gpu device (device: 1, name: Quadro K2000, pci bus id: 0000:04:00.0) with Cuda multiprocessor count: 2. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
Traceback (most recent call last):
File "main.py", line 96, in
How did you solve the problem?
I am also having the same problem. Please let me know if you figured out how to solve it.
Has anyone solved this problem?
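Not a definitive fix, but one way to find where the NaN first appears is to wrap the suspect tensors in tf.check_numerics, or to run tf.add_check_numerics_ops() alongside the training ops, so the session fails at the first op that produces a NaN/Inf instead of crashing later. A generic TF 1.x sketch; d_loss, g_loss, and d_optim below stand for whatever tensors and ops your own graph builds, not variables from this repo.

```python
import tensorflow as tf


def guard(tensor, name):
    # Raises an InvalidArgumentError mentioning `name` as soon as the tensor
    # contains NaN or Inf, instead of letting it propagate silently.
    return tf.check_numerics(tensor, message="NaN/Inf detected in %s" % name)


# Example usage inside a training script (illustrative names):
# d_loss = guard(d_loss, "d_loss")
# g_loss = guard(g_loss, "g_loss")

# Or check every floating-point tensor in the graph in one go:
# check_op = tf.add_check_numerics_ops()
# sess.run([check_op, d_optim], feed_dict=...)
```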