Training the transition model is too resource intensive and uses too much memory; possible bug
After training the autoencoder, I try to train the transition model as described in the same document,
using
./server.py --time 60 --batch 64
and
./train_generative_model.py transition --batch 64 --name transition
in two separate tmux sessions.
Within about a minute of running the training command, the process is killed because my memory and swap (16 GB + 10 GB) are used up, and I'm still on epoch one.
Here is a dump:
./train_generative_model.py transition --batch 64 --name transition
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX 1060 6GB
major: 6 minor: 1 memoryClockRate (GHz) 1.7085
pciBusID 0000:01:00.0
Total memory: 5.93GiB
Free memory: 5.58GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0)
T.shape: (64, 14, 512)
Transition variables:
transition/dreamyrnn_1_W:0
transition/dreamyrnn_1_U:0
transition/dreamyrnn_1_b:0
transition/dreamyrnn_1_V:0
transition/dreamyrnn_1_ext_b:0
Epoch 1/200
Killed
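For anyone hitting the same thing, a useful first step is confirming that it really is host RAM (not GPU memory) that fills up during epoch one. Below is a minimal sketch for polling the resident memory of the training process; it assumes the third-party psutil package, which is not part of this repo.

# watch_mem.py -- hypothetical helper, not part of this repo.
# Polls the resident set size (host RAM) of a process so you can
# see whether memory grows steadily while epoch 1 is running.
import sys
import time

import psutil  # third-party package; assumed to be installed


def watch(pid, interval=5.0):
    proc = psutil.Process(pid)
    while proc.is_running():
        rss_gb = proc.memory_info().rss / 1024 ** 3
        print("RSS: %.2f GiB" % rss_gb)
        time.sleep(interval)


if __name__ == "__main__":
    # Usage: python watch_mem.py <pid of train_generative_model.py>
    watch(int(sys.argv[1]))

Run it in a third tmux pane alongside the server and the training script; if the reported RSS climbs toward 16 GB before the first epoch finishes, the kill is coming from the OOM killer on host memory rather than from the GPU.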
It is very resource intensive, yes. I have seen reports elsewhere that Keras leaks a lot of memory. I used to have a TensorFlow-only implementation that seemed lighter, but it was less convenient, which is why I opted for Keras for the release.
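One workaround that sometimes helps with leaky Keras/TensorFlow sessions is to train in short chunks and reset the backend session between them, instead of one long fit. The sketch below is not the repo's training code: it uses a stand-in model and a hypothetical checkpoint file, and the nb_epoch argument is the Keras 1 spelling (newer Keras uses epochs=).

# Hypothetical workaround sketch -- not the repo's training code.
# Trains in short chunks and clears the TensorFlow session between
# them so Keras releases memory it may otherwise hold on to.
import numpy as np
from keras import backend as K
from keras.layers import Dense
from keras.models import Sequential


def build_model():
    # Stand-in model; the real transition model lives in this repo.
    model = Sequential()
    model.add(Dense(32, input_dim=16, activation="relu"))
    model.add(Dense(16))
    model.compile(optimizer="adam", loss="mse")
    return model


x = np.random.rand(1024, 16).astype("float32")
y = np.random.rand(1024, 16).astype("float32")

weights_path = "chunk_weights.h5"  # hypothetical checkpoint file
model = build_model()

for chunk in range(20):
    # A few epochs at a time instead of one long fit.
    model.fit(x, y, batch_size=64, nb_epoch=10)  # 'epochs=' in newer Keras
    model.save_weights(weights_path)

    # Tear down the TF graph/session, rebuild, and reload the weights
    # before the next chunk.
    K.clear_session()
    model = build_model()
    model.load_weights(weights_path)

Whether this actually recovers memory depends on the Keras and TensorFlow versions in use, so treat it as something to try rather than a guaranteed fix.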
@kamal94: Were you able to resolve this issue? I am having the same problem; my training sometimes fails on epoch 1/200 or 2/200 and never goes beyond that. Any suggestions?
How do you train the autoencoder with train_generative_model.py successfully? I ran into some difficulty. Do I have to change something in the code?
Have you solved this issue? I am having the same problem; my training sometimes fails on epoch 10/200 or 40/200 and never goes beyond that. Any suggestions?
Traceback (most recent call last):
File "./train_generative_model.py", line 168, in