Tacotron-2
Can't Train On Second GPU (Ubuntu 18.04)
Hi, I'm trying to train with 2 GPUs, a 1080 and a 1070 (both 8 GB), but training runs almost 4x slower with both and crashes after 3-4 steps.
Here are the last few lines from training with both; the full log is massive, so I put it in a pastebin: https://pastebin.com/raw/fEsStDHe
Exception in thread background:
Traceback (most recent call last):
File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.CancelledError: Enqueue operation was cancelled
[[{{node datafeeder/eval_queue_enqueue}} = QueueEnqueueV2[Tcomponents=[DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](datafeeder/eval_queue, _arg_datafeeder/inputs_0_1, _arg_datafeeder/input_lengths_0_0, _arg_datafeeder/mel_targets_0_3, _arg_datafeeder/token_targets_0_6, _arg_datafeeder/linear_targets_0_2, _arg_datafeeder/targets_lengths_0_5, _arg_datafeeder/split_infos_0_4)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/feeder.py", line 177, in _enqueue_next_test_group
self._session.run(self._eval_enqueue_op, feed_dict=feed_dict)
File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Enqueue operation was cancelled
[[node datafeeder/eval_queue_enqueue (defined at /home/jai/Documents/Tacotron-2-UK-2/tacotron/feeder.py:99) = QueueEnqueueV2[Tcomponents=[DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](datafeeder/eval_queue, _arg_datafeeder/inputs_0_1, _arg_datafeeder/input_lengths_0_0, _arg_datafeeder/mel_targets_0_3, _arg_datafeeder/token_targets_0_6, _arg_datafeeder/linear_targets_0_2, _arg_datafeeder/targets_lengths_0_5, _arg_datafeeder/split_infos_0_4)]]
Caused by op 'datafeeder/eval_queue_enqueue', defined at:
File "train.py", line 138, in <module>
main()
File "train.py", line 132, in main
train(args, log_dir, hparams)
File "train.py", line 52, in train
checkpoint = tacotron_train(args, log_dir, hparams)
File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/train.py", line 376, in tacotron_train
return train(log_dir, args, hparams)
File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/train.py", line 152, in train
feeder = Feeder(coord, input_path, hparams)
File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/feeder.py", line 99, in __init__
self._eval_enqueue_op = eval_queue.enqueue(self._placeholders)
File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/ops/data_flow_ops.py", line 341, in enqueue
self._queue_ref, vals, name=scope)
File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 3984, in queue_enqueue_v2
timeout_ms=timeout_ms, name=name)
File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
CancelledError (see above for traceback): Enqueue operation was cancelled
[[node datafeeder/eval_queue_enqueue (defined at /home/jai/Documents/Tacotron-2-UK-2/tacotron/feeder.py:99) = QueueEnqueueV2[Tcomponents=[DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](datafeeder/eval_queue, _arg_datafeeder/inputs_0_1, _arg_datafeeder/input_lengths_0_0, _arg_datafeeder/mel_targets_0_3, _arg_datafeeder/token_targets_0_6, _arg_datafeeder/linear_targets_0_2, _arg_datafeeder/targets_lengths_0_5, _arg_datafeeder/split_infos_0_4)]]
Traceback (most recent call last):
File "train.py", line 138, in <module>
main()
File "train.py", line 132, in main
train(args, log_dir, hparams)
File "train.py", line 57, in train
raise('Error occured while training Tacotron, Exiting!')
TypeError: exceptions must derive from BaseException
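(As a side note, that final TypeError just masks the real failure: train.py line 57 raises a bare string, which Python 3 doesn't allow; only exception instances can be raised. Something like the following would keep the intended message. This is just a sketch, not the repo's actual code:

```python
# Sketch of a fix for the raise at train.py line 57 (not the repo's actual code):
# raising a plain string is invalid in Python 3 and produces
# "TypeError: exceptions must derive from BaseException", hiding the real error.
raise RuntimeError('Error occured while training Tacotron, Exiting!')
```
)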
I've also tried setting gpu_start_idx to 1 to try training solely on my 1070, but instead I get this error:
initialisation done /gpu:1
Initialized Tacotron model. Dimensions (? = dynamic shape):
Train mode: True
Eval mode: False
GTA mode: False
Synthesis mode: False
Input: (?, ?)
device: 0
embedding: (?, ?, 512)
enc conv out: (?, ?, 512)
encoder out: (?, ?, 512)
decoder out: (?, ?, 80)
residual out: (?, ?, 512)
projected residual out: (?, ?, 80)
mel out: (?, ?, 80)
linear out: (?, ?, 1025)
<stop_token> out: (?, ?)
Tacotron Parameters 29.016 Million.
device: 1
Traceback (most recent call last):
File "train.py", line 138, in <module>
main()
File "train.py", line 132, in main
train(args, log_dir, hparams)
File "train.py", line 52, in train
checkpoint = tacotron_train(args, log_dir, hparams)
File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/train.py", line 376, in tacotron_train
return train(log_dir, args, hparams)
File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/train.py", line 156, in train
model, stats = model_train_mode(args, feeder, hparams, global_step)
File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/train.py", line 87, in model_train_mode
is_training=True, split_infos=feeder.split_infos)
File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/models/tacotron.py", line 247, in initialize
log(' embedding: {}'.format(tower_embedded_inputs[i].shape))
IndexError: list index out of range
I'm running Ubuntu 18.04 (4.15.0-39-generic) with driver 390.87 from the official PPA, CUDA 9.0, and cuDNN v7.4.1.5. Any help is appreciated!
@kwibjo I encountered the second problem too, and I think it is a small bug. Try changing line 245 of tacotron.py to "for i in range(0, hp.tacotron_num_gpus):"; that may solve the problem.
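For what it's worth, here is a toy illustration of why the index goes out of range. The exact original loop bounds are my guess, but the symptom matches: the tower lists only hold tacotron_num_gpus entries, so indexing them with absolute GPU ids overruns them once gpu_start_idx is 1.

```python
# Toy illustration (not the repo's code) of the IndexError with gpu_start_idx = 1.
tacotron_num_gpus = 1
gpu_start_idx = 1                     # train only on the second GPU
tower_embedded_inputs = ['tower_0']   # one entry per tower actually built

# Guessing the old loop indexed by absolute device id, which overruns the list:
try:
    for i in range(gpu_start_idx, gpu_start_idx + tacotron_num_gpus):
        print(tower_embedded_inputs[i])
except IndexError as e:
    print('old loop:', e)             # list index out of range

# Indexing relative to the towers, as in the suggested change, stays in bounds:
for i in range(0, tacotron_num_gpus):
    print('new loop:', tower_embedded_inputs[i])   # tower_0
```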
Thank you @jarred1989, that fixed the crash in the first few steps.
However, I'm now getting either Loss exploded to 19443163401595046756089856.00000 at step 196.00000, avg_loss=1023324389557634032730112.00000]
or Exiting due to exception: Found Inf or NaN global norm. : Tensor had NaN values. Full logs:
https://pastebin.com/raw/PfztfvaD
https://pastebin.com/raw/nn2gaYVL
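If I'm reading the second error right, the message comes from the check_numerics guard inside tf.clip_by_global_norm on the gradients, so it looks like the same underlying problem as the exploding loss: the gradients themselves go NaN/Inf. A minimal, repo-independent reproduction of the message (assuming the same TF 1.x as in the logs):

```python
# Minimal reproduction of "Found Inf or NaN global norm." on TensorFlow 1.x:
# clip_by_global_norm wraps the computed global norm in tf.check_numerics,
# so a single NaN gradient aborts the training step with this exact message.
import numpy as np
import tensorflow as tf

grads = [tf.constant([1.0, np.nan]), tf.constant([2.0, 3.0])]
clipped, global_norm = tf.clip_by_global_norm(grads, clip_norm=1.0)

with tf.Session() as sess:
    sess.run(clipped)  # InvalidArgumentError: Found Inf or NaN global norm.
```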
I have run into 'Found Inf or NaN global norm' too; how did you solve it?
I have the same issues as you. How did you solve them? Thanks very much.