Tacotron-2

Can't Train On Second GPU (Ubuntu 18.04)

Open jaimu97 opened this issue 6 years ago • 4 comments

Hi, I'm trying to train with 2 GPUs, a 1080 and a 1070 (both 8 GB), but training is almost 4x slower with both cards and crashes after 3-4 steps.

Here are the last few lines from training with both. The full log is massive, so I put it in a pastebin: https://pastebin.com/raw/fEsStDHe

Exception in thread background:
Traceback (most recent call last):
  File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.CancelledError: Enqueue operation was cancelled
         [[{{node datafeeder/eval_queue_enqueue}} = QueueEnqueueV2[Tcomponents=[DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](datafeeder/eval_queue, _arg_datafeeder/inputs_0_1, _arg_datafeeder/input_lengths_0_0, _arg_datafeeder/mel_targets_0_3, _arg_datafeeder/token_targets_0_6, _arg_datafeeder/linear_targets_0_2, _arg_datafeeder/targets_lengths_0_5, _arg_datafeeder/split_infos_0_4)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/feeder.py", line 177, in _enqueue_next_test_group
    self._session.run(self._eval_enqueue_op, feed_dict=feed_dict)
  File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Enqueue operation was cancelled
         [[node datafeeder/eval_queue_enqueue (defined at /home/jai/Documents/Tacotron-2-UK-2/tacotron/feeder.py:99)  = QueueEnqueueV2[Tcomponents=[DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](datafeeder/eval_queue, _arg_datafeeder/inputs_0_1, _arg_datafeeder/input_lengths_0_0, _arg_datafeeder/mel_targets_0_3, _arg_datafeeder/token_targets_0_6, _arg_datafeeder/linear_targets_0_2, _arg_datafeeder/targets_lengths_0_5, _arg_datafeeder/split_infos_0_4)]]

Caused by op 'datafeeder/eval_queue_enqueue', defined at:
  File "train.py", line 138, in <module>
    main()
  File "train.py", line 132, in main
    train(args, log_dir, hparams)
  File "train.py", line 52, in train
    checkpoint = tacotron_train(args, log_dir, hparams)
  File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/train.py", line 376, in tacotron_train
    return train(log_dir, args, hparams)
  File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/train.py", line 152, in train
    feeder = Feeder(coord, input_path, hparams)
  File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/feeder.py", line 99, in __init__
    self._eval_enqueue_op = eval_queue.enqueue(self._placeholders)
  File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/ops/data_flow_ops.py", line 341, in enqueue
    self._queue_ref, vals, name=scope)
  File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 3984, in queue_enqueue_v2
    timeout_ms=timeout_ms, name=name)
  File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/jai/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

CancelledError (see above for traceback): Enqueue operation was cancelled
         [[node datafeeder/eval_queue_enqueue (defined at /home/jai/Documents/Tacotron-2-UK-2/tacotron/feeder.py:99)  = QueueEnqueueV2[Tcomponents=[DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](datafeeder/eval_queue, _arg_datafeeder/inputs_0_1, _arg_datafeeder/input_lengths_0_0, _arg_datafeeder/mel_targets_0_3, _arg_datafeeder/token_targets_0_6, _arg_datafeeder/linear_targets_0_2, _arg_datafeeder/targets_lengths_0_5, _arg_datafeeder/split_infos_0_4)]]


Traceback (most recent call last):
  File "train.py", line 138, in <module>
    main()
  File "train.py", line 132, in main
    train(args, log_dir, hparams)
  File "train.py", line 57, in train
    raise('Error occured while training Tacotron, Exiting!')
TypeError: exceptions must derive from BaseException
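
The TypeError at the very end looks like a separate small bug: train.py raises a bare string instead of an exception object, so the real error message gets swallowed. A minimal sketch of what line 57 probably wants to do (my guess, not the repo's actual patch):

# train.py, around line 57 (hypothetical fix, not the repo's actual patch):
# raising an Exception subclass instead of a plain string avoids
# "TypeError: exceptions must derive from BaseException".
raise RuntimeError('Error occured while training Tacotron, Exiting!')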

I've also tried setting gpu_start_idx to 1 to train solely on my 1070, but instead I get this error:

initialisation done /gpu:1
Initialized Tacotron model. Dimensions (? = dynamic shape): 
  Train mode:               True
  Eval mode:                False
  GTA mode:                 False
  Synthesis mode:           False
  Input:                    (?, ?)
  device:                   0
  embedding:                (?, ?, 512)
  enc conv out:             (?, ?, 512)
  encoder out:              (?, ?, 512)
  decoder out:              (?, ?, 80)
  residual out:             (?, ?, 512)
  projected residual out:   (?, ?, 80)
  mel out:                  (?, ?, 80)
  linear out:               (?, ?, 1025)
  <stop_token> out:         (?, ?)
  Tacotron Parameters       29.016 Million.
  device:                   1
Traceback (most recent call last):
  File "train.py", line 138, in <module>
    main()
  File "train.py", line 132, in main
    train(args, log_dir, hparams)
  File "train.py", line 52, in train
    checkpoint = tacotron_train(args, log_dir, hparams)
  File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/train.py", line 376, in tacotron_train
    return train(log_dir, args, hparams)
  File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/train.py", line 156, in train
    model, stats = model_train_mode(args, feeder, hparams, global_step)
  File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/train.py", line 87, in model_train_mode
    is_training=True, split_infos=feeder.split_infos)
  File "/home/jai/Documents/Tacotron-2-UK-2/tacotron/models/tacotron.py", line 247, in initialize
    log('  embedding:                {}'.format(tower_embedded_inputs[i].shape))
IndexError: list index out of range

I'm running Ubuntu 18.04 (4.15.0-39-generic) with driver 390.87 from the official PPA, CUDA 9.0, and cuDNN v7.4.1.5. Any help is appreciated!

jaimu97 avatar Nov 15 '18 03:11 jaimu97

@kwibjo I encountered the second problem too, and I think it is a small bug. Try changing line 245 of tacotron.py to "for i in range(0, hp.tacotron_num_gpus):"; it may solve the problem. Roughly, the loop should end up looking like the sketch below.
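
Something like this (just a sketch, exact line numbers and surrounding lines may differ between forks):

# tacotron/models/tacotron.py, around line 245 (sketch, not an exact diff):
# iterate over the configured number of GPUs so the per-tower logging
# index stays within tower_embedded_inputs and the other tower lists.
for i in range(0, hp.tacotron_num_gpus):
    log('  embedding:                {}'.format(tower_embedded_inputs[i].shape))
    # ... remaining per-tower shape logging stays inside this loop ...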

jarred1989 avatar Nov 16 '18 01:11 jarred1989

Thank you @jarred1989, that fixed the crashing in the first few steps.

However, I'm now getting either "Loss exploded to 19443163401595046756089856.00000 at step 196.00000, avg_loss=1023324389557634032730112.00000]" or "Exiting due to exception: Found Inf or NaN global norm. : Tensor had NaN values". Full logs: https://pastebin.com/raw/PfztfvaD and https://pastebin.com/raw/nn2gaYVL
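
From what I can tell, the second message comes from a finite-ness check on the global gradient norm, so a single NaN anywhere in the gradients is enough to trigger it. A tiny standalone illustration of that kind of check (not the repo's code, just TF 1.x ops I used to understand the error):

import numpy as np
import tensorflow as tf  # TF 1.x, same as the repo

# One NaN gradient makes the global norm NaN; checking the norm for
# finiteness then fails with the same "Found Inf or NaN global norm." text.
grads = [tf.constant([0.5, np.nan]), tf.constant([2.0])]
norm = tf.global_norm(grads)  # NaN because of the NaN entry
checked = tf.check_numerics(norm, 'Found Inf or NaN global norm.')

with tf.Session() as sess:
    sess.run(checked)  # raises InvalidArgumentError: ... Tensor had NaN values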

jaimu97 avatar Nov 17 '18 09:11 jaimu97

I have hit 'Found Inf or NaN global norm' too; how did you solve it?

freecui avatar Nov 13 '19 08:11 freecui

I have the same issue as you. How did you solve it? Thanks very much.
Screenshot from 2020-01-13 13-48-05

15755841658 avatar Jan 13 '20 05:01 15755841658