deep-voice-conversion
STUCK at Train1.py", Line 60 : launch_train_with_config(train_conf, trainer=trainer)
Any idea?
[0409 18:16:28 @parallel.py:193] [MultiProcessPrefetchData] Will fork a dataflow more than one times. This assumes the datapoints are i.i.d.
[0409 18:16:28 @argtools.py:146] WRN "import prctl" failed! Install python-prctl so that processes can be cleaned with guarantee.
[0409 18:16:28 @training.py:50] [DataParallel] Training a model of 2 towers.
[0409 18:16:28 @interface.py:43] Automatically applying StagingInput on the DataFlow.
Traceback (most recent call last):
File "train1.py", line 80, in
Were you able to fix this, @sallyjoy? I am also stuck on this!
@sallyjoy what command did you use to run this train1.py code?
I modified model.py by changing the function name, and the error is gone. But now I hit another problem: build_graph() takes exactly 2 arguments (3 given). I suppose I don't have the proper version of tensorpack.
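For anyone else hitting "build_graph() takes exactly 2 arguments (3 given)": newer tensorpack calls build_graph(self, *inputs) with one argument per input tensor, and expects inputs()/optimizer() instead of the old _get_inputs()/_build_graph()/_get_optimizer(). A rough sketch of that interface is below; the tensor names, shapes and loss are illustrative placeholders, not the repo's actual net1 definition, and very old tensorpack versions return tensorpack.InputDesc objects instead of tf.TensorSpec.

```python
import tensorflow as tf
from tensorpack import ModelDesc

class Net1Sketch(ModelDesc):
    """Illustrative skeleton of the newer tensorpack ModelDesc interface."""

    def inputs(self):
        # One spec per input tensor; the trainer passes them to build_graph().
        # (Older tensorpack: return [InputDesc(tf.float32, (None, None, 40), 'x_mfccs'), ...])
        return [tf.TensorSpec((None, None, 40), tf.float32, 'x_mfccs'),
                tf.TensorSpec((None, None), tf.int32, 'y_ppgs')]

    def build_graph(self, x_mfccs, y_ppgs):
        # Build the real network here; the loss below is just a dummy placeholder.
        loss = tf.reduce_mean(tf.square(x_mfccs))
        return loss  # return the cost tensor instead of setting self.cost

    def optimizer(self):
        return tf.train.AdamOptimizer(1e-4)
```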
I am also stuck at the same point. Does anybody know a way to correct it?
stuck at the same point
@sallyjoy what command did you use to run this train1.py code?
Thanks for replying. I have tried this command: python train1.py case -gpu 0
It looks like the function being called in tensorpack's model_desc module has been deprecated; the body of the function has been removed entirely and it just throws the error. Link to model_desc.py (see line 136)
For a hack fix, this version/release of tensorpack seems to be compiling: (0.9.0.1) https://github.com/tensorpack/tensorpack/archive/0.9.0.1.zip
Thanks for the suggestion.
I have installed tensorpack 0.9.0.1 and the error is gone. Unfortunately, I now get other strange errors. I am testing the code with the following datasets: TIMIT and Arctic. Later, if it works, I am planning to replace Arctic with my own dataset.
case: case, logdir: /data/private/vc/logdir/case/train1
/data/private/vc/datasets/timit/TIMIT/TRAIN///*.wav
[0428 13:15:37 @logger.py:108] WRN Log directory /data/private/vc/logdir/case/train1 exists! Use 'd' to delete it.
[0428 13:15:37 @logger.py:111] WRN If you're resuming from a previous run, you can choose to keep it. Press any other key to exit.
Select Action: k (keep) / d (delete) / q (quit): d
[0428 13:15:46 @logger.py:73] Argv: train1.py case -gpu 0
[0428 13:15:46 @parallel.py:186] [MultiProcessPrefetchData] Will fork a dataflow more than one times. This assumes the datapoints are i.i.d.
[0428 13:15:46 @argtools.py:146] WRN Install python-prctl so that processes can be cleaned with guarantee.
[0428 13:15:46 @config.py:165] WRN TrainConfig.nr_tower was deprecated! Set the number of GPUs on the trainer instead!
[0428 13:15:47 @config.py:166] WRN See https://github.com/tensorpack/tensorpack/issues/458 for more information.
-----OK(----------
[0428 13:15:47 @training.py:52] [DataParallel] Training a model of 2 towers.
[0428 13:15:47 @training.py:54] ERR [DataParallel] TensorFlow was not built with CUDA support!
[0428 13:15:47 @interface.py:46] Automatically applying StagingInput on the DataFlow.
[0428 13:15:47 @develop.py:96] WRN [Deprecated] ModelDescBase._get_inputs() interface will be deprecated after 30 Mar. Use inputs() instead!
[0428 13:15:47 @input_source.py:220] Setting up the queue 'QueueInput/input_queue' for CPU prefetching ...
[0428 13:15:47 @training.py:112] Building graph for training tower 0 on device /gpu:0 ...
[0428 13:15:47 @develop.py:96] WRN [Deprecated] ModelDescBase._build_graph() interface will be deprecated after 30 Mar. Use build_graph() instead!
Process _Worker-1 (the same traceback is also raised by _Worker-2, _Worker-3 and _Worker-4):
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/root/.local/lib/python3.6/site-packages/tensorpack/dataflow/parallel.py", line 163, in run
    for dp in self.ds:
  File "/root/.local/lib/python3.6/site-packages/tensorpack/dataflow/common.py", line 116, in __iter__
    for data in self.ds:
  File "/content/deep-voice-conversion/data_load.py", line 35, in get_data
    yield get_mfccs_and_phones(wav_file=wav_file)
  File "/content/deep-voice-conversion/data_load.py", line 76, in get_mfccs_and_phones
    hp.default.hop_length)
  File "/content/deep-voice-conversion/data_load.py", line 148, in _get_mfcc_and_spec
    mel_basis = librosa.filters.mel(hp.default.sr, hp.default.n_fft, hp.default.n_mels)  # (n_mels, 1+n_fft//2)
  File "/usr/local/lib/python3.6/dist-packages/librosa/filters.py", line 247, in mel
    lower = -ramps[i] / fdiff[i]
ValueError: operands could not be broadcast together with shapes (1,257) (0,)
[0428 13:15:48 @develop.py:96] WRN [Deprecated] get_cost() and self.cost will be deprecated after 30 Mar. Return the cost tensor directly in build_graph() instead!
[0428 13:15:48 @develop.py:96] WRN [Deprecated] ModelDescBase._get_optimizer() interface will be deprecated after 30 Mar. Use optimizer() instead!
[0428 13:15:49 @training.py:112] Building graph for training tower 1 on device /gpu:1 ...
[0428 13:15:49 @develop.py:96] WRN [Deprecated] ModelDescBase._build_graph() interface will be deprecated after 30 Mar. Use build_graph() instead!
[0428 13:15:49 @develop.py:96] WRN [Deprecated] get_cost() and self.cost will be deprecated after 30 Mar. Return the cost tensor directly in build_graph() instead!
[0428 13:15:51 @collection.py:164] These collections were modified but restored in tower1: (tf.GraphKeys.SUMMARIES: 3->5)
[0428 13:15:52 @training.py:322] 'sync_variables_from_main_tower' includes 174 operations.
[0428 13:15:52 @model_utils.py:64] Trainable Variables:
name shape dim
net1/prenet/dense1/kernel:0 [40, 128] 5120
net1/prenet/dense1/bias:0 [128] 128
net1/prenet/dense2/kernel:0 [128, 64] 8192
net1/prenet/dense2/bias:0 [64] 64
net1/cbhg/conv1d_banks/num_1/conv1d/conv1d/kernel:0 [1, 64, 64] 4096
net1/cbhg/conv1d_banks/num_1/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_1/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_banks/num_2/conv1d/conv1d/kernel:0 [2, 64, 64] 8192
net1/cbhg/conv1d_banks/num_2/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_2/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_banks/num_3/conv1d/conv1d/kernel:0 [3, 64, 64] 12288
net1/cbhg/conv1d_banks/num_3/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_3/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_banks/num_4/conv1d/conv1d/kernel:0 [4, 64, 64] 16384
net1/cbhg/conv1d_banks/num_4/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_4/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_banks/num_5/conv1d/conv1d/kernel:0 [5, 64, 64] 20480
net1/cbhg/conv1d_banks/num_5/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_5/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_banks/num_6/conv1d/conv1d/kernel:0 [6, 64, 64] 24576
net1/cbhg/conv1d_banks/num_6/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_6/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_banks/num_7/conv1d/conv1d/kernel:0 [7, 64, 64] 28672
net1/cbhg/conv1d_banks/num_7/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_7/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_banks/num_8/conv1d/conv1d/kernel:0 [8, 64, 64] 32768
net1/cbhg/conv1d_banks/num_8/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_8/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_1/conv1d/kernel:0 [3, 512, 64] 98304
net1/cbhg/normalize/beta:0 [64] 64
net1/cbhg/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_2/conv1d/kernel:0 [3, 64, 64] 12288
net1/cbhg/highwaynet_0/dense1/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_0/dense1/bias:0 [64] 64
net1/cbhg/highwaynet_0/dense2/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_0/dense2/bias:0 [64] 64
net1/cbhg/highwaynet_1/dense1/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_1/dense1/bias:0 [64] 64
net1/cbhg/highwaynet_1/dense2/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_1/dense2/bias:0 [64] 64
net1/cbhg/highwaynet_2/dense1/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_2/dense1/bias:0 [64] 64
net1/cbhg/highwaynet_2/dense2/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_2/dense2/bias:0 [64] 64
net1/cbhg/highwaynet_3/dense1/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_3/dense1/bias:0 [64] 64
net1/cbhg/highwaynet_3/dense2/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_3/dense2/bias:0 [64] 64
net1/cbhg/gru/bidirectional_rnn/fw/gru_cell/gates/kernel:0 [128, 128] 16384
net1/cbhg/gru/bidirectional_rnn/fw/gru_cell/gates/bias:0 [128] 128
net1/cbhg/gru/bidirectional_rnn/fw/gru_cell/candidate/kernel:0 [128, 64] 8192
net1/cbhg/gru/bidirectional_rnn/fw/gru_cell/candidate/bias:0 [64] 64
net1/cbhg/gru/bidirectional_rnn/bw/gru_cell/gates/kernel:0 [128, 128] 16384
net1/cbhg/gru/bidirectional_rnn/bw/gru_cell/gates/bias:0 [128] 128
net1/cbhg/gru/bidirectional_rnn/bw/gru_cell/candidate/kernel:0 [128, 64] 8192
net1/cbhg/gru/bidirectional_rnn/bw/gru_cell/candidate/bias:0 [64] 64
net1/dense/kernel:0 [128, 61] 7808
net1/dense/bias:0 [61] 61
Total #vars=58, #params=363389, size=1.39MB
[0428 13:15:52 @base.py:209] Setup callbacks graph ...
[0428 13:15:52 @summary.py:38] Maintain moving average summary of 0 tensors in collection MOVING_SUMMARY_OPS.
[0428 13:15:52 @summary.py:75] Summarizing collection 'summaries' of size 3.
[0428 13:15:53 @base.py:227] Creating the session ...
2019-04-28 13:15:53.199861: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2019-04-28 13:15:53.199907: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2019-04-28 13:15:53.199917: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2019-04-28 13:15:53.199929: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2019-04-28 13:15:53.199939: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1297, in _run_fn
self._extend_graph()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1358, in _extend_graph
self._session, graph_def.SerializeToString(), status)
File "/usr/lib/python3.6/contextlib.py", line 88, in exit
next(self.gen)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' with these attrs. Registered devices: [CPU], Registered kernels:
[[Node: AllReduceGrads/NcclAllReduce_105 = NcclAllReduce[T=DT_FLOAT, num_devices=2, reduction="sum", shared_name="c52", _device="/device:GPU:1"](tower1/gradients/tower1/net1/cbhg/gru/bidirectional_rnn/bw/bw/while/bw/gru_cell/gates/gates/MatMul/Enter_grad/b_acc_3)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train1.py", line 82, in
[[Node: AllReduceGrads/NcclAllReduce_105 = NcclAllReduce[T=DT_FLOAT, num_devices=2, reduction="sum", shared_name="c52", _device="/device:GPU:1"](tower1/gradients/tower1/net1/cbhg/gru/bidirectional_rnn/bw/bw/while/bw/gru_cell/gates/gates/MatMul/Enter_grad/b_acc_3)]]
Caused by op 'AllReduceGrads/NcclAllReduce_105', defined at:
File "train1.py", line 82, in
InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'NcclAllReduce' with these attrs. Registered devices: [CPU], Registered kernels:
[[Node: AllReduceGrads/NcclAllReduce_105 = NcclAllReduce[T=DT_FLOAT, num_devices=2, reduction="sum", shared_name="c52", _device="/device:GPU:1"](tower1/gradients/tower1/net1/cbhg/gru/bidirectional_rnn/bw/bw/while/bw/gru_cell/gates/gates/MatMul/Enter_grad/b_acc_3)]]
Try running without the GPU flag. I'm pretty sure that if you are using tensorflow-gpu but aren't using multiple GPUs, TensorFlow will automatically use the GPU you have already set up.
Also, I was receiving a lot of strange NCCL errors, and I don't need NCCL since I am only using one GPU. In train1.py and train2.py I've added
from tensorpack.train.trainers import SimpleTrainer
and then changed the line trainer = SyncMultiGPUTrainerReplicated(hp.train2.num_gpu)
to trainer = SimpleTrainer()
Also, in tensorpack 0.9.0.1, in graph_builder/utils.py, I had to change the line from tensorflow.contrib import nccl
to from tensorflow.python.ops.nccl_ops import all_sum
and then summed = all_sum(grads)
a few lines below. This might not be necessary depending on your TensorFlow version.
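For clarity, here is a minimal sketch of that train1.py change, assuming you keep the TrainConfig the repo already builds (the train_single_gpu wrapper below is only illustrative):

```python
# Minimal sketch of the single-GPU swap described above; `train_conf` is the
# TrainConfig that train1.py already builds (not reproduced here).
from tensorpack import launch_train_with_config
from tensorpack.train.trainers import SimpleTrainer

def train_single_gpu(train_conf):
    # Before: trainer = SyncMultiGPUTrainerReplicated(...)  # multi-GPU trainer, builds NCCL all-reduce ops
    trainer = SimpleTrainer()  # one tower only, so no NcclAllReduce kernels are needed
    launch_train_with_config(train_conf, trainer=trainer)
```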
I have changed train1.py like you said and it seems to be working; no error is displayed. But it is a little strange: it starts Epoch 1 and shows this message, which never changes over time.
[0429 14:01:19 @base.py:233] Initializing the session ...
[0429 14:01:19 @base.py:240] Graph Finalized.
[0429 14:01:19 @concurrency.py:37] Starting EnqueueThread QueueInput/input_queue ...
[0429 14:01:20 @base.py:272] Start Epoch 1 ...
 0%| |0/100 [00:00<?, ?it/s]
2019-04-29 14:01:21.923305: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
It starts Epoch 1, but the progress bar never changes, it never moves to Epoch 2, and no checkpoint is stored even after about 8 hours of execution.
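The only suspicious thing I can see earlier in the log is the worker tracebacks dying inside librosa.filters.mel; if the dataflow workers keep crashing, nothing ever reaches the input queue and the progress bar stays at 0/100. A quick, hedged way to check is to run the repo's feature extraction on a single file outside the trainer (the glob path below is only an example, point it wherever your data actually lives):

```python
# Hedged sanity check: call the repo's feature extraction directly on one file.
# If this raises the same ValueError as the worker tracebacks, training will hang
# at 0% because the dataflow workers crash before producing any datapoint.
import glob
from data_load import get_mfccs_and_phones  # function shown in the tracebacks above

# Example path only -- use whatever your hparams actually point to.
wav_files = glob.glob('./data/private/vc/datasets/timit/TIMIT/TRAIN/*/*/*.wav')
print(len(wav_files), 'wav files found')  # 0 means the dataset path is wrong

if wav_files:
    sample = get_mfccs_and_phones(wav_file=wav_files[0])
    # assuming it returns array-like outputs; print their shapes if so
    print('feature extraction OK:', [getattr(a, 'shape', a) for a in sample])
```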
I am also facing the same issue. Please reply if you have been able to solve this issue
Hi guys, after following your solution for this problem I was also stuck at the same point, but then I was able to figure out the solution and start the training. The issue is with the .yaml files, where the data path starts with '/data/private/..'; just change it to './data/private/....' by editing those files in the hparams folder. Then, if an issue arises with librosa, update librosa to 0.6.2 and it will start working. Maybe this could help you guys.
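For reference, a small hedged check of those two points: the librosa version, and whether a mel filterbank can be built at all. The sr/n_fft/n_mels values below are illustrative, not necessarily the repo's (the "(1,257)" in the error implies 1 + n_fft // 2 = 257, i.e. n_fft = 512); take the real values from the hparams yaml. Note that much newer librosa releases changed this call to keyword arguments, which is another reason the pinned 0.6.2 matters.

```python
# Hedged check: librosa version and mel-filter construction with explicit values.
import librosa

print('librosa', librosa.__version__)  # 0.6.2 is the version suggested above

sr, n_fft, n_mels = 16000, 512, 80     # illustrative; read them from hparams/default.yaml
mel_basis = librosa.filters.mel(sr, n_fft, n_mels)  # positional args, as data_load.py calls it
print(mel_basis.shape)                 # expect (n_mels, 1 + n_fft // 2) = (80, 257)
```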
@Muhammad-MujtabaSaeed I have made those changes but I am still stuck with this issue. It would be nice of you if you could share the changes you have made so that I can cross-check them.
Still stuck. It doesn't change anything for me either.
Anyone figure this out?^
Is there any progress with this issue? Stuck too.
After installing tensorpack 0.9.0.1, another error comes out:
Traceback (most recent call last): File "train1.py", line 78, in <module> train(args, logdir=logdir_train1) File "train1.py", line 60, in train launch_train_with_config(train_conf, trainer=trainer) File "/usr/local/lib/python2.7/dist-packages/tensorpack/train/interface.py", line 90, in launch_train_with_config model.get_input_signature(), input, File "/usr/local/lib/python2.7/dist-packages/tensorpack/utils/argtools.py", line 200, in wrapper value = func(*args, **kwargs) File "/usr/local/lib/python2.7/dist-packages/tensorpack/graph_builder/model_desc.py", line 90, in get_input_signature inputs = self.inputs() File "/usr/local/lib/python2.7/dist-packages/tensorpack/graph_builder/model_desc.py", line 122, in inputs raise NotImplementedError() NotImplementedError
I'm not sure whether there is some problem with my tensorpack or TensorFlow version. Does anyone meet the same problem? Still stuck...
Is there any progress on this issue? I'm stuck too.
Hello, I am also experiencing the same problem now. Did you manage to solve it later?