
STUCK at train1.py, line 60: launch_train_with_config(train_conf, trainer=trainer)

sallyjoy opened this issue 5 years ago · 21 comments

Any idea?

```
[0409 18:16:28 @parallel.py:193] [MultiProcessPrefetchData] Will fork a dataflow more than one times. This assumes the datapoints are i.i.d.
[0409 18:16:28 @argtools.py:146] WRN "import prctl" failed! Install python-prctl so that processes can be cleaned with guarantee.
[0409 18:16:28 @training.py:50] [DataParallel] Training a model of 2 towers.
[0409 18:16:28 @interface.py:43] Automatically applying StagingInput on the DataFlow.
Traceback (most recent call last):
  File "train1.py", line 80, in <module>
    train(args, logdir=logdir_train1)
  File "train1.py", line 60, in train
    launch_train_with_config(train_conf, trainer=trainer)
  File "/usr/local/lib/python3.6/dist-packages/tensorpack/train/interface.py", line 90, in launch_train_with_config
    model.get_input_signature(), input,
  File "/usr/local/lib/python3.6/dist-packages/tensorpack/utils/argtools.py", line 200, in wrapper
    value = func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorpack/graph_builder/model_desc.py", line 86, in get_input_signature
    inputs = self.inputs()
  File "/usr/local/lib/python3.6/dist-packages/tensorpack/graph_builder/model_desc.py", line 116, in inputs
    raise NotImplementedError()
NotImplementedError
```

sallyjoy · Apr 09 '19 15:04

Were you able to fix this, @sallyjoy? I am also stuck at this!

ash13 · Apr 14 '19 23:04

@sallyjoy what command did you use to run train1.py?

YashBangera7 · Apr 15 '19 21:04

I modified model.py by changing the function name, and that error is gone. But then I hit another problem: build_graph() takes exactly 2 arguments (3 given). I suspect I don't have the proper version of tensorpack.
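For reference, newer tensorpack renamed the ModelDesc hooks, which would explain both errors: `inputs()` replaces the removed `_get_inputs()` (the base class raises NotImplementedError if it isn't overridden), and `build_graph()` receives the input tensors unpacked instead of as a single list. A minimal sketch of the newer interface; the class name and placeholder shapes here are illustrative assumptions, not this repo's exact code:

```python
import tensorflow as tf
from tensorpack import ModelDesc

class Net1(ModelDesc):  # hypothetical stand-in for this repo's model
    def inputs(self):
        # Replaces the removed _get_inputs(); without this override the
        # base class raises NotImplementedError.
        return [tf.placeholder(tf.float32, (None, None, 40), 'x_mfccs'),
                tf.placeholder(tf.int32, (None, None), 'y_ppgs')]

    def build_graph(self, x_mfccs, y_ppgs):
        # Replaces _build_graph(self, inputs): the tensors arrive unpacked,
        # so keeping the old one-argument signature yields
        # "build_graph() takes exactly 2 arguments (3 given)".
        return tf.reduce_mean(x_mfccs)  # placeholder cost tensor
```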

hallcacrx · Apr 23 '19 08:04

I am also stuck at the same point. Does anybody know a way to fix it?

kushmisra · Apr 23 '19 12:04

Stuck at the same point.

sebasdeldi · Apr 24 '19 01:04

> @sallyjoy what command did you use to run train1.py?

Thanks for replying. I tried this command: `python train1.py case -gpu 0`

sallyjoy · Apr 27 '19 19:04

It looks like the function being called in tensorpack's model_desc.py has been deprecated; the body of the function has been removed entirely and now just raises the error. Link to model_desc.py (see line 136)

LucasMoskun · Apr 28 '19 07:04

For a hack fix, this version/release of tensorpack seems to compile: (0.9.0.1) https://github.com/tensorpack/tensorpack/archive/0.9.0.1.zip
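If you prefer pip, installing straight from that archive should pin the same release, e.g. `pip install tensorpack==0.9.0.1` or `pip install https://github.com/tensorpack/tensorpack/archive/0.9.0.1.zip` (untested beyond my own setup).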

LucasMoskun · Apr 28 '19 08:04

> For a hack fix, this version/release of tensorpack seems to compile: (0.9.0.1) https://github.com/tensorpack/tensorpack/archive/0.9.0.1.zip


Thanks for the suggestion.

I have installed tensorpack 0.9.0.1 and that error is gone. Unfortunately, I got other strange errors. I am testing the code with the following datasets: TIMIT and Arctic. Later, if it works, I am planning to replace Arctic with my own dataset.


```
case: case, logdir: /data/private/vc/logdir/case/train1
/data/private/vc/datasets/timit/TIMIT/TRAIN/*/*/*.wav
[0428 13:15:37 @logger.py:108] WRN Log directory /data/private/vc/logdir/case/train1 exists! Use 'd' to delete it.
[0428 13:15:37 @logger.py:111] WRN If you're resuming from a previous run, you can choose to keep it. Press any other key to exit.
Select Action: k (keep) / d (delete) / q (quit): d
[0428 13:15:46 @logger.py:73] Argv: train1.py case -gpu 0
[0428 13:15:46 @parallel.py:186] [MultiProcessPrefetchData] Will fork a dataflow more than one times. This assumes the datapoints are i.i.d.
[0428 13:15:46 @argtools.py:146] WRN Install python-prctl so that processes can be cleaned with guarantee.
[0428 13:15:46 @config.py:165] WRN TrainConfig.nr_tower was deprecated! Set the number of GPUs on the trainer instead!
[0428 13:15:47 @config.py:166] WRN See https://github.com/tensorpack/tensorpack/issues/458 for more information.
-----OK(----------
[0428 13:15:47 @training.py:52] [DataParallel] Training a model of 2 towers.
[0428 13:15:47 @training.py:54] ERR [DataParallel] TensorFlow was not built with CUDA support!
[0428 13:15:47 @interface.py:46] Automatically applying StagingInput on the DataFlow.
[0428 13:15:47 @develop.py:96] WRN [Deprecated] ModelDescBase._get_inputs() interface will be deprecated after 30 Mar. Use inputs() instead!
[0428 13:15:47 @input_source.py:220] Setting up the queue 'QueueInput/input_queue' for CPU prefetching ...
[0428 13:15:47 @training.py:112] Building graph for training tower 0 on device /gpu:0 ...
[0428 13:15:47 @develop.py:96] WRN [Deprecated] ModelDescBase._build_graph() interface will be deprecated after 30 Mar. Use build_graph() instead!
Process _Worker-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/root/.local/lib/python3.6/site-packages/tensorpack/dataflow/parallel.py", line 163, in run
    for dp in self.ds:
  File "/root/.local/lib/python3.6/site-packages/tensorpack/dataflow/common.py", line 116, in __iter__
    for data in self.ds:
  File "/content/deep-voice-conversion/data_load.py", line 35, in get_data
    yield get_mfccs_and_phones(wav_file=wav_file)
  File "/content/deep-voice-conversion/data_load.py", line 76, in get_mfccs_and_phones
    hp.default.hop_length)
  File "/content/deep-voice-conversion/data_load.py", line 148, in _get_mfcc_and_spec
    mel_basis = librosa.filters.mel(hp.default.sr, hp.default.n_fft, hp.default.n_mels)  # (n_mels, 1+n_fft//2)
  File "/usr/local/lib/python3.6/dist-packages/librosa/filters.py", line 247, in mel
    lower = -ramps[i] / fdiff[i]
ValueError: operands could not be broadcast together with shapes (1,257) (0,)
(identical tracebacks follow from _Worker-2, _Worker-3 and _Worker-4)
[0428 13:15:48 @develop.py:96] WRN [Deprecated] get_cost() and self.cost will be deprecated after 30 Mar. Return the cost tensor directly in build_graph() instead!
[0428 13:15:48 @develop.py:96] WRN [Deprecated] ModelDescBase._get_optimizer() interface will be deprecated after 30 Mar. Use optimizer() instead!
[0428 13:15:49 @training.py:112] Building graph for training tower 1 on device /gpu:1 ...
[0428 13:15:49 @develop.py:96] WRN [Deprecated] ModelDescBase._build_graph() interface will be deprecated after 30 Mar. Use build_graph() instead!
[0428 13:15:49 @develop.py:96] WRN [Deprecated] get_cost() and self.cost will be deprecated after 30 Mar. Return the cost tensor directly in build_graph() instead!
[0428 13:15:51 @collection.py:164] These collections were modified but restored in tower1: (tf.GraphKeys.SUMMARIES: 3->5)
[0428 13:15:52 @training.py:322] 'sync_variables_from_main_tower' includes 174 operations.
[0428 13:15:52 @model_utils.py:64] Trainable Variables:
name                                                            shape         dim
net1/prenet/dense1/kernel:0                                     [40, 128]     5120
net1/prenet/dense1/bias:0                                       [128]         128
net1/prenet/dense2/kernel:0                                     [128, 64]     8192
net1/prenet/dense2/bias:0                                       [64]          64
net1/cbhg/conv1d_banks/num_1/conv1d/conv1d/kernel:0             [1, 64, 64]   4096
net1/cbhg/conv1d_banks/num_1/normalize/beta:0                   [64]          64
net1/cbhg/conv1d_banks/num_1/normalize/gamma:0                  [64]          64
net1/cbhg/conv1d_banks/num_2/conv1d/conv1d/kernel:0             [2, 64, 64]   8192
net1/cbhg/conv1d_banks/num_2/normalize/beta:0                   [64]          64
net1/cbhg/conv1d_banks/num_2/normalize/gamma:0                  [64]          64
net1/cbhg/conv1d_banks/num_3/conv1d/conv1d/kernel:0             [3, 64, 64]   12288
net1/cbhg/conv1d_banks/num_3/normalize/beta:0                   [64]          64
net1/cbhg/conv1d_banks/num_3/normalize/gamma:0                  [64]          64
net1/cbhg/conv1d_banks/num_4/conv1d/conv1d/kernel:0             [4, 64, 64]   16384
net1/cbhg/conv1d_banks/num_4/normalize/beta:0                   [64]          64
net1/cbhg/conv1d_banks/num_4/normalize/gamma:0                  [64]          64
net1/cbhg/conv1d_banks/num_5/conv1d/conv1d/kernel:0             [5, 64, 64]   20480
net1/cbhg/conv1d_banks/num_5/normalize/beta:0                   [64]          64
net1/cbhg/conv1d_banks/num_5/normalize/gamma:0                  [64]          64
net1/cbhg/conv1d_banks/num_6/conv1d/conv1d/kernel:0             [6, 64, 64]   24576
net1/cbhg/conv1d_banks/num_6/normalize/beta:0                   [64]          64
net1/cbhg/conv1d_banks/num_6/normalize/gamma:0                  [64]          64
net1/cbhg/conv1d_banks/num_7/conv1d/conv1d/kernel:0             [7, 64, 64]   28672
net1/cbhg/conv1d_banks/num_7/normalize/beta:0                   [64]          64
net1/cbhg/conv1d_banks/num_7/normalize/gamma:0                  [64]          64
net1/cbhg/conv1d_banks/num_8/conv1d/conv1d/kernel:0             [8, 64, 64]   32768
net1/cbhg/conv1d_banks/num_8/normalize/beta:0                   [64]          64
net1/cbhg/conv1d_banks/num_8/normalize/gamma:0                  [64]          64
net1/cbhg/conv1d_1/conv1d/kernel:0                              [3, 512, 64]  98304
net1/cbhg/normalize/beta:0                                      [64]          64
net1/cbhg/normalize/gamma:0                                     [64]          64
net1/cbhg/conv1d_2/conv1d/kernel:0                              [3, 64, 64]   12288
net1/cbhg/highwaynet_0/dense1/kernel:0                          [64, 64]      4096
net1/cbhg/highwaynet_0/dense1/bias:0                            [64]          64
net1/cbhg/highwaynet_0/dense2/kernel:0                          [64, 64]      4096
net1/cbhg/highwaynet_0/dense2/bias:0                            [64]          64
net1/cbhg/highwaynet_1/dense1/kernel:0                          [64, 64]      4096
net1/cbhg/highwaynet_1/dense1/bias:0                            [64]          64
net1/cbhg/highwaynet_1/dense2/kernel:0                          [64, 64]      4096
net1/cbhg/highwaynet_1/dense2/bias:0                            [64]          64
net1/cbhg/highwaynet_2/dense1/kernel:0                          [64, 64]      4096
net1/cbhg/highwaynet_2/dense1/bias:0                            [64]          64
net1/cbhg/highwaynet_2/dense2/kernel:0                          [64, 64]      4096
net1/cbhg/highwaynet_2/dense2/bias:0                            [64]          64
net1/cbhg/highwaynet_3/dense1/kernel:0                          [64, 64]      4096
net1/cbhg/highwaynet_3/dense1/bias:0                            [64]          64
net1/cbhg/highwaynet_3/dense2/kernel:0                          [64, 64]      4096
net1/cbhg/highwaynet_3/dense2/bias:0                            [64]          64
net1/cbhg/gru/bidirectional_rnn/fw/gru_cell/gates/kernel:0      [128, 128]    16384
net1/cbhg/gru/bidirectional_rnn/fw/gru_cell/gates/bias:0        [128]         128
net1/cbhg/gru/bidirectional_rnn/fw/gru_cell/candidate/kernel:0  [128, 64]     8192
net1/cbhg/gru/bidirectional_rnn/fw/gru_cell/candidate/bias:0    [64]          64
net1/cbhg/gru/bidirectional_rnn/bw/gru_cell/gates/kernel:0      [128, 128]    16384
net1/cbhg/gru/bidirectional_rnn/bw/gru_cell/gates/bias:0        [128]         128
net1/cbhg/gru/bidirectional_rnn/bw/gru_cell/candidate/kernel:0  [128, 64]     8192
net1/cbhg/gru/bidirectional_rnn/bw/gru_cell/candidate/bias:0    [64]          64
net1/dense/kernel:0                                             [128, 61]     7808
net1/dense/bias:0                                               [61]          61
Total #vars=58, #params=363389, size=1.39MB
[0428 13:15:52 @base.py:209] Setup callbacks graph ...
[0428 13:15:52 @summary.py:38] Maintain moving average summary of 0 tensors in collection MOVING_SUMMARY_OPS.
[0428 13:15:52 @summary.py:75] Summarizing collection 'summaries' of size 3.
[0428 13:15:53 @base.py:227] Creating the session ...
2019-04-28 13:15:53.199861: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2019-04-28 13:15:53.199907: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2019-04-28 13:15:53.199917: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2019-04-28 13:15:53.199929: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2019-04-28 13:15:53.199939: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1297, in _run_fn
    self._extend_graph()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1358, in _extend_graph
    self._session, graph_def.SerializeToString(), status)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' with these attrs. Registered devices: [CPU], Registered kernels:

  [[Node: AllReduceGrads/NcclAllReduce_105 = NcclAllReduce[T=DT_FLOAT, num_devices=2, reduction="sum", shared_name="c52", _device="/device:GPU:1"](tower1/gradients/tower1/net1/cbhg/gru/bidirectional_rnn/bw/bw/while/bw/gru_cell/gates/gates/MatMul/Enter_grad/b_acc_3)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train1.py", line 82, in <module>
    train(args, logdir=logdir_train1)
  File "train1.py", line 62, in train
    launch_train_with_config(train_conf, trainer=trainer)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/interface.py", line 97, in launch_train_with_config
    extra_callbacks=config.extra_callbacks)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/base.py", line 341, in train_with_defaults
    steps_per_epoch, starting_epoch, max_epoch)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/base.py", line 312, in train
    self.initialize(session_creator, session_init)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/utils/argtools.py", line 176, in wrapper
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/tower.py", line 144, in initialize
    super(TowerTrainer, self).initialize(session_creator, session_init)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/utils/argtools.py", line 176, in wrapper
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/base.py", line 229, in initialize
    self.sess = session_creator.create_session()
  File "/root/.local/lib/python3.6/site-packages/tensorpack/tfutils/sesscreate.py", line 43, in create_session
    sess.run(tf.global_variables_initializer())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' with these attrs. Registered devices: [CPU], Registered kernels:

  [[Node: AllReduceGrads/NcclAllReduce_105 = NcclAllReduce[T=DT_FLOAT, num_devices=2, reduction="sum", shared_name="c52", _device="/device:GPU:1"](tower1/gradients/tower1/net1/cbhg/gru/bidirectional_rnn/bw/bw/while/bw/gru_cell/gates/gates/MatMul/Enter_grad/b_acc_3)]]

Caused by op 'AllReduceGrads/NcclAllReduce_105', defined at:
  File "train1.py", line 82, in <module>
    train(args, logdir=logdir_train1)
  File "train1.py", line 62, in train
    launch_train_with_config(train_conf, trainer=trainer)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/interface.py", line 87, in launch_train_with_config
    model._build_graph_get_cost, model.get_optimizer)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/utils/argtools.py", line 176, in wrapper
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/tower.py", line 204, in setup_graph
    train_callbacks = self._setup_graph(input, get_cost_fn, get_opt_fn)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/trainers.py", line 186, in _setup_graph
    self._make_get_grad_fn(input, get_cost_fn, get_opt_fn), get_opt_fn)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/graph_builder/training.py", line 244, in build
    all_grads = allreduce_grads(all_grads, average=self._average)  # #gpu x #param
  File "/root/.local/lib/python3.6/site-packages/tensorpack/tfutils/scope_utils.py", line 94, in wrapper
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/graph_builder/utils.py", line 157, in allreduce_grads
    summed = nccl.all_sum(grads)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 48, in all_sum
    return _apply_all_reduce('sum', tensors)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 154, in _apply_all_reduce
    shared_name=shared_name))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/nccl/ops/gen_nccl_ops.py", line 43, in nccl_all_reduce
    shared_name=shared_name, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'NcclAllReduce' with these attrs. Registered devices: [CPU], Registered kernels:

  [[Node: AllReduceGrads/NcclAllReduce_105 = NcclAllReduce[T=DT_FLOAT, num_devices=2, reduction="sum", shared_name="c52", _device="/device:GPU:1"](tower1/gradients/tower1/net1/cbhg/gru/bidirectional_rnn/bw/bw/while/bw/gru_cell/gates/gates/MatMul/Enter_grad/b_acc_3)]]
```

sallyjoy · Apr 28 '19 15:04

Try running without the GPU flag. I'm pretty sure that if you are using tensorflow-gpu but aren't using multiple GPUs, the tensorflow-gpu commands will automatically use the GPU you have already set up.
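For example, with the command from earlier in this thread: `python train1.py case` instead of `python train1.py case -gpu 0` (assuming a single visible GPU).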

LucasMoskun · Apr 28 '19 15:04

Also, I was receiving a lot of strange NCCL errors, and I don't need NCCL since I am only using one GPU. In train1.py and train2.py I added `from tensorpack.train.trainers import SimpleTrainer` and then changed the line `trainer = SyncMultiGPUTrainerReplicated(hp.train2.num_gpu)` to `trainer = SimpleTrainer()`.

Also, in tensorpack 0.9.0.1's graph_builder/utils.py I had to change the line `from tensorflow.contrib import nccl` to `from tensorflow.python.ops.nccl_ops import all_sum`, and then `summed = all_sum(grads)` a few lines below. This might not be necessary depending on your TensorFlow version.
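Putting both changes together, a sketch of what I mean (file paths and hparam names as in this thread; the second fragment patches tensorpack's own graph_builder/utils.py, so it is not standalone code):

```python
# In train1.py / train2.py: use the single-GPU trainer so no NCCL
# all-reduce op is ever built.
from tensorpack.train.trainers import SimpleTrainer

# before: trainer = SyncMultiGPUTrainerReplicated(hp.train2.num_gpu)
trainer = SimpleTrainer()

# In tensorpack 0.9.0.1, graph_builder/utils.py, inside allreduce_grads()
# (only needed if tf.contrib.nccl is gone from your TensorFlow build):
# before: from tensorflow.contrib import nccl ... summed = nccl.all_sum(grads)
from tensorflow.python.ops.nccl_ops import all_sum
summed = all_sum(grads)  # grads: the list of per-GPU gradient tensors
```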

LucasMoskun · Apr 28 '19 16:04

> Also, I was receiving a lot of strange NCCL errors, and I don't need NCCL since I am only using one GPU. In train1.py and train2.py I added `from tensorpack.train.trainers import SimpleTrainer` and then changed the line `trainer = SyncMultiGPUTrainerReplicated(hp.train2.num_gpu)` to `trainer = SimpleTrainer()`.

> Also, in tensorpack 0.9.0.1's graph_builder/utils.py I had to change the line `from tensorflow.contrib import nccl` to `from tensorflow.python.ops.nccl_ops import all_sum`, and then `summed = all_sum(grads)` a few lines below. This might not be necessary depending on your TensorFlow version.


I have changed train1.py as you said and it seems to be working; no error is displayed. But it is a little strange, because it starts Epoch 1 with the message below, which never changes over time.


```
[0429 14:01:19 @base.py:233] Initializing the session ...
[0429 14:01:19 @base.py:240] Graph Finalized.
[0429 14:01:19 @concurrency.py:37] Starting EnqueueThread QueueInput/input_queue ...
[0429 14:01:20 @base.py:272] Start Epoch 1 ...
  0%|          |0/100[00:00<?,?it/s]
2019-04-29 14:01:21.923305: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
```

sallyjoy · Apr 29 '19 14:04

It starts Epoch 1, but there is no change in the progress bar; it never moves to Epoch 2, and no checkpoint has been stored after about 8 hours of execution.

```
[0429 14:01:19 @base.py:233] Initializing the session ...
[0429 14:01:19 @base.py:240] Graph Finalized.
[0429 14:01:19 @concurrency.py:37] Starting EnqueueThread QueueInput/input_queue ...
[0429 14:01:20 @base.py:272] Start Epoch 1 ...
  0%|          |0/100[00:00<?,?it/s]
2019-04-29 14:01:21.923305: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
```
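One way I could check whether the dataflow itself is the problem is to iterate it outside the trainer, since a bar stuck at 0/100 usually means the input queue never receives a datapoint, i.e. the dataflow workers are dying (like the librosa tracebacks earlier in this thread). A hypothetical sketch; the class and hparam names are assumptions based on this repo's data_load.py and hparam.py:

```python
# Run the dataflow by itself so worker exceptions show up in the main
# process instead of being swallowed by the EnqueueThread.
from tensorpack.dataflow import TestDataSpeed
from data_load import Net1DataFlow   # assumed class name from this repo
from hparam import hparam as hp      # assumed hparam module

hp.set_hparam_yaml('case')           # same case as `python train1.py case`
df = Net1DataFlow(hp.train1.data_path, hp.train1.batch_size)
TestDataSpeed(df, size=100).start()  # fails fast if get_data() raises
```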

sallyjoy · Apr 30 '19 13:04

> It starts Epoch 1, but there is no change in the progress bar; it never moves to Epoch 2, and no checkpoint has been stored after about 8 hours of execution.
>
> [0429 14:01:19 @base.py:233] Initializing the session ...
> [0429 14:01:19 @base.py:240] Graph Finalized.
> [0429 14:01:19 @concurrency.py:37] Starting EnqueueThread QueueInput/input_queue ...
> [0429 14:01:20 @base.py:272] Start Epoch 1 ...
> 0%| |0/100[00:00<?,?it/s]
> 2019-04-29 14:01:21.923305: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally

I am also facing the same issue. Please reply if you have been able to solve it.

syedKhutub · May 12 '19 11:05

Hi guys, after following your solutions for this problem I was also stuck at the same point, but then I was able to figure out the fix and start the training. The issue is with the .yaml files, where the data path starts with '/data/private/..': just change it to './data/private/....' by editing those files in the hparams folder. Then, if an issue arises with librosa, update librosa to 0.6.2 and it will start working. Maybe this could help you guys.
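For example (hypothetical keys, since the exact names depend on which yaml file you edit): in the files under the hparams folder, change `data_path: '/data/private/vc/datasets/...'` to `data_path: './data/private/vc/datasets/...'`, and then run `pip install librosa==0.6.2`.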

Muhammad-MujtabaSaeed · May 13 '19 06:05

@Muhammad-MujtabaSaeed I have made those changes, but I am still stuck with this issue. It would be nice of you to share the exact changes you made so that I can cross-check them.

syedKhutub · May 13 '19 11:05

> @Muhammad-MujtabaSaeed I have made those changes, but I am still stuck with this issue. It would be nice of you to share the exact changes you made so that I can cross-check them.

Still stuck. It doesn't change anything for me either.

sallyjoy · May 13 '19 13:05

Anyone figure this out?^

mattpeng3 · May 22 '19 22:05

Is there any progress with this issue? Stuck too.

sinKettu · Aug 12 '19 09:08

After installing tensorpack 0.9.0.1, another error comes up:

```
Traceback (most recent call last):
  File "train1.py", line 78, in <module>
    train(args, logdir=logdir_train1)
  File "train1.py", line 60, in train
    launch_train_with_config(train_conf, trainer=trainer)
  File "/usr/local/lib/python2.7/dist-packages/tensorpack/train/interface.py", line 90, in launch_train_with_config
    model.get_input_signature(), input,
  File "/usr/local/lib/python2.7/dist-packages/tensorpack/utils/argtools.py", line 200, in wrapper
    value = func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorpack/graph_builder/model_desc.py", line 90, in get_input_signature
    inputs = self.inputs()
  File "/usr/local/lib/python2.7/dist-packages/tensorpack/graph_builder/model_desc.py", line 122, in inputs
    raise NotImplementedError()
NotImplementedError
```

I'm not sure whether there is some problem with my TensorFlow version. Has anyone met the same problem? Still stuck...

yifanliuu · Nov 10 '19 11:11

> Is there any progress with this issue? Stuck too.

Hello, I am also experiencing the same problem now. Did you solve it later?

neil3212080 · Feb 26 '20 02:02