bi-att-flow
ValueError when running training with Multi-GPU
Dear Team,
I am running training with 2 Nvidia K80 GPUs, on the dev branch with TensorFlow 1.2.0 and Python 3.6.2, using the following command:
python3 -m basic.cli --mode train --noload --num_gpus 2 --batch_size 30
However, the program quits with the errors attached. We are not sure how to track down what's causing the error, and we are wondering if we could get some help.
Here is the error log:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/dawn-benchmark/tensorflow-qa-orig/bi-att-flow/basic/cli.py", line 112, in
It seems that the ExponentialMovingAverage doesn't help with the training process. Disabling the related lines resolves this issue.
After resolving this issue, I also discovered a similar reuse-variable issue with the Adam optimizer. It seems that there is an implicit global variable scope that forces all variables, including the ones created by the optimizer, to be reusable. Adding an explicit variable scope before the for-loop that creates the per-GPU models solves the issue.
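A minimal sketch of that fix, assuming TF 1.x graph-mode APIs (written against `tf.compat.v1` so it also runs on newer TensorFlow; `build_tower` is a hypothetical stand-in for the per-GPU loss construction, not the actual code in basic/model.py, and device placement is omitted):

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # TF 1.x-style graph mode

def build_tower(gpu_idx):
    # Hypothetical stand-in for the per-GPU loss built in basic/model.py.
    x = tf.ones([8, 4], name="x_%d" % gpu_idx)
    w = tf.get_variable("w", [4, 1])  # model variable, shared across towers
    return tf.reduce_mean(tf.matmul(x, w))

losses = []
with tf.variable_scope("model") as scope:  # explicit scope, not the implicit root
    for gpu_idx in range(2):
        with tf.name_scope("tower_%d" % gpu_idx):
            losses.append(build_tower(gpu_idx))
            scope.reuse_variables()  # reuse applies only inside "model"

# Outside the reusing "model" scope, Adam is free to create its slot
# variables without tripping over a scope that demands reuse.
opt = tf.train.AdamOptimizer(1e-3)
train_op = opt.minimize(tf.add_n(losses) / 2.0)
```

The point of the explicit scope is that `scope.reuse_variables()` only flips reuse for the model variables; once the `with` block exits, the optimizer's slot variables are created under the default (non-reusing) root scope.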
Hello: I have met the same problem as you, but I don't understand your solution. Which lines did you disable to solve this issue? Could you give more details? Thanks a lot!
Hey @distantJing:
Check this one:
https://github.com/kelayamatoz/bi-att-flow-lstm-extractor/blob/master/basic/model.py#L25-L36
Essentially you need to put the exponential smoothing variables into a different scope and make sure that each GPU gets a unique set of loss variables.
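A sketch of what that looks like, assuming TF 1.x graph mode via `tf.compat.v1` (the scope names here are illustrative, not the ones used in basic/model.py):

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # TF 1.x-style graph mode

losses = []
with tf.variable_scope("model") as scope:
    for gpu_idx in range(2):
        # Each tower's name scope gives its loss tensor a unique name,
        # so every GPU gets its own loss variable for the EMA to track.
        with tf.name_scope("tower_%d" % gpu_idx):
            w = tf.get_variable("w", [2])
            losses.append(tf.reduce_sum(w * w, name="loss"))
            scope.reuse_variables()

# Keep the exponential-moving-average shadow variables in their own scope,
# outside the reusing "model" scope, so apply() is allowed to create them.
with tf.variable_scope("ema"):
    ema = tf.train.ExponentialMovingAverage(decay=0.999)
    ema_op = ema.apply(losses)
```

Because `ema.apply` has to create fresh shadow variables, calling it inside a scope where reuse is already switched on is what triggers the reuse-variable error; moving it into its own scope avoids that.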
Dear Team,
I tried to train the model with 2 GPUs using the following command:
/opt/python3/bin/python3 -m basic.cli --mode train --noload --debug --len_opt --batch_size 20 --num_gpus 2
Then I got an error:
ValueError: Attempt to have a second RNNCell use the weights of a variable scope that already has weights: 'prepro/u1/fw/basic_lstm_cell'; and the cell was not constructed as BasicLSTMCell(..., reuse=True). To share the weights of an RNNCell, simply reuse it in your second calculation, or create a new one with the argument reuse=True.
I know it has something to do with multi-GPU training, but I don't know how to revise the code.
Thanks a lot!
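For what it's worth, the error message itself points at the two possible fixes: apply the same cell object twice, or construct the second cell with reuse=True. A minimal reproduction-and-fix sketch, assuming TF 1.x graph mode via `tf.compat.v1` (the prepro/u1 scope names just mirror the traceback; the real model uses a bidirectional fw/bw pair):

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # TF 1.x-style graph mode

x = tf.ones([2, 3, 8])  # [batch, time, depth]

# First use: creates the weights under prepro/u1/basic_lstm_cell.
with tf.variable_scope("prepro"):
    cell = tf.nn.rnn_cell.BasicLSTMCell(8)
    out1, _ = tf.nn.dynamic_rnn(cell, x, dtype=tf.float32, scope="u1")

# Second use (e.g. on another GPU tower): a *new* cell must be told to
# reuse the existing weights, otherwise TF raises the ValueError above.
with tf.variable_scope("prepro"):
    cell2 = tf.nn.rnn_cell.BasicLSTMCell(8, reuse=True)
    out2, _ = tf.nn.dynamic_rnn(cell2, x, dtype=tf.float32, scope="u1")
```

In the multi-GPU loop this means the towers must either share one cell instance per layer or construct every cell after the first with reuse=True, which is what the scoping fix linked above arranges.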
@kelayamatoz I found your repository https://github.com/kelayamatoz/BiDAF-MultiGPU-Fix and ran it on multiple GPUs. Thanks a lot!
@chiahsuan156 I ran the repository you referred to, https://github.com/kelayamatoz/BiDAF-MultiGPU-Fix, but I met the same error as @dengyuning did. Have you resolved it?