Variable model_1/loss/ExponentialMovingAverage/ does not exist

Open demiguo opened this issue 6 years ago • 13 comments

Hi, I'm running the dev branch code on TensorFlow 1.2.

And I got this error: Variable model_1/loss/ExponentialMovingAverage/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?

From the stack trace, it comes from basic/model.py, in _build_ema, at ema_op = ema.apply(tensors).

I tried adding "with tf.variable_scope(tf.get_variable_scope(), reuse=False):" before ema.apply, but that still doesn't work.

Any ideas on how I can fix this?

Thanks! I'm using CUDA 8.0 and cuDNN 5.1, TensorFlow v1.2, Python 3.5.

demiguo avatar Aug 23 '17 18:08 demiguo

I found this problem only occurs in multi-GPU training. The same code runs fine as long as --num_gpus is not greater than 1.

Gandor26 avatar Sep 18 '17 00:09 Gandor26

@demiguo I have the same issue. Have you solved the problem?

xingjinglu avatar Feb 07 '18 09:02 xingjinglu

@demiguo I have the same issue. Because of this I can't train on a multi-GPU setup. Has anyone solved the problem?

uditsaxena avatar Feb 09 '18 05:02 uditsaxena

I have the same problem

ghost avatar Apr 13 '18 12:04 ghost

Anyone found anything?

vidhumalik avatar Apr 21 '18 12:04 vidhumalik

The same issue occurs on a version patched to run on TF r1.7. Note that the following version works on 1 GPU but not on more than 1 GPU: https://github.com/klintan/bi-att-flow/tree/dev

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/levinth/bi-att-flow-zt1/basic/cli.py", line 128, in <module>
    tf.app.run()
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/levinth/bi-att-flow-zt1/basic/cli.py", line 125, in main
    m(config)
  File "/home/levinth/bi-att-flow-zt1/basic/main.py", line 26, in main
    _train(config)
  File "/home/levinth/bi-att-flow-zt1/basic/main.py", line 85, in _train
    models = get_multi_gpu_models(config)
  File "/home/levinth/bi-att-flow-zt1/basic/model.py", line 21, in get_multi_gpu_models
    model = Model(config, scope, rep=gpu_idx == 0)
  File "/home/levinth/bi-att-flow-zt1/basic/model.py", line 68, in __init__
    self._build_ema()
  File "/home/levinth/bi-att-flow-zt1/basic/model.py", line 298, in _build_ema
    ema_op = ema.apply(tensors)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/training/moving_averages.py", line 405, in apply
    "VarHandleOp"]))
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/training/slot_creator.py", line 179, in create_zeros_slot
    colocate_with_primary=colocate_with_primary)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/training/slot_creator.py", line 156, in create_slot_with_initializer
    dtype)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/training/slot_creator.py", line 65, in _create_slot_var
    validate_shape=validate_shape)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/ops/variable_scope.py", line 1297, in get_variable
    constraint=constraint)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/ops/variable_scope.py", line 1093, in get_variable
    constraint=constraint)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/ops/variable_scope.py", line 439, in get_variable
    constraint=constraint)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/ops/variable_scope.py", line 408, in _true_getter
    use_resource=use_resource, constraint=constraint)
  File "/home/levinth/tf_r1.7_c91_712_py3/tensorflow/python/ops/variable_scope.py", line 765, in _get_single_variable
    "reuse=tf.AUTO_REUSE in VarScope?" % name)
ValueError: Variable model_1/loss/ExponentialMovingAverage/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=tf.AUTO_REUSE in VarScope?

David-Levinthal avatar Apr 21 '18 16:04 David-Levinthal

Hi David,

My name is Tian. I'm moving our discussion from email to this issue ticket so that other developers who have issues with this error could see it.

The issue with the original bidaf implementation is that it only creates one loss variable for the model. On a single GPU this is fine, because you only have one model. In a multi-GPU setting, the way bidaf implements multi-GPU training is that it replicates the model for every GPU device, and assigns one model on one device. This means that each model would require its own loss variable. If the developer only specifies one loss variable, tensorflow would try to reuse the loss variable for every model, which would create a conflict.

For example, in your error, if you only have one device, the name of the loss variable would be model_0/loss/ExponentialMovingAverage. If you have two devices, another loss variable called model_1/loss/ExponentialMovingAverage would be referenced by tensorflow. Since this variable is not created before you generate the whole model, tensorflow would try to reuse the variable you previously generated for model_0. Does that make sense?

The way to resolve this conflict is to create a loss variable for every model that is replicated: https://github.com/stanford-futuredata/dawn-bench-models/blob/master/tensorflow/SQuAD/basic/model.py#L25:L36.
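
A minimal sketch of that idea, reusing the get_multi_gpu_models/Model names from the traceback above (the scope handling here is illustrative, not the exact code from the linked file):

import tensorflow as tf

def get_multi_gpu_models(config):
    models = []
    for gpu_idx in range(config.num_gpus):
        # Give every replica its own variable scope so that the EMA/loss
        # variables it creates (e.g. model_1/loss/ExponentialMovingAverage)
        # are fresh variables rather than attempted reuses of model_0's.
        with tf.device("/gpu:%d" % gpu_idx), \
             tf.variable_scope("model_%d" % gpu_idx) as scope:
            models.append(Model(config, scope, rep=(gpu_idx == 0)))
    return models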

Unfortunately I don't have a multi-GPU node available. Would you mind trying this patch on your node to see if it works?

kelayamatoz avatar May 01 '18 21:05 kelayamatoz

I will modify the Andreas Klintberg fork of this, as that is the code base that works on top-of-tree TF; nothing else does, due to the change in the handling of flags. https://www.linkedin.com/in/andreas-klintberg-b7655710/

But I have to wait until the 4-GPU machine I set this up on gets freed up... I don't want to build everything again. LOL d

David-Levinthal avatar May 01 '18 22:05 David-Levinthal

The new distro is https://github.com/klintan/bi-att-flow/tree/dev

David-Levinthal avatar May 02 '18 20:05 David-Levinthal

I took model.py from the dawn distribution and added it to my modified (for printout and speed logging) version of the klintberg distro. It appears to not actually pay attention to the num_gpus flag and started 4 processes on the 4 V100s.

With CUDA_VISIBLE_DEVICES=0 and num_gpus=1, default batch size:

global_step: 100, avg_loss = 8.443472, time = 336.919500
global_step: 200, avg_loss = 7.691555, time = 333.118028
global_step: 300, avg_loss = 7.293336, time = 332.794547
global_step: 400, avg_loss = 6.585279, time = 330.941432

nvidia-smi showed 1 process.

Unsetting CUDA_VISIBLE_DEVICES and rerunning, one sees 4 processes in nvidia-smi but only 1 GPU being active :-)

global_step: 100, avg_loss = 8.414022, time = 353.267652
global_step: 200, avg_loss = 7.680358, time = 343.982275
global_step: 300, avg_loss = 7.316520, time = 346.749630
global_step: 400, avg_loss = 6.531937, time = 344.386899

With CUDA_VISIBLE_DEVICES=0,1,2,3 and num_gpus=4:

global_step: 100, avg_loss = 8.122040, time = 669.353025
global_step: 200, avg_loss = 6.920084, time = 651.585218
global_step: 300, avg_loss = 5.956660, time = 648.792006
global_step: 400, avg_loss = 5.126814, time = 650.647822
global_step: 500, avg_loss = 4.219713, time = 648.755034

So I am a bit unsure whether things are actually going faster when fanned out. Lowering the batch size to 15 while running on 4 GPUs does not change the output much:

global_step: 100, avg_loss = 8.417717, time = 518.246054
global_step: 200, avg_loss = 7.639581, time = 492.312914
global_step: 300, avg_loss = 7.274171, time = 493.591461
global_step: 400, avg_loss = 6.699906, time = 506.484376

David-Levinthal avatar May 03 '18 20:05 David-Levinthal

Hi @demiguo. On TensorFlow 1.12.0, I had the same problem and fixed it by adding the line:

        with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):

before ema.apply
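
In context, the change sits around the ema.apply call in basic/model.py's _build_ema, roughly like this (a sketch only; the decay value and the way tensors is collected are stand-ins for whatever the original method actually does):

    def _build_ema(self):
        ema = tf.train.ExponentialMovingAverage(self.config.decay)  # decay as in the original
        self.ema = ema
        tensors = ...  # collected as in the original method
        # AUTO_REUSE lets model_1, model_2, ... create (or look up) their
        # EMA shadow variables instead of failing the strict reuse check.
        with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
            ema_op = ema.apply(tensors)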

shimafoolad avatar Nov 25 '18 08:11 shimafoolad

First, some context on this issue. It applies only when a distinct WorkerX/GPUX/etc. tf.name_scope() is created around each instantiation of a model that uses tf.train.ExponentialMovingAverage (commonly used by Batch Normalization). If a "WorkerX" tf.variable_scope() had been used instead, there would be no possibility of reuse, because variables created with tf.get_variable() ignore tf.name_scope() but not tf.variable_scope(); calling tf.get_variable() inside different tf.variable_scope()s can only produce different variables.

On the other hand, if there were neither a tf.name_scope() nor a tf.variable_scope() around the different GPU workers, variables created with either tf.Variable or tf.get_variable() would share exactly the same scope, so both could be reused properly. The underlying graph would probably not be very pretty, though, because operations would not be aggregated per worker (an aesthetic/design/maintenance issue).
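
A small, self-contained illustration of that scoping rule (hypothetical worker names, TF 1.x graph mode):

import tensorflow as tf

# tf.get_variable ignores tf.name_scope: both calls below resolve to the
# variable name "w", so the second call reuses the first variable.
with tf.name_scope("worker_0"):
    w0 = tf.get_variable("w", shape=[3])
with tf.name_scope("worker_1"):
    with tf.variable_scope(tf.get_variable_scope(), reuse=True):
        w1 = tf.get_variable("w", shape=[3])
assert w0 is w1

# Under distinct tf.variable_scope()s the names differ ("worker_0/w" vs
# "worker_1/w"), so there is nothing to reuse between workers.
with tf.variable_scope("worker_0"):
    v0 = tf.get_variable("w", shape=[3])
with tf.variable_scope("worker_1"):
    v1 = tf.get_variable("w", shape=[3])
assert v0 is not v1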

But as I understand it, wrapping ema.apply in with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE), as @shimafoolad suggests, drops the reuse of loss/ExponentialMovingAverage/ across GPUs, as well as of any shadow variable that ema.apply creates with tf.Variable. That would hurt the learning of Batch Normalization layers in distributed training, so it does not seem like a good solution.

Maybe there is a way in which the main variables of the BN layers would still be reused, but I have found no explanation of such a mechanism; perhaps this issue would be resolved by one.

masotrix avatar Aug 15 '19 16:08 masotrix

I thought about it again and realized that the trainable parameters of Batch Normalization can be defined with tf.get_variable() together with "with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE)" as @shimafoolad says, as follows:

import tensorflow as tf

def batch_norm_template(inputs, is_training, scope,
        moments_dims, bn_decay, reuse):
  with tf.variable_scope(scope, reuse=reuse) as sc: 
    num_channels = inputs.get_shape()[-1].value
    beta = tf.get_variable('beta', None, None,
        tf.constant(0.0, tf.float32, [num_channels]), None, True)
    gamma = tf.get_variable('gamma', None, None,
        tf.constant(1.0, tf.float32, [num_channels]), None, True)
    batch_mean, batch_var = tf.nn.moments(inputs,
            moments_dims, name='moments')
    decay = bn_decay if bn_decay is not None else 0.9 
    ema = tf.train.ExponentialMovingAverage(decay=decay)

    # Operator that maintains moving averages of variables.
    with tf.variable_scope(tf.get_variable_scope(),
        reuse=tf.AUTO_REUSE):
      ema_apply_op = tf.cond(is_training,
                lambda: ema.apply([batch_mean, batch_var]),
                lambda: tf.no_op())
    
    # Update moving average, return current batch's avg and var.
    def mean_var_with_update():
      with tf.control_dependencies([ema_apply_op]):
        return tf.identity(batch_mean), tf.identity(batch_var)
    
    # ema.average returns the Variable holding the average of var.
    mean, var = tf.cond(is_training,
        mean_var_with_update,
        lambda: (ema.average(batch_mean), ema.average(batch_var)))
    normed = tf.nn.batch_normalization(inputs, mean, var,
            beta, gamma, 1e-3)

  return normed

For safety, reuse should be False on the first call and True on all subsequent calls, as done in https://wizardforcel.gitbooks.io/tensorflow-examples-aymericdamien/6.2_multigpu_cnn.html.
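
For example, a hypothetical two-tower call pattern using the batch_norm_template above (device placement, shapes, and tower names are made up for illustration):

inputs = tf.placeholder(tf.float32, [None, 64])
is_training = tf.placeholder(tf.bool)

towers = []
for gpu_idx in range(2):
    with tf.device("/gpu:%d" % gpu_idx), tf.name_scope("tower_%d" % gpu_idx):
        # reuse=False on the first tower creates beta/gamma under "bn1";
        # reuse=True on later towers shares those same variables.
        normed = batch_norm_template(inputs, is_training, scope="bn1",
                                     moments_dims=[0], bn_decay=0.9,
                                     reuse=(gpu_idx > 0))
        towers.append(normed)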

Hope it helps, especially those updating legacy TensorFlow.

masotrix avatar Aug 15 '19 17:08 masotrix