Reusing parameter with different variable name not working
As discussed in #446, ReuseParams.get_absolute_name_scope_prefix does not infer the name of a reused parameter correctly if a custom_func is used.
I have added a simple test case in #448, where I share the W parameter of a LinearLayer with the QKV parameter of a SelfAttentionLayer:
"self_att": {"class": "self_attention", "num_heads": n_heads, "total_key_dim": n_total, "n_out": n_total},
"linear": {"class": "linear", "n_out": n_total * 3, "activation": None, "with_bias": False,
"reuse_params": {
"auto_create_missing": False, # should not matter as we do not have any bias
"map": {"W": {"reuse_layer": "self_att", "custom": custom}}}}
with
def custom(reuse_layer, *args, **kwargs):
  return reuse_layer.params['QKV']
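For reference, assembling the relevant pieces into a minimal network dict gives roughly the following (n_heads and n_total are placeholders here, not necessarily the exact values of the test in #448):

n_heads = 4
n_total = 40  # placeholder dimensions

def custom(reuse_layer, *args, **kwargs):
  # Hand back the already-created QKV variable of the self-attention layer.
  return reuse_layer.params['QKV']

network = {
  "self_att": {"class": "self_attention", "num_heads": n_heads, "total_key_dim": n_total, "n_out": n_total},
  "linear": {"class": "linear", "n_out": n_total * 3, "activation": None, "with_bias": False,
    "reuse_params": {
      "auto_create_missing": False,  # should not matter as we do not have any bias
      "map": {"W": {"reuse_layer": "self_att", "custom": custom}}}},
}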
This gives the error:
File "/home/runner/work/returnn/returnn/returnn/tf/network.py", line 703, in _create_layer
line: layer = layer_class(**layer_desc)
locals:
layer = <not found>
layer_class = <local> <class 'returnn.tf.layers.basic.LinearLayer'>
layer_desc = <local> {'n_out': 40, 'activation': None, 'with_bias': False, 'reuse_params': <ReuseParams reuse_layer None, map {'W': <ReuseParams reuse_layer <SelfAttentionLayer 'self_att' out_type=Data(shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])>, map None>}>, 'sources': [<SourceLayer '..., len = 8
File "/home/runner/work/returnn/returnn/returnn/tf/layers/basic.py", line 1430, in __init__
line: weights = self.add_param(tf_compat.v1.get_variable(
name="W", shape=weights_shape, dtype=tf.float32, initializer=fwd_weights_initializer))
locals:
weights = <not found>
self = <local> <LinearLayer 'linear' out_type=Data(shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])>
self.add_param = <local> <bound method LayerBase.add_param of <LinearLayer 'linear' out_type=Data(shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])>>
tf_compat = <global> <module 'returnn.tf.compat' from '/home/runner/work/returnn/returnn/returnn/tf/compat.py'>
tf_compat.v1 = <global> <module 'tensorflow._api.v2.compat.v1' from '/home/runner/.local/lib/python3.7/site-packages/tensorflow/_api/v2/compat/v1/__init__.py'>
tf_compat.v1.get_variable = <global> <function get_variable at 0x7fe7e9d20cb0>
name = <not found>
shape = <not found>
weights_shape = <local> (40, 40)
dtype = <not found>
tf = <global> <module 'tensorflow' from '/home/runner/.local/lib/python3.7/site-packages/tensorflow/__init__.py'>
tf.float32 = <global> tf.float32
initializer = <not found>
fwd_weights_initializer = <local> <tensorflow.python.ops.init_ops.GlorotUniform object at 0x7fe7dc3df790>
File "/home/runner/work/returnn/returnn/returnn/tf/layers/base.py", line 840, in add_param
line: name_scope_prefix = self.reuse_params.get_absolute_name_scope_prefix(base_layer=self, param=param)
locals:
name_scope_prefix = <not found>
self = <local> <LinearLayer 'linear' out_type=Data(shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])>
self.reuse_params = <local> <ReuseParams reuse_layer None, map {'W': <ReuseParams reuse_layer <SelfAttentionLayer 'self_att' out_type=Data(shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])>, map None>}>
self.reuse_params.get_absolute_name_scope_prefix = <local> <bound method ReuseParams.get_absolute_name_scope_prefix of <ReuseParams reuse_layer None, map {'W': <ReuseParams reuse_layer <SelfAttentionLayer 'self_att' out_type=Data(shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])>, map None>}>>
base_layer = <not found>
param = <local> <tf.Variable 'self_att/QKV:0' shape=(40, 120) dtype=float32>
File "/home/runner/work/returnn/returnn/returnn/tf/layers/base.py", line 1714, in get_absolute_name_scope_prefix
line: assert self.auto_create_missing
locals:
self = <local> <ReuseParams reuse_layer None, map {'W': <ReuseParams reuse_layer <SelfAttentionLayer 'self_att' out_type=Data(shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])>, map None>}>
self.auto_create_missing = <local> False
AssertionError:
As @albertz mentioned in #446:
The bug is somewhat clear. When you get into add_param, you already have gotten the param correctly. That is correctly done in tf.get_variable via the variable scope (via var_creation_scope). However, the task of add_param is just to add it to self.params, under the correct name (W here). The code is somewhat complicated and complex for that, because we try to infer it from the created variable.
I think this needs to be changed: the whole name-inference logic for the case where we do something custom with the parameter. We can catch that in var_creation_scope. We could still leave the (simple) default case. But whenever we do sth custom there, we could catch the name from a custom getter, store that in param._RETURNN_layer_map_name or so, and then read it in add_param.
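A rough sketch of what such a custom getter could look like (the hook point and attribute handling are hypothetical here, not the actual RETURNN implementation):

def make_name_recording_getter(map_name):
  # Wraps the default variable getter and remembers under which layer-map
  # name (e.g. "W") the parameter was requested, so that add_param can later
  # read it from param._RETURNN_layer_map_name instead of trying to infer it
  # from the variable scope.
  def custom_getter(getter, name, *args, **kwargs):
    param = getter(name, *args, **kwargs)
    param._RETURNN_layer_map_name = map_name  # hypothetical attribute, as suggested above
    return param
  return custom_getter

var_creation_scope could install such a getter via the custom_getter argument of tf_compat.v1.variable_scope whenever something custom is done with the parameter.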
I was thinking a bit further about this. The problem is that self.params of a layer is somewhat ambiguous (or rather not really well defined):
- Should we guarantee that layer.params["W"] for a LinearLayer always exists, and represents the weights for multiplication? (The layer could also expose self.weights or so, if we need to have sth like that.) So, that means in certain cases (e.g. with weight norm, or weight noise, or weight dropout, or custom getter, etc.) it is not a tf.Variable but a tf.Tensor.
- Should we guarantee that layer.params values are always tf.Variable?
- If a variable is in layer.params, does it mean that this layer owns this variable? (So e.g. reused params would then not end up there.)
- In case the layer has sublayers, should layer.params also contain all params of all sublayers (and then recursively)? (This is currently the case.)
It's currently used to collect all variables in a network, by iterating over all layers and collecting all variables from layer.params. Of course, we can easily change that code/logic.
It is also used for the reuse_params logic. I'm not really sure whether you would reuse params from another layer which itself reuses the params from somewhere else. Probably noone has ever used it like that. It is probably more clean if you directly reuse the params where they are created.
So maybe we should change it, that layer.params only contains variables owned/created by the layer. Then the logic in add_param also becomes simpler. If the namespace of the variable does not match the namespace of the layer, it would simply not add it to self.params.
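A sketch of how that simplified add_param logic could look (illustrative only; the scope handling and error cases of the real code are omitted):

def add_param(self, param):
  # Only add the param to self.params if it lives in this layer's own name
  # scope, i.e. if this layer actually created/owns it.
  own_prefix = self.get_absolute_name_scope_prefix()  # e.g. "linear/"
  if param.name.startswith(own_prefix):
    base_name = param.name[len(own_prefix):].split("/")[0].split(":")[0]  # e.g. "W"
    self.params[base_name] = param
  return param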
I'm not really sure whether you would reuse params from another layer which itself reuses the params from somewhere else. Probably noone has ever used it like that. It is probably more clean if you directly reuse the params where they are created.
Ironically, the config I have been using since I started here (which I received from a Hiwi before me) does exactly this:
There is a layer target_embed, which reuses the params of source_embed.
Then there is output_prob (a projection layer), which reuses the transposed parameters of target_embed (using a custom_func that calls tf.transpose).
We could of course easily change this so that output_prob reuses the parameters directly from source_embed.
But especially when using a custom_func to transform the parameters, such transitive parameter reusing might be pretty useful.
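For concreteness, the relevant part of such a config looks roughly like this (heavily simplified; dimensions and most layer options are placeholders):

import tensorflow as tf

vocab_size, emb_dim = 10000, 512  # placeholders

def transpose_params(reuse_layer, *args, **kwargs):
  # Reuse target_embed's weight matrix, transposed, for the output projection.
  return tf.transpose(reuse_layer.params["W"])

"source_embed": {"class": "linear", "activation": None, "with_bias": False, "n_out": emb_dim},
"target_embed": {"class": "linear", "activation": None, "with_bias": False, "n_out": emb_dim,
  "reuse_params": "source_embed"},
"output_prob": {"class": "linear", "activation": "softmax", "with_bias": False, "n_out": vocab_size,
  "reuse_params": {"map": {"W": {"reuse_layer": "target_embed", "custom": transpose_params}}}},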
One major thing that I would not distinguish, but which (as far as I can tell) is handled differently right now, is whether parameters are shared via a custom_func that applies some operation, or whether they are simply copied:
In my example above, layer.params["W"] exists for source_embed and target_embed, but not for output_prob.
That's unintuitive I would say.
So maybe we should change it, that layer.params only contains variables owned/created by the layer. Then the logic in add_param also becomes simpler. If the namespace of the variable does not match the namespace of the layer, it would simply not add it to self.params.
That proposal would fix exactly that problem. However, it has the downside that such transitive reusing as in my config above would no longer work (unless we at least add layer.reused_params or sth like that).
I don't really know how many people use that right now, but I would guess at least in the MT group some configs would break (even though they would be simple to fix).
Another option I can think of is:
layer.params["W"] always exists, but is a tf.Variable exactly iff it is owned by the layer, and otherwise a tf.Tensor (iff it was reused from somewhere else).
This means sharing parameters without transforming them would always apply tf.identity.
Then, configs as above would still work, but it is also clear if a parameter is owned by the layer or just reused.
The logic in add_param would also become simpler I think, as it could just always add the param to layer.params.
But I don't know the bigger picture, maybe this is non-sense :D
Edit: Hm, oh wait, but that does not fix the current issue here. We would then still need to keep track of the variable name somehow.
Btw, because add_param currently distinguishes between tf.Variable and tf.Tensor, you can bypass this issue by wrapping the reused variable in tf.identity inside the custom_func (as a temporary workaround):
def custom(reuse_layer, *args, **kwargs):
  # Wrapping in tf.identity makes add_param see a tf.Tensor instead of a tf.Variable.
  return tf.identity(reuse_layer.params['QKV'])
In this case, the logic of add_param is already as @albertz originally suggested (do not add reused variables to layer.params).
Yeah, but while tf.identity is cheap, I would avoid it, especially just to have it as a workaround. But we don't need a workaround like that. First of all, it's not really so important to know who owns the param, and we can also always derive that from the namespace (or by other means if needed).
Good that you mention that this use case seems more common than I thought. We should really try to avoid breaking configs, especially when it is not so difficult to keep them working.
Maybe we can keep the logic of layer.params as is, and always add the param. And also under the right name. This is like my suggestion via var_creation_scope and param._RETURNN_layer_map_name.
I wonder a bit about more complex cases, when it becomes a tensor due to some transformations, e.g. when you add weight noise or so. Should we add that tensor to layer.params then, or the original variable? However, not in all cases the "original variable" even exists. E.g. maybe due to some custom getter, it is internally represented as a low-rank approximation, or with weight norm, or whatever. And then, when you want to reuse that param in another layer, should it use the transformed tensor, or the original variable? If it uses the tensor, we need to make sure that weight noise or other things are not applied twice. Although maybe some transformation you want to apply (I don't know...). This is maybe tricky to get right in all possible cases.
Maybe we can keep the logic of layer.params as is, and always add the param. And also under the right name. This is like my suggestion via var_creation_scope and param._RETURNN_layer_map_name.
Okay sounds good then!
I wonder a bit about more complex cases, when it becomes a tensor due to some transformations, e.g. when you add weight noise or so. Should we add that tensor to layer.params then, or the original variable? However, not in all cases the "original variable" even exists. E.g. maybe due to some custom getter, it is internally represented as a low-rank approximation, or with weight norm, or whatever. And then, when you want to reuse that param in another layer, should it use the transformed tensor, or the original variable? If it uses the tensor, we need to make sure that weight noise or other things are not applied twice. Although maybe some transformation you want to apply (I don't know...). This is maybe tricky to get right in all possible cases.
I also don't really know, apart from transposing matrices I have not used this before. Intuitively I would say layer.params should exactly contain what add_param returns (i.e. the transformed tensor). Maybe we can discuss this further elsewhere, unrelated to this concrete issue.
I also don't really know, apart from transposing matrices I have not used this before. Intuitively I would say layer.params should exactly contain what add_param returns (i.e. the transformed tensor).
Yes, I think this is probably the most reasonable solution.
So I think most of it is clear then. (var_creation_scope and param._RETURNN_layer_map_name might be a bit technically involved, but should still be straightforward.)
What's maybe missing is the logic of things like weight noise, weight dropout, weight norm and other potential things a layer could apply. I.e. the decision when to apply it. I can think of several heuristics (all with some advantages/disadvantages):
- Apply if it is a tf.Variable. This mostly works. But when you use multiple transformations (e.g. weight noise + weight norm), it would not work.
- Apply if the layer owns the tensor or variable, simply by checking the namespace (see the sketch below). This also might be good. But e.g. if the new layer has weight norm or weight dropout enabled, but the original layer (from reuse_layer) does not, it would ignore the setting of the new layer. But maybe that's ok, and we should not really overthink this. This also works if multiple transformations are applied (in the owning layer).
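A sketch of that second heuristic, i.e. deciding by namespace ownership whether this layer should apply such a transformation (function names are illustrative only):

import tensorflow as tf

def layer_owns_param(layer, param):
  # The layer "owns" the param iff its name lies inside the layer's own name scope.
  return param.name.startswith(layer.get_absolute_name_scope_prefix())

def maybe_apply_weight_noise(layer, param, stddev=0.01):
  # Only the owning layer applies weight noise, so it is not applied a second
  # time when the param is reused by another layer.
  if not layer_owns_param(layer, param):
    return param
  return param + tf.random.normal(tf.shape(param), stddev=stddev)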
Maybe we can discuss this further elsewhere, unrelated to this concrete issue.
I think here is a good place, because this issue is exactly about that problem (or one symptom of it). But also, I think we are already almost finished. I think everything is clear how we could do it.