returnn
returnn copied to clipboard
Data special axes (feature, time) should be removed
A Data
object has time_dim_axis
and feature_dim_axis
. There are many automatic rules how to automatically define and set them. But these rules are somewhat arbitrary and not always straight-forward.
Many layers use these special axes to perform certain operations on it.
(E.g. LinearLayer
uses the feature axis. RecLayer
uses both time and feature. Etc.)
The user can be sometimes confused or not notice when feature or time is set in a specific way.
So this issue is the proposal to completely remove the special axes in Data
.
Related: DimensionTag
also distinguishes between feature and spatial. But this distinction is completely arbitrary. There also can be multiple spatial and and multiple feature dims. I'm not sure if it is really helpful. Maybe this distinction should also be removed?
I'm not sure yet about how to handle this in all the layers which make use of the special axes. Those layers can all accept some axis option to make it explicit. However, this has problems:
- We do not want to break old configs. So somehow the old behavior must still work.
- Maybe the user code becomes to complicated to write? Writing
linear
always with specifying the axis?
In some cases, we maybe can infer some reasonable default. Similar as feature_dim_axis
already behaves when it is set to NotSpecified
. But this should not be too arbitrary, and we have to be careful here. Also we still might break old configs this way.
Or keep the DimensionTag
distinction, and infer the default time/feature from it, when it is unique? And if it is not unique, it would throw an error, and require the user to specify it explicitly.
Note, related is #391, which also includes some comments on this. Also related are the recent changes on dim tags via #579.
One thing I also don't like about this currently is that it requires a lot of extra code when writing layers: E.g. we currently have this spaghetti moster in GatherLayer
, just because feature_dim_axis
may be NotSpecified
:
# feature_dim_axis needs to be handled differently if it is NotSpecified
if (input_data.feature_dim_axis_or_unspecified is NotSpecified and
position_data.feature_dim_axis_or_unspecified is NotSpecified):
out_type["feature_dim_axis"] = NotSpecified
elif input_data.feature_dim_axis_or_unspecified is NotSpecified:
assert position_data.feature_dim_axis_or_unspecified is not NotSpecified
if position_data.feature_dim_axis is not None:
out_type["feature_dim_axis"] = cls._translate_position_axis(
position_data.feature_dim_axis, old_gather_axis, common_axes_position, position_axes)
else:
out_type["feature_dim_axis"] = NotSpecified
elif position_data.feature_dim_axis_or_unspecified is NotSpecified:
assert input_data.feature_dim_axis_or_unspecified is not NotSpecified
if input_data.feature_dim_axis is not None:
out_type["feature_dim_axis"] = cls._translate_input_axis(
input_data.feature_dim_axis, old_gather_axis, common_axes_input, input_axes, position_axes)
else:
out_type["feature_dim_axis"] = NotSpecified
else:
assert input_data.feature_dim_axis_or_unspecified is not NotSpecified
assert position_data.feature_dim_axis_or_unspecified is not NotSpecified
pass # keep the logic as before
(just wanted to add this to the discussion. I don't quite know about how to solve this without breaking configs / the other problems you mentioned.)
One thing I also don't like about this currently is that it requires a lot of extra code when writing layers: E.g. we currently have this spaghetti moster in
GatherLayer
, just becausefeature_dim_axis
may beNotSpecified
:
Yep.
We have the problem though that we need to keep this logic around to not break old configs. Maybe we can encapsulate it a bit better though, such that it is not too much mixed with the main logic of the layer itself.
We don't need to introduce such complex logic for new layers, though. If some new layer changes the dimensions / shape or so in any way, it could simply discard the special axes (thus it would fall back to default behavior).
Actually maybe this issue is not really about removing this (which I think is not possible without breaking old configs) but just about disabling it with a new behavior version, and deprecating it.
What we need in any case here is a list of layers and other util functions which make use of these special axes. Maybe a wiki page should be created for this.
And then each layer on this list must get additional support to explicitly specify the axis (e.g. by dim tag). This must happen before we can proceed with this issue here.
Actually maybe this issue is not really about removing this (which I think is not possible without breaking old configs) but just about disabling it with a new behavior version, and deprecating it.
I'm not sure if we wouldn't want to keep it in some form.
Actually, also _ConcatInputLayer
always operates on the feature dim currently (and therefore almost every layer).
That would be too annoying to always specify, I think we still need some sort of feature_dim_axis
.
For the time axis, yes, let's remove it and make it explicit always (maybe keep T
as a shorthand when it is unique).
Actually, also
_ConcatInputLayer
always operates on the feature dim currently (and therefore almost every layer). That would be too annoying to always specify, I think we still need some sort offeature_dim_axis
.
But this concat logic is optional and only used when you have multiple inputs. This can also be disallowed. This can also be explicit instead (maybe by some new ConcatLayer
, where for each input, you explicitly specify the axis / dim tag).
The reasoning for this issue here is that there can be cases where it is not clear / straightforward what axis is actually declared as feature (or time). Not being straightforward is basically implied because the logic for determining the feature_dim_axis
is so complex.
We want to avoid that unexpected behavior can happen. Even more so when this can happen without error, i.e. being unnoticed then. E.g. if the feature axis is set to another axis than the user expects, the concat would still work fine, and LinearLayer
or other layers would still operate perfectly fine, and there would be no error, but the outcome would just be different. This can be a big problem. And I know this has happened in the past.
But this concat logic is optional and only used when you have multiple inputs. This can also be disallowed. This can also be explicit instead (maybe by some new
ConcatLayer
, where for each input, you explicitly specify the axis / dim tag).
Ah yea, that's true. Okay, that is not a problem then, nevermind.
Yes, I agree .. the current way it is is super dangerous.
This discussion has come up again also in the context of returnn_common here: rwth-i6/returnn_common#17
Although returnn_common could potentially be more restrictive than we need to be here. E.g.:
- It could just not allow to specify an axis by string (no
"F"
or"T"
), and always require it to be specified by dim tag (this also needs unique dim tags then, #632). - We could introduce some
"unique_feature"
or"unique_time"
which are allowed when they are non-ambiguous and unique, according to some simple rules (e.g. the only static dim, or the only dynamic dim), and returnn_common can introduce placeholdersB
,T
andF
which map this.
Back to the discussion here on any changes of Data.feature_dim_axis
, Data.time_dim_axis
.
Breaking changes are in general possible under the conditions:
- new behavior version
- many configs would not need any changes
- for those configs which need changes: very easy and straightforward to adopt
So, we should maybe specify exactly some potential options we can do here.
One option:
E.g. Data.feature_dim_axis
returns some default axis when it is unique (e.g. the only static dim), and otherwise raise Ambiguous
or so.
Same for time_dim_axis
.
It would be disallowed to assign them.
What we need in any case here is a list of layers and other util functions which make use of these special axes. Maybe a wiki page should be created for this.
I started such a list here: https://github.com/rwth-i6/returnn/wiki/Special-axes
I still wonder whether there are still valid use cases for this. Or rather, when thinking about how usual code would look like, it is very common that there is the batch dim, and also one specific well defined feature dim. And very often also one spatial dim (time dim). E.g. imagine you write such modules like self-attention, Transformer (example: https://github.com/rwth-i6/returnn_common/pull/55), or Conformer. If we enforce dim tags to be explicit, it means you always need to explicitly pass around batch, time, feature, next to the layer ref. But then you could think of bundling batch, time, feature next to the layer ref together, and now you almost end up with what Data
is, with the special axes.
I wonder now, when do we actually have cases where e.g. the feature dim is not well defined and it is not immediately clear, non-ambiguously, what it is? There are some such cases, but I think actually not so many. If we solve those cases by making it always explicit for those cases, such that there is never any ambiguity, would it be fine then? Is this possible?
So here some cases where this is (maybe?) ambiguous:
-
SplitDimsLayer
(#596, #683, #704, #705) -
MergeDimsLayer
-
DotLayer
-
GatherLayer
-
ReduceLayer
-
UnflattenNdLayer
-
ConcatLayer
(e.g. time dim is marked to first spatial in one input, second spatial in second input)
Basically all those which somehow end up with new dims, or remove (reduce) dims. SplitDimsLayer
and MergeDimsLayer
basically do both at the same time. Or DotLayer
and GatherLayer
would combine the dims from multiple tensors. Which feature dim or time dim should it take when there are multiple?
So, maybe we can make those explicit? Or esp for the case when it would be ambiguous otherwise?
Maybe it makes also sense for some intermediate solution. I think it's fine to treat the batch dim special anyway. And the feature dim can maybe also be kept special for all cases when it is non-ambiguous, and make it explicit when there is any potential for confusion. However, time dim can be confusing. Maybe we can allow sth like unique_spatial
(unique_time
), which is defined when there are exactly 3 (implicit or explicit) dimensions and the other are batch and feature. But otherwise it would be better if it is just always explicit. Or always require the time (spatial) dim to be explicit, for consistency.
So, to summarize:
- Maybe just the time dim should be removed. Batch and feature should be kept.
- Feature should be non-ambiguous, i.e. explicitly defined after all layers where it might be ambiguous otherwise (the list of layers above).
We could also rely on the DimensionTag.kind
and allow "B"
, "T"
, "F"
exactly when it is unique according to the dim tags.
Note there is #1273, and the conclusion was to keep feature_dim
/feature_dim_axis
.
time_dim_axis
is deprecated though and by default disabled in the new Tensor
.
I guess this issue here can be closed.