returnn Data special axes (feature, time) should be removed

A Data object has time_dim_axis and feature_dim_axis. There are many automatic rules how to automatically define and set them. But these rules are somewhat arbitrary and not always straight-forward.

Many layers use these special axes to perform certain operations on it. (E.g. LinearLayer uses the feature axis. RecLayer uses both time and feature. Etc.)

The user can be sometimes confused or not notice when feature or time is set in a specific way.

So this issue is the proposal to completely remove the special axes in Data.

Related: DimensionTag also distinguishes between feature and spatial. But this distinction is completely arbitrary. There also can be multiple spatial and and multiple feature dims. I'm not sure if it is really helpful. Maybe this distinction should also be removed?

I'm not sure yet about how to handle this in all the layers which make use of the special axes. Those layers can all accept some axis option to make it explicit. However, this has problems:

We do not want to break old configs. So somehow the old behavior must still work.
Maybe the user code becomes to complicated to write? Writing linear always with specifying the axis?

In some cases, we maybe can infer some reasonable default. Similar as feature_dim_axis already behaves when it is set to NotSpecified. But this should not be too arbitrary, and we have to be careful here. Also we still might break old configs this way.

Or keep the DimensionTag distinction, and infer the default time/feature from it, when it is unique? And if it is not unique, it would throw an error, and require the user to specify it explicitly.

Aug 24 '21 08:08 albertz

Note, related is #391, which also includes some comments on this. Also related are the recent changes on dim tags via #579.

Aug 24 '21 08:08 albertz

One thing I also don't like about this currently is that it requires a lot of extra code when writing layers: E.g. we currently have this spaghetti moster in GatherLayer, just because feature_dim_axis may be NotSpecified:

    # feature_dim_axis needs to be handled differently if it is NotSpecified
    if (input_data.feature_dim_axis_or_unspecified is NotSpecified and
            position_data.feature_dim_axis_or_unspecified is NotSpecified):
      out_type["feature_dim_axis"] = NotSpecified
    elif input_data.feature_dim_axis_or_unspecified is NotSpecified:
      assert position_data.feature_dim_axis_or_unspecified is not NotSpecified
      if position_data.feature_dim_axis is not None:
        out_type["feature_dim_axis"] = cls._translate_position_axis(
          position_data.feature_dim_axis, old_gather_axis, common_axes_position, position_axes)
      else:
        out_type["feature_dim_axis"] = NotSpecified
    elif position_data.feature_dim_axis_or_unspecified is NotSpecified:
      assert input_data.feature_dim_axis_or_unspecified is not NotSpecified
      if input_data.feature_dim_axis is not None:
        out_type["feature_dim_axis"] = cls._translate_input_axis(
          input_data.feature_dim_axis, old_gather_axis, common_axes_input, input_axes, position_axes)
      else:
        out_type["feature_dim_axis"] = NotSpecified
    else:
      assert input_data.feature_dim_axis_or_unspecified is not NotSpecified
      assert position_data.feature_dim_axis_or_unspecified is not NotSpecified
      pass  # keep the logic as before

(just wanted to add this to the discussion. I don't quite know about how to solve this without breaking configs / the other problems you mentioned.)

Aug 26 '21 12:08 Zettelkasten

One thing I also don't like about this currently is that it requires a lot of extra code when writing layers: E.g. we currently have this spaghetti moster in GatherLayer, just because feature_dim_axis may be NotSpecified:

Yep.

We have the problem though that we need to keep this logic around to not break old configs. Maybe we can encapsulate it a bit better though, such that it is not too much mixed with the main logic of the layer itself.

We don't need to introduce such complex logic for new layers, though. If some new layer changes the dimensions / shape or so in any way, it could simply discard the special axes (thus it would fall back to default behavior).

Aug 26 '21 13:08 albertz

Actually maybe this issue is not really about removing this (which I think is not possible without breaking old configs) but just about disabling it with a new behavior version, and deprecating it.

Aug 26 '21 13:08 albertz

What we need in any case here is a list of layers and other util functions which make use of these special axes. Maybe a wiki page should be created for this.

And then each layer on this list must get additional support to explicitly specify the axis (e.g. by dim tag). This must happen before we can proceed with this issue here.

Aug 26 '21 13:08 albertz

Actually maybe this issue is not really about removing this (which I think is not possible without breaking old configs) but just about disabling it with a new behavior version, and deprecating it.

I'm not sure if we wouldn't want to keep it in some form. Actually, also _ConcatInputLayer always operates on the feature dim currently (and therefore almost every layer). That would be too annoying to always specify, I think we still need some sort of feature_dim_axis.

For the time axis, yes, let's remove it and make it explicit always (maybe keep T as a shorthand when it is unique).

Aug 26 '21 13:08 Zettelkasten

Actually, also _ConcatInputLayer always operates on the feature dim currently (and therefore almost every layer). That would be too annoying to always specify, I think we still need some sort of feature_dim_axis.

But this concat logic is optional and only used when you have multiple inputs. This can also be disallowed. This can also be explicit instead (maybe by some new ConcatLayer, where for each input, you explicitly specify the axis / dim tag).

The reasoning for this issue here is that there can be cases where it is not clear / straightforward what axis is actually declared as feature (or time). Not being straightforward is basically implied because the logic for determining the feature_dim_axis is so complex.

We want to avoid that unexpected behavior can happen. Even more so when this can happen without error, i.e. being unnoticed then. E.g. if the feature axis is set to another axis than the user expects, the concat would still work fine, and LinearLayer or other layers would still operate perfectly fine, and there would be no error, but the outcome would just be different. This can be a big problem. And I know this has happened in the past.

Aug 26 '21 13:08 albertz

But this concat logic is optional and only used when you have multiple inputs. This can also be disallowed. This can also be explicit instead (maybe by some new ConcatLayer, where for each input, you explicitly specify the axis / dim tag).

Ah yea, that's true. Okay, that is not a problem then, nevermind.

Yes, I agree .. the current way it is is super dangerous.

Aug 26 '21 15:08 Zettelkasten

This discussion has come up again also in the context of returnn_common here: rwth-i6/returnn_common#17

Although returnn_common could potentially be more restrictive than we need to be here. E.g.:

It could just not allow to specify an axis by string (no "F" or "T"), and always require it to be specified by dim tag (this also needs unique dim tags then, #632).
We could introduce some "unique_feature" or "unique_time" which are allowed when they are non-ambiguous and unique, according to some simple rules (e.g. the only static dim, or the only dynamic dim), and returnn_common can introduce placeholders B, T and F which map this.

Nov 09 '21 08:11 albertz

Back to the discussion here on any changes of Data.feature_dim_axis, Data.time_dim_axis.

Breaking changes are in general possible under the conditions:

new behavior version
many configs would not need any changes
for those configs which need changes: very easy and straightforward to adopt

So, we should maybe specify exactly some potential options we can do here.

Nov 09 '21 08:11 albertz

One option:

E.g. Data.feature_dim_axis returns some default axis when it is unique (e.g. the only static dim), and otherwise raise Ambiguous or so. Same for time_dim_axis.

It would be disallowed to assign them.

Nov 09 '21 08:11 albertz

What we need in any case here is a list of layers and other util functions which make use of these special axes. Maybe a wiki page should be created for this.

I started such a list here: https://github.com/rwth-i6/returnn/wiki/Special-axes

Nov 09 '21 09:11 albertz

I still wonder whether there are still valid use cases for this. Or rather, when thinking about how usual code would look like, it is very common that there is the batch dim, and also one specific well defined feature dim. And very often also one spatial dim (time dim). E.g. imagine you write such modules like self-attention, Transformer (example: https://github.com/rwth-i6/returnn_common/pull/55), or Conformer. If we enforce dim tags to be explicit, it means you always need to explicitly pass around batch, time, feature, next to the layer ref. But then you could think of bundling batch, time, feature next to the layer ref together, and now you almost end up with what Data is, with the special axes.

I wonder now, when do we actually have cases where e.g. the feature dim is not well defined and it is not immediately clear, non-ambiguously, what it is? There are some such cases, but I think actually not so many. If we solve those cases by making it always explicit for those cases, such that there is never any ambiguity, would it be fine then? Is this possible?

So here some cases where this is (maybe?) ambiguous:

SplitDimsLayer (#596, #683, #704, #705)
MergeDimsLayer
DotLayer
GatherLayer
ReduceLayer
UnflattenNdLayer
ConcatLayer (e.g. time dim is marked to first spatial in one input, second spatial in second input)

Basically all those which somehow end up with new dims, or remove (reduce) dims. SplitDimsLayer and MergeDimsLayer basically do both at the same time. Or DotLayer and GatherLayer would combine the dims from multiple tensors. Which feature dim or time dim should it take when there are multiple?

So, maybe we can make those explicit? Or esp for the case when it would be ambiguous otherwise?

Nov 15 '21 22:11 albertz

Maybe it makes also sense for some intermediate solution. I think it's fine to treat the batch dim special anyway. And the feature dim can maybe also be kept special for all cases when it is non-ambiguous, and make it explicit when there is any potential for confusion. However, time dim can be confusing. Maybe we can allow sth like unique_spatial (unique_time), which is defined when there are exactly 3 (implicit or explicit) dimensions and the other are batch and feature. But otherwise it would be better if it is just always explicit. Or always require the time (spatial) dim to be explicit, for consistency.

So, to summarize:

Maybe just the time dim should be removed. Batch and feature should be kept.
Feature should be non-ambiguous, i.e. explicitly defined after all layers where it might be ambiguous otherwise (the list of layers above).

Nov 16 '21 22:11 albertz

We could also rely on the DimensionTag.kind and allow "B", "T", "F" exactly when it is unique according to the dim tags.

Nov 23 '21 19:11 albertz

Note there is #1273, and the conclusion was to keep feature_dim/feature_dim_axis.

time_dim_axis is deprecated though and by default disabled in the new Tensor.

I guess this issue here can be closed.

Apr 13 '23 12:04 albertz

returnn returnn copied to clipboard

Data special axes (feature, time) should be removed

returnn
returnn copied to clipboard