xla Configure layout assignment for transposes

By default, layout assignment tries to assign a layout to transposes that make them a bitcast. This layout is then propagated inside the HloComputation, which means if it does not cancel out with another transpose, there may be a copy op in some other part of the computation. This copy will later be turned back into a transpose.

In T5X, we observed that default XLA layout assignment gives us many transposes in each transformer layer, which results in perf overheads. We tried to fix this problem by manually transposing T5X network input, however, the manually inserted transpose was ignored by XLA, which also creates many transposes in each layer.

@akuegel created a PR https://github.com/openxla/xla/pull/1485 which can stop our manually inserted transpose from being ignored. It turns out that this flag really can help us get better perf. I built a unit test with one T5X encoder layer, default runtime without manual transpose is 11.02ms. Runtime with manual transpose but without Adrian's PR is 11.04ms. After I apply both manual transpose and Adrian's PR (XLA_FLAGS=--xla_gpu_transpose_to_bitcast=false), runtime becomes 10.63ms. Clearly, Adrian's PR is helpful.

BTW, you can find my unit test HLO from here https://drive.google.com/drive/u/1/folders/1MRsPJqF7M-DoFRT3Gw8kYKHSoPAKPDUi

Unfortunately, Adrian told us that his PR cannot be merged because it's controversial and would restricts what they can do in compiler. To get better perf, I file this github issue to discuss what other possible solutions that we can have.

One suggested solution is to use a no-op custom call as "barrier" so that transposes do not get propagated. However, we still do not know how to do this easily without affecting performance. Other ideas that have been proposed is allowing a metadata annotation for the transpose.

We do not have a concrete solution yet, I hope we can nail down this problem by discussing here. Thanks a lot.

Mar 10 '23 23:03 xrennvidia

Hey, I could think of a number of workarounds:

Use the optimization barrier HLO to force the layout
Use a custom-call to represent a transpose which will force the layout change
Figure out why is the layout assignment not doing the right thing there

Mar 14 '23 10:03 cheshire

I am not familiar with this, but would it also act as a fusion blocker?
The custom-call representation will be a fusion blocker, so it is not the ideal solution.
Is possibly not feasible. Layout assignment is a greedy algorithm because we cannot afford the compilation time to do anything fancy.

Another idea that was floated around was to allow to add metadata to a transpose, so that during layout assignment we can treat it in a special way.

Mar 14 '23 11:03 akuegel

cc @jprabhas @AyanmoI @nouiz

Mar 14 '23 21:03 xrennvidia

Another idea that was floated around was to allow to add metadata to a transpose, so that during layout assignment we can treat it in a special way.

Relying on metadata for compiler optimizations is never a good idea though.

Mar 14 '23 21:03 AyanmoI

I think we need a no-optimization region mechanism. This will happens in other cases too. What do you think of that?

Mar 14 '23 22:03 nouiz

I think we need a no-optimization region mechanism

Makes sense, but the optimization barrier is meant to achieve that, right?

Mar 25 '23 11:03 cheshire

IMO using a no-optimization region may be an overkill because we do want other optimizations to kick in here , including fusion. All we want is the layout of this op to be preserved, so adding a n attribute or a metadata to indicate that the layout of the op must be preserved by layout assignment seems more like what we'd want.

Mar 27 '23 15:03 jprabhas

FYI: https://github.com/openxla/xla/commit/7bb072f7ee224488a0c743f68d5082e988b2addf added the possibility to specify an option to preserve the layout as part of metadata

Jul 10 '23 10:07 akuegel

xla xla copied to clipboard

Configure layout assignment for transposes

xla
xla copied to clipboard