Getting "op exceeded stack allocation limit" from linalg_ext.fft
What happened?
The linalg_ext.fft operation produces two output tensors: the real part and the imaginary part of the result. If the tensors are large enough and one of the output tensors is not used later in the program, compilation fails with error: 'func.func' op exceeded stack allocation limit of 32768 bytes for function. This compile error is surprising because the exact same program works with smaller tensor sizes and then fails to compile once the tensors are large enough, even when the size of the 1D FFT stays the same.
For example, here is the last part of an IRFFT (which takes a complex input and returns a real output):
%5:2 = iree_linalg_ext.fft ins(%c1, %cst_8, %cst_7 : index, tensor<1xf32>, tensor<1xf32>) outs(%4#0, %4#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
%6:2 = iree_linalg_ext.fft ins(%c2, %cst_6, %cst_5 : index, tensor<2xf32>, tensor<2xf32>) outs(%5#0, %5#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
%7:2 = iree_linalg_ext.fft ins(%c3, %cst_4, %cst_3 : index, tensor<4xf32>, tensor<4xf32>) outs(%6#0, %6#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
%8:2 = iree_linalg_ext.fft ins(%c4, %cst_2, %cst_1 : index, tensor<8xf32>, tensor<8xf32>) outs(%7#0, %7#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
%9:2 = iree_linalg_ext.fft ins(%c5, %cst_0, %cst : index, tensor<16xf32>, tensor<16xf32>) outs(%8#0, %8#1 : tensor<32xf32>, tensor<32xf32>) : tensor<32xf32>, tensor<32xf32>
util.return %9#0 : tensor<32xf32>
Notice that it returns %9#0 and ignores %9#1 because the imaginary part is not needed. The problem is that, once the input sizes get large enough, this errors out with error: 'func.func' op exceeded stack allocation limit. The reason is that the imaginary part of that last FFT gets marked as readonly. Here is what the output tensors of that last FFT get turned into:
%0 = hal.interface.binding.subspan layout(#pipeline_layout3) binding(0) alignment(64) offset(%c0) flags(Indirect) : !iree_tensor_ext.dispatch.tensor<readwrite:tensor<32xf32>>
%1 = hal.interface.binding.subspan layout(#pipeline_layout3) binding(1) alignment(64) offset(%c256) flags("ReadOnly|Indirect") : !iree_tensor_ext.dispatch.tensor<readonly:tensor<32xf32>>
We've been working around this by using the imaginary output in some non-trivial way, essentially fooling the compiler into not marking it readonly. I discovered that inserting util.optimization_barrier %9#1 also works around the issue, as sketched below.
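Here is a minimal sketch of that workaround applied to the tail of the IR above; the %ignored name is hypothetical, and the only change is the added barrier line, whose result is intentionally left unused:
%ignored = util.optimization_barrier %9#1 : tensor<32xf32>
util.return %9#0 : tensor<32xf32>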
@MaheshRavishankar suggested in issue https://github.com/iree-org/iree/issues/22695 that the right fix would be to vectorize the FFT operation.
@hanhanW Moving the conversation here from https://github.com/iree-org/iree/issues/22473
Repro case is attached here: irfft.zip
Steps to reproduce your issue
iree-compile --iree-hal-target-device=local --iree-hal-local-target-device-backends=llvm-cpu --iree-llvmcpu-target-cpu=host -o /dev/null irfft.mlir
What component(s) does this issue relate to?
Compiler
Version information
3.9.0 (I also tried 3.8.0 and 3.5.0)
Additional context
No response
Thanks @pstarkcdpr. Is this essentially a tracking issue? I think I also suggested using a flag to allow any stack limit. If the stack limit is not an issue, then that can be a workaround until we fix the issue properly by vectorizing things.
Thanks @MaheshRavishankar. I wanted to break this out as a separate issue since we are actively running into it. I added util.optimization_barrier %9#1 to our local compiler build to work around the issue. That prevents the compiler from marking the tensor as readonly, and then things optimize properly. I haven't tried just suppressing the stack limit error. I imagine that would be worse than not needing a large stack allocation? I'll give it a try though.
If there's a task that supersedes this (like vectorizing linalg_ext.fft), then go ahead and close it. I just wanted to get it logged since it's an issue that we're working around with a local patch, and it would be nicer to be able to use vanilla IREE.
@MaheshRavishankar Sorry for the delay. I finally got around to checking if --iree-llvmcpu-fail-on-out-of-bounds-stack-allocation=false works around the issue. On Linux, it seems to. On Windows, however, the runtime segfaults.
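For reference, here is a sketch of the compile command with the flag appended (assuming the same repro as above, with the output name chosen to match the run command below):
iree-compile --iree-hal-target-device=local --iree-hal-local-target-device-backends=llvm-cpu --iree-llvmcpu-target-cpu=host --iree-llvmcpu-fail-on-out-of-bounds-stack-allocation=false -o irfft.local.vmfb irfft.mlir
Running the resulting module with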
iree-run-module --module=irfft.local.vmfb --function=main --device=local-sync --input="@input.npy"
results in
EXEC @main
Exception thrown at 0x000001DB2C982F6A in iree-run-module.exe: 0xC0000005: Access violation writing location 0x000000D98DAC84C0.
This is stock 3.9.0, so I don't have more symbols at present. Unfortunately, our primary use-case is on Windows. I can continue to use the optimization_barrier workaround, although that requires that we maintain a custom compiler build. In other words, this isn't blocking us, but I think it's a legitimate issue since the workaround requires a compiler patch.