
Getting "One or more operations with large vector sizes were found" error from linalg_ext.fft

pstarkcdpr opened this issue

What happened?

rfft_repro.zip

I have a simple repro case where linalg_ext.fft causes the following compile error:

rfft_repro.mlir:25:12: error: One or more operations with large vector sizes (32768 bytes) were found:

See rfft_repro.mlir.

I tried tweaking the code a little bit and found that changing %cst_5 from arith.constant to tensor.empty makes the compile error go away for some reason (I believe passing in a tensor argument instead of using arith.constant also achieves this). See rfft_repro1.mlir for this case.

While trying to make an even smaller repro, I found a 10-line file that compiled (see rfft_repro2.mlir, also quoted below):

module @rfft {
  util.func public @main(%arg0: tensor<128x153x8xf32>, %arg1: tensor<128x153x8xf32>) -> (tensor<128x153x8xf32>, tensor<128x153x8xf32>) {
    %cst1 = arith.constant 1 : index
    %cst_3 = arith.constant dense<0.000000e+00> : tensor<1xf32>
    %cst_4 = arith.constant dense<1.000000e+00> : tensor<1xf32>
    %0:2 = iree_linalg_ext.fft ins(%cst1, %cst_3, %cst_4: index, tensor<1xf32>, tensor<1xf32>) outs(%arg0, %arg1: tensor<128x153x8xf32>, tensor<128x153x8xf32>) : tensor<128x153x8xf32>, tensor<128x153x8xf32>
    util.return %0#0, %0#1 : tensor<128x153x8xf32>, tensor<128x153x8xf32>
  }
}

but it seemed brittle; small perturbations caused issues. I'm not sure whether these are related to the main issue, but returning only a single tensor %0#0 caused an `op exceeded stack allocation limit` error (see rfft_repro3.mlir), and skipping the second and third parameters to fft (they are supposed to be optional) caused a compiler crash (see rfft_repro4.mlir).

I'm primarily concerned with the first case, but thought that I'd share some of the investigation I did in case it's useful.

The above tests were run on 3.8.0, although I originally saw the issue in 3.5.0 (the version we primarily use), and also tried 3.6.0.

Steps to reproduce your issue

iree-compile --iree-hal-target-device=local --iree-hal-local-target-device-backends=llvm-cpu --iree-llvmcpu-target-cpu=host -o /dev/null rfft_repro.mlir

What component(s) does this issue relate to?

Compiler

Version information

v3.8.0 (also tested 3.6.0 and 3.5.0)

Additional context

No response

pstarkcdpr avatar Oct 30 '25 05:10 pstarkcdpr

I don't know what the actual setup is, but passing input arguments to outs is usually problematic because they are readonly. We can fix it like https://github.com/iree-org/iree/pull/21651, but that PR only covered linalg ops. A PR is welcome.

hanhanW avatar Oct 30 '25 06:10 hanhanW

I wondered about that, but I'm not very familiar with MLIR files. Do you know what good linalg code would look like here (sorry if that's in the linked PR -- it's late here so I haven't had time to look)? This originally came from the following StableHLO:

module @jit_rfft {
  func.func public @main(%arg0: tensor<128x153x8xf32>) -> (tensor<128x153x5xcomplex<f32>>) {
    %0 = stablehlo.fft %arg0, type =  RFFT, length = [8] : (tensor<128x153x8xf32>) -> tensor<128x153x5xcomplex<f32>>
    return %0 : tensor<128x153x5xcomplex<f32>>
  }
}

Once I know what the resulting linalg should look like, I think that I can modify the StableHLO lowering to generate that. I'm reasonably familiar with that piece of code because I recently implemented a bunch of missing StableHLO FFT operations locally.
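As a sanity check on the shapes (a numpy illustration, not IREE code): an RFFT over a last axis of length 8 produces 8 // 2 + 1 = 5 complex bins, which is exactly the tensor<128x153x5xcomplex<f32>> result type in the StableHLO above.

```python
import numpy as np

# RFFT over the last axis of a (128, 153, 8) real tensor yields
# 8 // 2 + 1 = 5 complex outputs per row, since the remaining bins
# of a real-input FFT are redundant (conjugate-symmetric).
x = np.random.default_rng(0).standard_normal((128, 153, 8)).astype(np.float32)
y = np.fft.rfft(x, axis=-1)
assert y.shape == (128, 153, 5)

# The 5 retained bins match the first 5 bins of the full complex FFT.
full = np.fft.fft(x, axis=-1)
assert np.allclose(full[..., :5], y, atol=1e-4)
```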

pstarkcdpr avatar Oct 30 '25 06:10 pstarkcdpr

You can try using tensor.empty() in outs; those will result in buffer allocations on the host side, and they should be either writeonly or readwrite tensors. The fft support was implemented a long time ago by me, and I won't be surprised if something is not working today.

hanhanW avatar Oct 30 '25 06:10 hanhanW

Oh. I wondered why the input tensors to linalg_ext.fft were in outs. I guess it's an in-place FFT. I modified the StableHLO lowering code to use a new tensor initialized with zeroes instead of an arith.constant and that fixed this particular issue. I'm still getting some alloca issues deeper in the compilation in some cases, but I'll dig into that and see if it's a different issue.
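For readers following along: a classic iterative radix-2 FFT updates its real and imaginary buffers in place at every stage, which would explain why both tensors sit in outs. A minimal Python sketch of that pattern (a textbook Cooley-Tukey implementation, not IREE's actual lowering):

```python
import math

def fft_inplace(re, im):
    """In-place iterative radix-2 DIT FFT over separate real/imag buffers."""
    n = len(re)
    # Bit-reversal permutation of both buffers.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            re[i], re[j] = re[j], re[i]
            im[i], im[j] = im[j], im[i]
    # log2(n) butterfly stages; each stage reads AND writes both buffers.
    length = 2
    while length <= n:
        ang = -2.0 * math.pi / length
        for start in range(0, n, length):
            for k in range(length // 2):
                wr, wi = math.cos(ang * k), math.sin(ang * k)
                a, b = start + k, start + k + length // 2
                tr = re[b] * wr - im[b] * wi   # w * x[b], real part
                ti = re[b] * wi + im[b] * wr   # w * x[b], imag part
                re[b], im[b] = re[a] - tr, im[a] - ti
                re[a], im[a] = re[a] + tr, im[a] + ti
        length *= 2
```

Because every stage reads and overwrites both buffers, neither can be a readonly input, matching the destination-passing `outs` style.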

pstarkcdpr avatar Oct 30 '25 23:10 pstarkcdpr

Hi @hanhanW. I've been looking into this more. I worked around the problem with inputs; now I just have a problem with outputs. I mentioned this above, but the trouble happens when we don't use all the outputs of iree_linalg_ext.fft. This op returns 2 tensors (real and imaginary). I only want the real part, so the imaginary part just gets ignored. For example:

    %5:2 = iree_linalg_ext.fft ins(%c1, %cst_8, %cst_7 : index, tensor<1xf32>, tensor<1xf32>) outs(%4#0, %4#1 : tensor<128x153x32xf32>, tensor<128x153x32xf32>) : tensor<128x153x32xf32>, tensor<128x153x32xf32>
    %6:2 = iree_linalg_ext.fft ins(%c2, %cst_6, %cst_5 : index, tensor<2xf32>, tensor<2xf32>) outs(%5#0, %5#1 : tensor<128x153x32xf32>, tensor<128x153x32xf32>) : tensor<128x153x32xf32>, tensor<128x153x32xf32>
    %7:2 = iree_linalg_ext.fft ins(%c3, %cst_4, %cst_3 : index, tensor<4xf32>, tensor<4xf32>) outs(%6#0, %6#1 : tensor<128x153x32xf32>, tensor<128x153x32xf32>) : tensor<128x153x32xf32>, tensor<128x153x32xf32>
    %8:2 = iree_linalg_ext.fft ins(%c4, %cst_2, %cst_1 : index, tensor<8xf32>, tensor<8xf32>) outs(%7#0, %7#1 : tensor<128x153x32xf32>, tensor<128x153x32xf32>) : tensor<128x153x32xf32>, tensor<128x153x32xf32>
    %9:2 = iree_linalg_ext.fft ins(%c5, %cst_0, %cst : index, tensor<16xf32>, tensor<16xf32>) outs(%8#0, %8#1 : tensor<128x153x32xf32>, tensor<128x153x32xf32>) : tensor<128x153x32xf32>, tensor<128x153x32xf32>
    util.return %9#0 : tensor<128x153x32xf32>

Notice that %9#1 is not referenced. This causes the error `'func.func' op exceeded stack allocation limit of 32768 bytes for function`.
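For scale (plain arithmetic, not IREE internals): a single 128x153x32 f32 tensor is about 2.5 MB, so if the dropped imaginary result ever needs a private stack-allocated buffer, it is far past the 32768-byte limit in the error.

```python
elems = 128 * 153 * 32   # elements in one output tensor
size_bytes = elems * 4   # f32 is 4 bytes
assert size_bytes == 2506752
assert size_bytes > 32768  # roughly 76x over the stack allocation limit
```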

I can see the reason by looking at stage 9 (executable-configurations). The first few calls to iree_linalg_ext.fft look like:

%0 = hal.interface.binding.subspan layout(#pipeline_layout2) binding(0) alignment(64) offset(%c0) flags(Indirect) : !iree_tensor_ext.dispatch.tensor<readwrite:tensor<128x153x32xf32>>
%1 = hal.interface.binding.subspan layout(#pipeline_layout2) binding(1) alignment(64) offset(%c5013504) flags(Indirect) : !iree_tensor_ext.dispatch.tensor<readwrite:tensor<128x153x32xf32>>
%2 = iree_tensor_ext.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [128, 153, 32], strides = [1, 1, 1] : !iree_tensor_ext.dispatch.tensor<readwrite:tensor<128x153x32xf32>> -> tensor<128x153x32xf32>
%3 = iree_tensor_ext.dispatch.tensor.load %1, offsets = [0, 0, 0], sizes = [128, 153, 32], strides = [1, 1, 1] : !iree_tensor_ext.dispatch.tensor<readwrite:tensor<128x153x32xf32>> -> tensor<128x153x32xf32>
%4:2 = iree_linalg_ext.fft {lowering_config = #config4} ins(%c4, %cst, %cst_0 : index, tensor<8xf32>, tensor<8xf32>) outs(%2, %3 : tensor<128x153x32xf32>, tensor<128x153x32xf32>) : tensor<128x153x32xf32>, tensor<128x153x32xf32>
iree_tensor_ext.dispatch.tensor.store %4#0, %0, offsets = [0, 0, 0], sizes = [128, 153, 32], strides = [1, 1, 1] : tensor<128x153x32xf32> -> !iree_tensor_ext.dispatch.tensor<readwrite:tensor<128x153x32xf32>>
iree_tensor_ext.dispatch.tensor.store %4#1, %1, offsets = [0, 0, 0], sizes = [128, 153, 32], strides = [1, 1, 1] : tensor<128x153x32xf32> -> !iree_tensor_ext.dispatch.tensor<readwrite:tensor<128x153x32xf32>>
return

Notice that the output tensors are readwrite and there are two tensor.stores at the end.

The final call to iree_linalg_ext.fft, however, looks like:

%0 = hal.interface.binding.subspan layout(#pipeline_layout3) binding(0) alignment(64) offset(%c0) flags(Indirect) : !iree_tensor_ext.dispatch.tensor<readwrite:tensor<128x153x32xf32>>
%1 = hal.interface.binding.subspan layout(#pipeline_layout3) binding(1) alignment(64) offset(%c5013504) flags("ReadOnly|Indirect") : !iree_tensor_ext.dispatch.tensor<readonly:tensor<128x153x32xf32>>
%2 = iree_tensor_ext.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [128, 153, 32], strides = [1, 1, 1] : !iree_tensor_ext.dispatch.tensor<readwrite:tensor<128x153x32xf32>> -> tensor<128x153x32xf32>
%3 = iree_tensor_ext.dispatch.tensor.load %1, offsets = [0, 0, 0], sizes = [128, 153, 32], strides = [1, 1, 1] : !iree_tensor_ext.dispatch.tensor<readonly:tensor<128x153x32xf32>> -> tensor<128x153x32xf32>
%4:2 = iree_linalg_ext.fft {lowering_config = #config4} ins(%c5, %cst, %cst_0 : index, tensor<16xf32>, tensor<16xf32>) outs(%2, %3 : tensor<128x153x32xf32>, tensor<128x153x32xf32>) : tensor<128x153x32xf32>, tensor<128x153x32xf32>
iree_tensor_ext.dispatch.tensor.store %4#0, %0, offsets = [0, 0, 0], sizes = [128, 153, 32], strides = [1, 1, 1] : tensor<128x153x32xf32> -> !iree_tensor_ext.dispatch.tensor<readwrite:tensor<128x153x32xf32>>
return

Notice that there's only one tensor.store and %1 is readonly. If I manually fix this, then everything works.

If I use the imaginary tensor in some non-trivial way, that works around the issue, but I'd prefer not to have hacks in our code if possible. Do you know the right way to ensure that the tensors passed into outs of iree_linalg_ext.fft remain readwrite even if they are not used further downstream?

pstarkcdpr avatar Nov 01 '25 00:11 pstarkcdpr

I have a workaround for this issue. The problem shows up particularly with IRFFT, since linalg_ext.fft returns both real and imaginary tensors but the imaginary part is ignored (by definition). This leads to it being marked as readonly and compilation failing past stage 9 (executable-configurations). While probably not the best solution, why not just force the unused tensor to not be optimized away using util.optimization_barrier? The following linalg code compiles and runs correctly:

    %5:2 = iree_linalg_ext.fft ins(%c1, %cst_8, %cst_7 : index, tensor<1xf32>, tensor<1xf32>) outs(%4#0, %4#1 : tensor<128x153x32xf32>, tensor<128x153x32xf32>) : tensor<128x153x32xf32>, tensor<128x153x32xf32>
    %6:2 = iree_linalg_ext.fft ins(%c2, %cst_6, %cst_5 : index, tensor<2xf32>, tensor<2xf32>) outs(%5#0, %5#1 : tensor<128x153x32xf32>, tensor<128x153x32xf32>) : tensor<128x153x32xf32>, tensor<128x153x32xf32>
    %7:2 = iree_linalg_ext.fft ins(%c3, %cst_4, %cst_3 : index, tensor<4xf32>, tensor<4xf32>) outs(%6#0, %6#1 : tensor<128x153x32xf32>, tensor<128x153x32xf32>) : tensor<128x153x32xf32>, tensor<128x153x32xf32>
    %8:2 = iree_linalg_ext.fft ins(%c4, %cst_2, %cst_1 : index, tensor<8xf32>, tensor<8xf32>) outs(%7#0, %7#1 : tensor<128x153x32xf32>, tensor<128x153x32xf32>) : tensor<128x153x32xf32>, tensor<128x153x32xf32>
    %9:2 = iree_linalg_ext.fft ins(%c5, %cst_0, %cst : index, tensor<16xf32>, tensor<16xf32>) outs(%8#0, %8#1 : tensor<128x153x32xf32>, tensor<128x153x32xf32>) : tensor<128x153x32xf32>, tensor<128x153x32xf32>
    util.optimization_barrier %9#1: tensor<128x153x32xf32>
    util.return %9#0 : tensor<128x153x32xf32>

@MaheshRavishankar for visibility.

pstarkcdpr avatar Nov 26 '25 03:11 pstarkcdpr

Sorry that I missed the issue. I'm confused now because we don't have vectorization support for FFT. I suspect that it is because of the copy op that is generated by bufferization.

What error are you seeing? Is it large vector sizes or large allocation buffer error?

To handle the readonly issue, I'd suggest doing something like https://github.com/iree-org/iree/pull/21651 for the FFT op, which rewrites the op a bit in codegen so it takes tensor.empty() or linalg.fill as the init tensor. I'm not sure which one it should be, as I haven't worked on the fft op for a long time.

However, @MaheshRavishankar recently raised a concern: https://github.com/iree-org/iree/issues/19546#issuecomment-3565141576. I'll chat more with Mahesh after Thanksgiving, i.e., later next week.

hanhanW avatar Nov 26 '25 03:11 hanhanW

Sorry for the confusion @hanhanW. Thanks to your original comment, the original issue that I logged here is fixed pending a merge (see issue https://github.com/iree-org/iree/issues/22695). This fix exposed another, related problem. I see now that it's a different issue, so have created a new issue for it. Let's move the conversation over there: https://github.com/iree-org/iree/issues/22776

pstarkcdpr avatar Nov 26 '25 22:11 pstarkcdpr

This issue is fixed; see https://github.com/iree-org/iree/issues/22695

pstarkcdpr avatar Dec 05 '25 00:12 pstarkcdpr