[LLVMGPU] Don't create copies when subviews are equivalent
The LLVMGPU bufferization copy function inserts barriers around shared memory copies. Some of these copies are between equivalent subviews, so the copy itself gets folded away by canonicalization, but in LLVMGPU the surrounding barriers do not go away with it. This PR prevents the copy (and therefore the barriers) from being created in the first place for such cases.
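For context, the copy callback used during LLVMGPU bufferization wraps workgroup-memory copies in `gpu.barrier` ops. Below is a minimal sketch of the shape of the fix, assuming a callback with MLIR's `BufferizationOptions::MemCpyFn` signature; the helper name `areEquivalentSubviews` is hypothetical, and this is not the actual IREE implementation:

```cpp
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/IR/Builders.h"

using namespace mlir;

// Hypothetical helper; one possible implementation is sketched later in this
// thread.
bool areEquivalentSubviews(Value from, Value to);

// Copy callback in the spirit of the LLVMGPU workgroup-memory copyFn: it
// surrounds the copy with gpu.barrier ops. When source and destination are
// the same buffer (or equivalent subviews of it), the memref.copy would later
// fold away while the barriers would remain, so emit nothing at all.
static LogicalResult gpuWorkgroupCopyFn(OpBuilder &builder, Location loc,
                                        Value from, Value to) {
  if (from == to || areEquivalentSubviews(from, to))
    return success();
  builder.create<gpu::BarrierOp>(loc);
  builder.create<memref::CopyOp>(loc, from, to);
  builder.create<gpu::BarrierOp>(loc);
  return success();
}
```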
This seems like something bufferization should be able to figure out by itself. Seems like something else is going wrong?
This comes from bufferization of tensor.insert_slice:
IR after bufferizing: vector.transfer_write
```mlir
func.func @no_copy_on_equal_subviews(%arg0: index) {
  %c0 = arith.constant 0 : index
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer>, #hal.pipeline.binding<storage_buffer>]>) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : memref<1x2x32xf32, #hal.descriptor_type<storage_buffer>>
  memref.assume_alignment %0, 64 : memref<1x2x32xf32, #hal.descriptor_type<storage_buffer>>
  %1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer>, #hal.pipeline.binding<storage_buffer>]>) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1x2x32xf32>>
  %2 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer>, #hal.pipeline.binding<storage_buffer>]>) binding(1) alignment(64) offset(%c0) : memref<1x2x32xf32, #hal.descriptor_type<storage_buffer>>
  memref.assume_alignment %2, 64 : memref<1x2x32xf32, #hal.descriptor_type<storage_buffer>>
  %3 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer>, #hal.pipeline.binding<storage_buffer>]>) binding(1) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<1x2x32xf32>>
  %4 = bufferization.to_tensor %2 : memref<1x2x32xf32, #hal.descriptor_type<storage_buffer>>
  %5 = bufferization.to_tensor %0 : memref<1x2x32xf32, #hal.descriptor_type<storage_buffer>>
  %alloc = memref.alloc() : memref<1x2x32xf32, #gpu.address_space<workgroup>>
  %6 = bufferization.to_tensor %alloc : memref<1x2x32xf32, #gpu.address_space<workgroup>>
  %7 = scf.forall (%arg1, %arg2) in (2, 16) shared_outs(%arg3 = %4) -> (tensor<1x2x32xf32>) {
    %8 = affine.apply affine_map<(d0) -> (d0 * 2)>(%arg2)
    %subview = memref.subview %alloc[0, %arg1, %8] [1, 1, 2] [1, 1, 1] : memref<1x2x32xf32, #gpu.address_space<workgroup>> to memref<1x1x2xf32, strided<[64, 32, 1], offset: ?>, #gpu.address_space<workgroup>>
    %9 = bufferization.to_tensor %subview : memref<1x1x2xf32, strided<[64, 32, 1], offset: ?>, #gpu.address_space<workgroup>>
    %10 = vector.transfer_read %0[%c0, %arg1, %8], %cst {in_bounds = [true]} : memref<1x2x32xf32, #hal.descriptor_type<storage_buffer>>, vector<2xf32>
    vector.transfer_write %10, %subview[%c0, %arg1, %8] {in_bounds = [true]} : vector<2xf32>, memref<1x1x2xf32, strided<[64, 32, 1], offset: ?>, #gpu.address_space<workgroup>>
    %11 = bufferization.to_tensor %subview : memref<1x1x2xf32, strided<[64, 32, 1], offset: ?>, #gpu.address_space<workgroup>>
    // Bufferizing %inserted_slice next
    %inserted_slice = tensor.insert_slice %11 into %6[0, %arg1, %8] [1, 1, 2] [1, 1, 1] : tensor<1x1x2xf32> into tensor<1x2x32xf32>
    %extracted_slice = tensor.extract_slice %inserted_slice[%c0, %arg1, %8] [1, 1, 2] [1, 1, 1] {in_bounds = [true]} : tensor<1x2x32xf32> to tensor<1x1x2xf32>
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %extracted_slice into %arg3[0, %arg1, %8] [1, 1, 2] [1, 1, 1] : tensor<1x1x2xf32> into tensor<1x2x32xf32>
    }
  } {mapping = [#gpu.thread<y>, #gpu.thread<x>]}
  flow.dispatch.tensor.store %7, %3, offsets = [%c0, %c0, %c0], sizes = [1, 2, 32], strides = [1, 1, 1] : tensor<1x2x32xf32> -> !flow.dispatch.tensor<writeonly:tensor<1x2x32xf32>>
  return
}
```
IR after bufferizing: tensor.insert_slice
```mlir
func.func @no_copy_on_equal_subviews(%arg0: index) {
  %c0 = arith.constant 0 : index
  %cst = arith.constant 0.000000e+00 : f32
  %0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer>, #hal.pipeline.binding<storage_buffer>]>) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : memref<1x2x32xf32, #hal.descriptor_type<storage_buffer>>
  memref.assume_alignment %0, 64 : memref<1x2x32xf32, #hal.descriptor_type<storage_buffer>>
  %1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer>, #hal.pipeline.binding<storage_buffer>]>) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1x2x32xf32>>
  %2 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer>, #hal.pipeline.binding<storage_buffer>]>) binding(1) alignment(64) offset(%c0) : memref<1x2x32xf32, #hal.descriptor_type<storage_buffer>>
  memref.assume_alignment %2, 64 : memref<1x2x32xf32, #hal.descriptor_type<storage_buffer>>
  %3 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer>, #hal.pipeline.binding<storage_buffer>]>) binding(1) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<1x2x32xf32>>
  %4 = bufferization.to_tensor %2 : memref<1x2x32xf32, #hal.descriptor_type<storage_buffer>>
  %5 = bufferization.to_tensor %0 : memref<1x2x32xf32, #hal.descriptor_type<storage_buffer>>
  %alloc = memref.alloc() : memref<1x2x32xf32, #gpu.address_space<workgroup>>
  %6 = bufferization.to_tensor %alloc : memref<1x2x32xf32, #gpu.address_space<workgroup>>
  %7 = scf.forall (%arg1, %arg2) in (2, 16) shared_outs(%arg3 = %4) -> (tensor<1x2x32xf32>) {
    %8 = affine.apply affine_map<(d0) -> (d0 * 2)>(%arg2)
    %subview = memref.subview %alloc[0, %arg1, %8] [1, 1, 2] [1, 1, 1] : memref<1x2x32xf32, #gpu.address_space<workgroup>> to memref<1x1x2xf32, strided<[64, 32, 1], offset: ?>, #gpu.address_space<workgroup>>
    %9 = bufferization.to_tensor %subview : memref<1x1x2xf32, strided<[64, 32, 1], offset: ?>, #gpu.address_space<workgroup>>
    %10 = vector.transfer_read %0[%c0, %arg1, %8], %cst {in_bounds = [true]} : memref<1x2x32xf32, #hal.descriptor_type<storage_buffer>>, vector<2xf32>
    vector.transfer_write %10, %subview[%c0, %arg1, %8] {in_bounds = [true]} : vector<2xf32>, memref<1x1x2xf32, strided<[64, 32, 1], offset: ?>, #gpu.address_space<workgroup>>
    %11 = bufferization.to_tensor %subview : memref<1x1x2xf32, strided<[64, 32, 1], offset: ?>, #gpu.address_space<workgroup>>
    // Bufferized insert_slice:
    %subview_0 = memref.subview %alloc[0, %arg1, %8] [1, 1, 2] [1, 1, 1] : memref<1x2x32xf32, #gpu.address_space<workgroup>> to memref<1x1x2xf32, strided<[64, 32, 1], offset: ?>, #gpu.address_space<workgroup>>
    gpu.barrier
    memref.copy %subview, %subview_0 {__internal_linalg_transform__ = "copy_to_workgroup_memory"} : memref<1x1x2xf32, strided<[64, 32, 1], offset: ?>, #gpu.address_space<workgroup>> to memref<1x1x2xf32, strided<[64, 32, 1], offset: ?>, #gpu.address_space<workgroup>>
    gpu.barrier
    // ---
    %12 = bufferization.to_tensor %alloc : memref<1x2x32xf32, #gpu.address_space<workgroup>>
    %extracted_slice = tensor.extract_slice %12[%c0, %arg1, %8] [1, 1, 2] [1, 1, 1] {in_bounds = [true]} : tensor<1x2x32xf32> to tensor<1x1x2xf32>
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %extracted_slice into %arg3[0, %arg1, %8] [1, 1, 2] [1, 1, 1] : tensor<1x1x2xf32> into tensor<1x2x32xf32>
    }
  } {mapping = [#gpu.thread<y>, #gpu.thread<x>]}
  flow.dispatch.tensor.store %7, %3, offsets = [%c0, %c0, %c0], sizes = [1, 2, 32], strides = [1, 1, 1] : tensor<1x2x32xf32> -> !flow.dispatch.tensor<writeonly:tensor<1x2x32xf32>>
  return
}
```
Since bufferization ops are used to mix tensor types and memref types, bufferizing the insert_slice creates a different (but equivalent) subview for the destination than the one passed as the source buffer. It then calls the bufferization copyFn to copy the source buffer into the destination buffer, which is what it is supposed to do.
We could add this equivalent-subviews check in the insert_slice bufferization implementation, but I think it actually makes more sense to do it in the copyFn, since it will catch other similar cases. Alternatively, we could add it to the `BufferizationOptions::createMemCpy` function, which would apply across all uses of the bufferization interface, regardless of the copyFn passed in the bufferization options. What do you think @MaheshRavishankar?
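To make the idea concrete, here is a sketch of what the equivalence check could look like, wherever it ends up living (the copyFn or `createMemCpy`). It compares the defining `memref.subview` ops directly and is an illustration under that assumption, not the code in this PR:

```cpp
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/IR/Value.h"

using namespace mlir;

// Returns true if `from` and `to` are trivially the same buffer: either the
// same SSA value, or two memref.subview ops with identical offsets, sizes,
// and strides taken from the same source memref. Anything that cannot be
// proven equal is treated as distinct, so a copy is still emitted in the
// unclear cases.
bool areEquivalentSubviews(Value from, Value to) {
  if (from == to)
    return true;
  auto fromSubview = from.getDefiningOp<memref::SubViewOp>();
  auto toSubview = to.getDefiningOp<memref::SubViewOp>();
  if (!fromSubview || !toSubview ||
      fromSubview.getSource() != toSubview.getSource())
    return false;
  // OpFoldResult equality means "same SSA value or same constant attribute",
  // which is conservative but matches the pattern above, where both subviews
  // are built from the exact same index values.
  return fromSubview.getMixedOffsets() == toSubview.getMixedOffsets() &&
         fromSubview.getMixedSizes() == toSubview.getMixedSizes() &&
         fromSubview.getMixedStrides() == toSubview.getMixedStrides();
}
```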
Couple of questions...
- Could we fold the insert_slices before bufferization? Seems like we should be able to, especially since it's a very strict use pattern.
- I see an `in_bounds = true` on the insert slice. All insert slices are in_bounds. So anything which is creating an `in_bounds = false` is already in invalid IR territory.
> 1. Could we fold the insert_slices before bufferization? Seems like we should be able to, especially since it's a very strict use pattern.
The way this usually shows up is in the TileAndFuse pipeline after scf.forall fusion. Tensor semantics require the insert_slice to be there, so that there is a full tensor to extract from. I don't see any way to fold this away before bufferization.
> 2. I see an `in_bounds = true` on the insert slice. All insert slices are in_bounds. So anything which is creating an `in_bounds = false` is already in invalid IR territory.
I'm not sure what this is getting at. I don't see any in_bounds attribute on the insert_slice?
> 1. Could we fold the insert_slices before bufferization? Seems like we should be able to, especially since it's a very strict use pattern.
>
> The way this usually shows up is in the TileAndFuse pipeline after scf.forall fusion. Tensor semantics require the insert_slice to be there, so that there is a full tensor to extract from. I don't see any way to fold this away before bufferization.
I think this is the bug. I have seen similar bugs in other parts of the compilation stack too; this shows a bit of an over-reliance on the exact shape of extract_slices to connect everything. That is going to be hard to support.
> 2. I see an `in_bounds = true` on the insert slice. All insert slices are in_bounds. So anything which is creating an `in_bounds = false` is already in invalid IR territory.
>
> I'm not sure what this is getting at. I don't see any in_bounds attribute on the insert_slice?
Sorry, I meant the in_bounds on extract_slices. All extract_slices are necessarily in-bounds as well.
closing because of https://github.com/iree-org/iree/pull/18444