heir icon indicating copy to clipboard operation
heir copied to clipboard

optimization: find a pass / implement one that eliminates unused memory writes

Open asraa opened this issue 1 year ago • 4 comments

An example of an unused memory write:

memref.store %c0, %alloc[0] : memref<1xi32>
memref.store %c1, %alloc[0] : memref<1xi32>
%v = memref.load %alloc[0] : memref<1xi32>

The first store is unused as it's overwritten after with no interleaving accesses or changes.

asraa avatar Feb 02 '24 16:02 asraa

I've been told that affine-scalrep can do this. Is the IR small enough to run affine-scalrep on?

j2kun avatar Feb 02 '24 18:02 j2kun

Hmm I just tried running affine-scalrep on it, but it didn't end up changing anything - however, i realized i can't use our passes to get the unused memory write. The issue is that the first memref.store is then affine.load'd so that it's value can be places into another memref - so until i can forward the memref.store through the affine.load, I won't be able to actually test this functionality. Gimme a day!

asraa avatar Feb 05 '24 17:02 asraa

Fixed that - now I realize the issue is that the first store is in a separated region, so the pass analysis for affine-scalrep doesn't do anything. Here's a reproducer modified from hello-world

module attributes {tf_saved_model.semantics} {
  memref.global "private" constant @__constant_1x16xi8 : memref<1x16xi8> = dense<[[-39, 59, 39, 21, 28, -32, -34, -35, 15, 27, -59, -41, 18, -35, -7, 127]]> {alignment = 64 : i64}
  memref.global "private" constant @__constant_16xi32_0 : memref<16xi32> = dense<[-729, 1954, 610, 0, 241, -471, -35, -867, 571, 581, 4260, 3943, 591, 0, -889, -5103]> {alignment = 64 : i64}
  memref.global "private" constant @__constant_16x16xi8 : memref<16x16xi8> = dense<"0xF41AED091921F424E021EFBCF7F5FA1903DCD20206F9F402FFFAEFF1EFD327E1FB27DDEBDBE4051A17FC241215EF1EE410FE14DA1CF8F3F1EFE2F309E3E9EDE3E415070B041B1AFEEB01DE21E60BEC03230A22241E2703E60324FFC011F8FCF1110CF5E0F30717E5E8EDFADCE823FB07DDFBFD0014261117E7F111EA0226040425211D0ADB1DDC2001FAE3370BF11A16EF1CE703E01602032118092ED9E5140BEA1AFCD81300C4D8ECD9FE0D1920D8D6E21FE9D7CAE2DDC613E7043E000114C7DBE71515F506D61ADC0922FE080213EF191EE209FDF314DDDA20D90FE3F9F7EEE924E629000716E21E0D23D3DDF714FA0822262109080F0BE012F47FDC58E526"> {alignment = 64 : i64}
  memref.global "private" constant @__constant_16xi32 : memref<16xi32> = dense<[0, 0, -5438, -5515, -1352, -1500, -4152, -84, 3396, 0, 1981, -5581, 0, -6964, 3407, -7217]> {alignment = 64 : i64}
  memref.global "private" constant @__constant_16x1xi8 : memref<16x1xi8> = dense<[[-9], [-54], [57], [71], [104], [115], [98], [99], [64], [-26], [127], [25], [-82], [68], [95], [86]]> {alignment = 64 : i64}
  func.func @main(%arg0: !tfhe_rust.server_key {iree.identifier = "serving_default_dense_input:0", tf_saved_model.index_path = ["dense_input"]}, %arg1: memref<1x1x8x!tfhe_rust.eui3>) -> (memref<1x!tfhe_rust.eui3> {iree.identifier = "StatefulPartitionedCall:0", tf_saved_model.index_path = ["dense_2"]}) attributes {tf_saved_model.exported_names = ["serving_default"]} {
    %c1_i32 = arith.constant 1 : i32
    %c0 = arith.constant 0 : index
    %0 = memref.get_global @__constant_16xi32 : memref<16xi32>
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<1x16x32x!tfhe_rust.eui3>
    affine.for %arg2 = 0 to 16 {
      %1 = affine.load %0[%arg2] : memref<16xi32>
      affine.for %arg3 = 0 to 32 {
        %2 = arith.index_cast %arg3 : index to i32
        %3 = arith.shli %c1_i32, %2 : i32
        %4 = arith.andi %1, %3 : i32
        %5 = arith.shrsi %4, %2 : i32
        %6 = arith.trunci %5 : i32 to i1
        %7 = tfhe_rust.create_trivial %arg0, %6 : (!tfhe_rust.server_key, i1) -> !tfhe_rust.eui3
        affine.store %7, %alloc[%c0, %arg2, %arg3] : memref<1x16x32x!tfhe_rust.eui3>
      }
    }
    affine.for %arg2 = 0 to 16 {
      %1 = affine.load %0[%arg2] : memref<16xi32>
      affine.for %arg3 = 0 to 32 {
        %2 = arith.index_cast %arg3 : index to i32
        %3 = arith.shli %c1_i32, %2 : i32
        %4 = arith.andi %1, %3 : i32
        %5 = arith.shrsi %4, %2 : i32
        %6 = arith.trunci %5 : i32 to i1
        %7 = tfhe_rust.create_trivial %arg0, %6 : (!tfhe_rust.server_key, i1) -> !tfhe_rust.eui3
        affine.store %7, %alloc[%c0, %arg2, %arg3] : memref<1x16x32x!tfhe_rust.eui3>
      }
    }
    %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<1x!tfhe_rust.eui3>
    affine.for %arg2 = 0 to 16 {
      %1 = affine.load %alloc[%c0, %arg2, %c0] : memref<1x16x32x!tfhe_rust.eui3>
      affine.store %1, %alloc_0[%c0] : memref<1x!tfhe_rust.eui3>
    }
    return %alloc_0 : memref<1x!tfhe_rust.eui3>
  }
}

asraa avatar Feb 05 '24 20:02 asraa

It does work when the loops are unrolled so the stores are in the same region:

  func.func @main() -> (memref<1x1xi32>) {
    %c1_i32 = arith.constant 1 : i32
    %c2_i32 = arith.constant 2 : i32
    %c0 = arith.constant 0 : index
    %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<1x1xi32>
    affine.store %c1_i32, %alloc_0[%c0, %c0] : memref<1x1xi32>
    affine.store %c2_i32, %alloc_0[%c0, %c0] : memref<1x1xi32>
    return %alloc_0 : memref<1x1xi32>
  }

asraa avatar Feb 05 '24 20:02 asraa