about gemmini_config_ld and gemmini_config_st
I found two macros, gemmini_config_ld and gemmini_config_st, in chipyard/generators/gemmini/software/gemmini-rocc-tests/include/gemmini.h, but I don't know what they are for, and I don't see any description of them in the ISA section of the documentation. Is there any relevant documentation describing these two macros?
I now have some new problems. The code is here. Here are my questions.
- Does the setting of these addresses have a special meaning or a special purpose?
uint32_t A_acc_addr = 1 << (ADDR_LEN - 1);
uint32_t B_acc_addr = (1 << (ADDR_LEN - 1)) | (1 << (ADDR_LEN - 2));
uint32_t C_acc_addr = 1 << (ADDR_LEN - 1);
- How does Gemmini perform matrix operations? I only see mvin, mvout, and config operations here, but how exactly Gemmini works is not clear to me.
gemmini_config_ld is just config_mvin and gemmini_config_st is just config_mvout. (Sorry about the inconsistent naming).
Does the setting of the address have a special meaning or a special purpose?
The memory addressing scheme is described in more detail here. The address of A, B, and C are set to access the 32-bit accumulator (rather than the 8-bit parts of the scratchpad). The address of B is set to specify that B should be added on top of A, rather than overwriting A.
How does gemmini perform matrix operations?
We describe the matmul sequence here. Let me know if that isn't descriptive enough, though.
Thank you for your answer. I have understood many operations by studying other examples, but I still have a little doubt about this example. I think this example just transfers A to C and should not perform an addition, which is the bit I didn't understand.
I don't think meshRows in the Generator Parameters is clearly described, but I'm guessing it corresponds to tileRows.
Is there an example of the simple loop instructions here? I don't know much about deep learning. I would like to add the related support in LLVM.
I think this example just transfers A to C and should not perform an addition, which is the bit I didn't understand.
As we describe here, if the 30th bit of an accumulator address is 1, then we don't simply overwrite what was in the accumulator -- we add the new value on top of what was previously in the accumulator. So we initially move in A, and then accumulate B on top of A, and that gives us the final result: C.
Is there an example of the simple loop instructions here?
We use the loop instruction to perform matmuls here. The commented-out code next to it represents what the loop instruction is doing, if we implemented it using Gemmini's simpler mvin, mvout, preload, and compute commands.
Thanks!
I have set config_ld, config_st, and config_ex. Are there any conditions for using matmul.compute.accumulated rs1, rs2? The result differs when accumulated is used compared with compute_preloaded. Here is the key code in my example. The A, B, and D arrays are all 2x4x4, and the C array is 4x4.
***
call void @llvm.riscv.configLd(i64 4575657221409472769, i64 4) // config_ld
%call145 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.5)
call void @llvm.riscv.configSt(i64 2, i64 4575657221408423940) // config_st
%call146 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.6)
call void @llvm.riscv.configEx(i64 4575657221408489472, i64 281474976710656) // config_ex
%call147 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.7)
call void @llvm.riscv.mvin(i64 %47, i64 1125917086711808) // mvin a[0]
call void @llvm.riscv.mvin(i64 %48, i64 1125917086711812) // mvin a[1]
%call148 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.8)
call void @llvm.riscv.mvin(i64 %49, i64 1125917086711816) // mvin b[0]
call void @llvm.riscv.mvin(i64 %50, i64 1125917086711820) // mvin b[1]
%call149 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.9)
call void @llvm.riscv.mvin(i64 %51, i64 1125917086711824) // mvin d[0]
call void @llvm.riscv.mvin(i64 %52, i64 1125917086711828) // mvin d[1]
%call150 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.10)
call void @llvm.riscv.mvin(i64 %53, i64 1125917086711832) // mvin c
%call151 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.11)
call void @llvm.riscv.preload(i64 1125917086711824, i64 1125917086711832) // preload d[0] c
%call152 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.12)
call void @llvm.riscv.computeProloaded(i64 1125917086711808, i64 1125917086711816) // computeProloaded a[0] b[0]
%call153 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.13)
call void @llvm.riscv.mvout(i64 %53, i64 1125917086711832) // mvout
***
call void @llvm.riscv.preload(i64 1125917086711828, i64 1125917086711832) // preload d[1] c
%call179 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.16)
call void @llvm.riscv.computeAccumulated(i64 1125917086711812, i64 1125917086711820) // computeAccumulated a[1] b[1]
%call180 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.17)
call void @llvm.riscv.mvout(i64 %53, i64 1125917086711832) // mvout
***
I found that the final result of computeAccumulated does not add up with the result of C from computeProloaded. I'm not sure what the problem is. I have studied this example, but to be honest, I didn't get it. I don't understand the role of preload and preload_zeros.
if (!preload[c]) {
  gemmini_preload_zeros(out_addr);
  gemmini_compute_accumulated(A_addr + a*DIM, B_addr + b*DIM);
} else if (preload_zeros[c]) {
  gemmini_preload_zeros(out_addr);
  gemmini_compute_preloaded(A_addr + a*DIM, B_addr + b*DIM);
} else {
  gemmini_preload(D_addr + d*DIM, out_addr);
  gemmini_compute_preloaded(A_addr + a*DIM, B_addr + b*DIM);
}
}
I really hope you can help me answer this question. Thanks!
I have some questions about BANK_NUM, BANK_ROWS, and ACC_ROWS. You can check here. I didn't find any information about them. My guess is that they are fixed in Gemmini's hardware. Is there any source code for this part of Gemmini, and can this part be configured by myself?
static void tiled_matmul_auto(size_t dim_I, size_t dim_J, size_t dim_K,
const elem_t* A, const elem_t* B,
const void * D, void * C,
size_t stride_A, size_t stride_B, size_t stride_D, size_t stride_C,
scale_t A_scale_factor, scale_t B_scale_factor, scale_acc_t D_scale_factor,
int act, acc_scale_t scale, acc_scale_t bert_scale,
bool repeating_bias,
bool transpose_A, bool transpose_B,
bool full_C, bool low_D,
uint8_t weightA,
enum tiled_matmul_type_t tiled_matmul_type) {
#define partition_rows (BANK_NUM * BANK_ROWS / 2)
#define mats_in_partition (partition_rows / DIM)
#define mats_in_acc (ACC_ROWS / DIM)
#define max_tile_i_j ((size_t)sqrt(mats_in_acc))
#define max_tile_k (mats_in_partition / max_tile_i_j)
// "db_" means "double-buffered"
#define db_partition_rows ((BANK_NUM * BANK_ROWS / 2) / 2)
#define db_mats_in_partition (db_partition_rows / DIM)
#define db_mats_in_acc ((ACC_ROWS / 2) / DIM)
#define db_max_tile_i_j ((size_t)sqrt(db_mats_in_acc))
#define db_max_tile_k (db_mats_in_partition / db_max_tile_i_j)
I have a problem with config_ld. Why do you use three config_lds in a row there? Won't each config_ld overwrite the previous one?
I now think that the description in the documentation is really not sufficient, and the part that describes Gemmini's structure may need to be expanded. When I was learning the tiled_matmul_ws example, I didn't understand many parts of the function before it calls sp_tiled_matmul_ws, probably because I don't understand deep learning and image processing. I think you could further extend the hardware-structure documentation based on the tiled_matmul_ws.c example. I would also like to learn more details of Gemmini.
I don't understand the role of preload and preload_zeros.
The preload commands are used to preload either a bias or weights into the spatial array. These preloaded values will remain stationary in the spatial array while you either accumulate on top of them (in the OS case) or use them to perform matmuls (in the WS case).
preload_zeros is just a convenience function which preloads zeros into the spatial array. This can be useful if you're performing an OS matmul without a bias.
I have some questions about BANK_NUM, BANK_ROWS, ACC_ROWS.
These are set during hardware elaboration. They define how many scratchpad banks we have, how many SRAM rows there are in each scratchpad bank, and how many SRAM rows there are in the accumulator.
I have a problem with config_ld. Why do you use three config_lds in a row there? Won't each config_ld overwrite the previous one?
We have three different sets of config_ld parameters (which is why the last number in those commands differs). By default, programmers use set 0, but if they want to intersperse loads for multiple tensors (like inputs, weights, and biases), then they can set different strides or other load-parameters for each of the three tensors at once.
I now think that the description of the documentation is really not sufficient
Sorry about that; the documentation can always be improved. Have you checked out our paper or our tutorial? You might find them more useful than our README in some cases.
When I was learning the tiled_matmul_ws example, I didn't understand many parts of the function before it calls sp_tiled_matmul_ws.
We're not really expecting people to read through the code in gemmini.h to be honest -- some of it is pretty ugly and not that readable for outsiders. But if you have questions about specific lines in tiled_matmul_ws, I might be able to help.
To be honest, the code in gemmini.h has taught me a lot. I think the best way to learn is to study other people's code. I am learning deep-learning compilers, and I would like to understand Gemmini's ISA. Thank you very much for your help! I will ask you for advice if I have questions. Thanks!
I'd like to ask you some detailed questions. I'm not quite sure what shrunk, pixel_repeats, and id mean.
#define gemmini_extended5_config_ld(stride, scale, shrunk, block_mvin_stride, pixel_repeats, id) \
ROCC_INSTRUCTION_RS1_RS2(XCUSTOM_ACC, ((uint64_t)(scale_t_to_scale_t_bits(scale)) << 32) | ((uint64_t)(block_mvin_stride) << 16) | ((uint64_t)(pixel_repeats) << 8) | ((id) << 3) | ((shrunk) << 2) | CONFIG_LD, stride, k_CONFIG) \
printf("gemmini_config_ld %lu %lu\n",stride ,((uint64_t)(scale_t_to_scale_t_bits(scale)) << 32) | ((uint64_t)(block_mvin_stride) << 16) | ((uint64_t)(pixel_repeats) << 8) | ((id) << 3) | ((shrunk) << 2) | CONFIG_LD);
We're not really expecting people to read through the code in gemmini.h to be honest -- some of it is pretty ugly and not that readable for outsiders. But if you have questions about specific lines in tiled_matmul_ws, I might be able to help.
I do not understand the following code, nor have I found the relevant information. I have used MLIR to support the general instructions and am now supporting the loop instructions. Or you could send me some of the Gemmini implementation code; I am interested in Gemmini, although I am not very familiar with Chisel. Thanks! I hope you can give me some help.
#define partition_rows (BANK_NUM * BANK_ROWS / 2)
#define mats_in_partition (partition_rows / DIM)
#define mats_in_acc (ACC_ROWS / DIM)
#define max_tile_i_j ((size_t)sqrt(mats_in_acc))
#define max_tile_k (mats_in_partition / max_tile_i_j)
// "db_" means "double-buffered"
#define db_partition_rows ((BANK_NUM * BANK_ROWS / 2) / 2)
#define db_mats_in_partition (db_partition_rows / DIM)
#define db_mats_in_acc ((ACC_ROWS / 2) / DIM)
#define db_max_tile_i_j ((size_t)sqrt(db_mats_in_acc))
#define db_max_tile_k (db_mats_in_partition / db_max_tile_i_j)
That code is just trying to determine maximum tiling parameters based on the scratchpad size in Gemmini. It assumes that half the scratchpad will be allocated for matrix A, the other half for matrix B, and the accumulator will be used to store the matmul result (C).
If we're double-buffering, then only half the scratchpad and accumulator will be available for the tiles, since the other half will be used to compute another tile.
On an exciting note, I used MLIR to add some support for the Gemmini software part, and I am now trying to run a TFLite model.
Also, I am now studying your systolic DSL project. I expect to do some research on systolic arrays at some point in the future.
On an exciting note, I used MLIR to add some support for the Gemmini software part, and I am now trying to run a TFLite model.
That's great! I'm excited to see what you come up with.
Also, I am now studying your systolic DSL project. I expect to do some research on systolic arrays at some point in the future.
I hope you find it useful. It was just a small class project though, so that DSL has very few features :)
That's great! I'm excited to see what you come up with.
I don't know if you saw the email I sent you, but I have successfully run the tensorflow model using Gemmini.
Hi, I've been working on Gemmini since January of this year, and I've added support for Gemmini in MLIR. You can see some of the things I've done here. But I seem to be in trouble now. Here's what I want to say.
When Gemmini does conv operations, the format of the conv's input and kernel is fixed in tiled_conv_auto. Because my job is to support Gemmini in MLIR, and MLIR has many conv formats (you can see here), I need to convert to Gemmini's format before running the conv, which causes a big performance loss. Is there a way to eliminate this loss? In fact, can I modify tiled_conv_auto to eliminate it? I realize this may be difficult, but I'm wondering how difficult it is, or whether there is another way to do it.
Thanks!
Hmm, supporting newer formats, as you noticed, can be difficult. However, if the innermost dimension is the same, then changing the format is only a matter of changing the DRAM/L3/L2 access strides in tiled_conv_auto or in LoopConv.scala. If you do want to change the innermost dimension though, then that might end up being a more significant undertaking.
Thank you very much for your reply. I probably won't do any more research on it at the moment, though. I'd actually be happy to share my other current dilemma here: I'm graduating probably next year and looking for a job right now, but it's not easy to find a job as an AI compiler engineer.
I'm pondering whether to stick to this path in the future (even though building compilers is my dream) or go to grad school, which is not easy either way. (Honestly, we're kind of friends, so maybe I'd like to hear your opinion or some words of comfort.)
Haha, why don't you email me directly at hngenc [at] berkeley [dot] edu, and I can give you my personal thoughts about grad school vs industry
Haha, I've emailed you.