about gemmini_config_ld and gemmini_config_st
I found two macros, gemmini_config_ld and gemmini_config_st, in chipyard/generators/gemmini/software/gemmini-rocc-tests/include/gemmini.h, but I don't know what they are for, and I don't see any description of them in the ISA section of the documentation. Is there any relevant documentation describing these two macros?
I now have some new problems. The code is here. Here are my questions.
- Does the setting of these addresses have a special meaning or a special purpose?
uint32_t A_acc_addr = 1 << (ADDR_LEN - 1);
uint32_t B_acc_addr = (1 << (ADDR_LEN - 1)) | (1 << (ADDR_LEN - 2));
uint32_t C_acc_addr = 1 << (ADDR_LEN - 1);
- How does Gemmini perform matrix operations? I only see mvin, mvout, and config operations here, but how exactly Gemmini works is not clear to me.
gemmini_config_ld is just config_mvin and gemmini_config_st is just config_mvout. (Sorry about the inconsistent naming).
Does the setting of the address have a special meaning or a special purpose?
The memory addressing scheme is described in more detail here. The address of A, B, and C are set to access the 32-bit accumulator (rather than the 8-bit parts of the scratchpad). The address of B is set to specify that B should be added on top of A, rather than overwriting A.
How does gemmini perform matrix operations?
We describe the matmul sequence here. Let me know if that isn't descriptive enough, though.
Thank you for your answer. I have understood many operations by studying other examples, but I still have a little doubt about this example. I think this example just transfers A to C and should not perform an addition, which is the bit I didn't understand.
I don't think meshRows in the Generator Parameters is clearly described, but I'm guessing it corresponds to tileRows.
Is there an example of the simple loop instructions here? I don't know much about deep learning. I would like to add the related support in LLVM.
I think this example just transfers A to C and should not perform an addition, which is the bit I didn't understand.
As we describe here, if the 30th bit of an accumulator address is 1, then we don't simply overwrite what was in the accumulator -- we add the new value on top of what was previously in the accumulator. So we initially move in A, and then accumulate B on top of A, and that gives us the final result: C.
Is there an example of the simple loop instructions here?
We use the loop instruction to perform matmuls here. The commented-out code next to it represents what the loop instruction is doing, if we implemented it using Gemmini's simpler mvin, mvout, preload, and compute commands.
Thanks!
I have set config_ld, config_st, and config_ex. Are there any conditions for using matmul.compute.accumulated rs1, rs2? The result differs when accumulated is used compared with compute_preloaded. Here is the key code in my example. The A, B, and D arrays are all 2x4x4, and the C array is 4x4.
***
call void @llvm.riscv.configLd(i64 4575657221409472769, i64 4) // config_ld
%call145 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.5)
call void @llvm.riscv.configSt(i64 2, i64 4575657221408423940) // config_st
%call146 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.6)
call void @llvm.riscv.configEx(i64 4575657221408489472, i64 281474976710656) // config_ex
%call147 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.7)
call void @llvm.riscv.mvin(i64 %47, i64 1125917086711808) // mvin a[0]
call void @llvm.riscv.mvin(i64 %48, i64 1125917086711812) // mvin a[1]
%call148 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.8)
call void @llvm.riscv.mvin(i64 %49, i64 1125917086711816) // mvin b[0]
call void @llvm.riscv.mvin(i64 %50, i64 1125917086711820) // mvin b[1]
%call149 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.9)
call void @llvm.riscv.mvin(i64 %51, i64 1125917086711824) // mvin d[0]
call void @llvm.riscv.mvin(i64 %52, i64 1125917086711828) // mvin d[1]
%call150 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.10)
call void @llvm.riscv.mvin(i64 %53, i64 1125917086711832) // mvin c
%call151 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.11)
call void @llvm.riscv.preload(i64 1125917086711824, i64 1125917086711832) // preload d[0] c
%call152 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.12)
call void @llvm.riscv.computeProloaded(i64 1125917086711808, i64 1125917086711816) // computeProloaded a[0] b[0]
%call153 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.13)
call void @llvm.riscv.mvout(i64 %53, i64 1125917086711832) // mvout
***
call void @llvm.riscv.preload(i64 1125917086711828, i64 1125917086711832) // preload d[1] c
%call179 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.16)
call void @llvm.riscv.computeAccumulated(i64 1125917086711812, i64 1125917086711820) // computeAccumulated a[1] b[1]
%call180 = call signext i32 (ptr, ...) @printf(ptr noundef @.str.17)
call void @llvm.riscv.mvout(i64 %53, i64 1125917086711832) // mvout
***
I found that the final result of computeAccumulated does not add up with the result of C from computeProloaded. I'm not sure what the problem is. I have studied this example, but to be honest, I didn't get it. I don't understand the role of preload and preload_zeros.
if (!preload[c]) {
  gemmini_preload_zeros(out_addr);
  gemmini_compute_accumulated(A_addr + a*DIM, B_addr + b*DIM);
} else if (preload_zeros[c]) {
  gemmini_preload_zeros(out_addr);
  gemmini_compute_preloaded(A_addr + a*DIM, B_addr + b*DIM);
} else {
  gemmini_preload(D_addr + d*DIM, out_addr);
  gemmini_compute_preloaded(A_addr + a*DIM, B_addr + b*DIM);
}
}
I really hope you can help me answer this question. Thanks!
I have some questions about BANK_NUM, BANK_ROWS, and ACC_ROWS. You can check here. I didn't find any information about them. My guess is that they are fixed in Gemmini's hardware. Is there any source code for this part of Gemmini, and can this part be configured by myself?
static void tiled_matmul_auto(size_t dim_I, size_t dim_J, size_t dim_K,
const elem_t* A, const elem_t* B,
const void * D, void * C,
size_t stride_A, size_t stride_B, size_t stride_D, size_t stride_C,
scale_t A_scale_factor, scale_t B_scale_factor, scale_acc_t D_scale_factor,
int act, acc_scale_t scale, acc_scale_t bert_scale,
bool repeating_bias,
bool transpose_A, bool transpose_B,
bool full_C, bool low_D,
uint8_t weightA,
enum tiled_matmul_type_t tiled_matmul_type) {
#define partition_rows (BANK_NUM * BANK_ROWS / 2)
#define mats_in_partition (partition_rows / DIM)
#define mats_in_acc (ACC_ROWS / DIM)
#define max_tile_i_j ((size_t)sqrt(mats_in_acc))
#define max_tile_k (mats_in_partition / max_tile_i_j)
// "db_" means "double-buffered"
#define db_partition_rows ((BANK_NUM * BANK_ROWS / 2) / 2)
#define db_mats_in_partition (db_partition_rows / DIM)
#define db_mats_in_acc ((ACC_ROWS / 2) / DIM)
#define db_max_tile_i_j ((size_t)sqrt(db_mats_in_acc))
#define db_max_tile_k (db_mats_in_partition / db_max_tile_i_j)
I have a problem with config_ld. Why do you use three config_lds in a row there? Won't each config_ld overwrite the previous one?
I now think that the description in the documentation is really not sufficient, and the part that describes Gemmini's structure may need to be expanded. When I was learning the tiled_matmul_ws example, I didn't understand many parts of the function before it calls sp_tiled_matmul_ws, probably because I don't understand deep learning and image processing. I think you could further extend the hardware-structure documentation based on the tiled_matmul_ws.c example. I would also like to learn more details of Gemmini.
I don't understand the role of preload and preload_zeros.
The preload commands are used to preload either a bias or weights into the spatial array. These preloaded values will remain stationary in the spatial array while you either accumulate on top of them (in the OS case) or use them to perform matmuls (in the WS case).
preload_zeros is just a convenience function which preloads zeros into the spatial array. This can be useful if you're performing an OS matmul without a bias.
I have some questions about BANK_NUM, BANK_ROWS, ACC_ROWS.
These are set during hardware elaboration. They define how many scratchpad banks we have, how many SRAM rows there are in each scratchpad bank, and how many SRAM rows there are in the accumulator.
I have a problem with config_ld. Why do you use three config_lds in a row there? Won't each config_ld overwrite the previous one?
We have three different sets of config_ld parameters (which is why the last number in those commands differs). By default, programmers use set 0, but if they want to intersperse loads for multiple tensors (like inputs, weights, and biases), then they can set different strides or other load-parameters for each of the three tensors at once.
I now think that the description of the documentation is really not sufficient
Sorry about that; the documentation can always be improved. Have you checked out our paper or our tutorial? You might find them more useful than our README in some cases.
When I was learning the tiled_matmul_ws example, I didn't understand many parts of the function before it calls sp_tiled_matmul_ws.
We're not really expecting people to read through the code in gemmini.h to be honest -- some of it is pretty ugly and not that readable for outsiders. But if you have questions about specific lines in tiled_matmul_ws, I might be able to help.
To be honest, the code in gemmini.h has taught me a lot. I think the best way to learn is to study other people's code. I am learning deep-learning compilers, and I would like to understand Gemmini's ISA. Thank you very much for your help! I will ask you for advice if I have questions. Thanks!
I'd like to ask you some detailed questions. I'm not quite sure what shrunk, pixel_repeats, and id mean.
#define gemmini_extended5_config_ld(stride, scale, shrunk, block_mvin_stride, pixel_repeats, id) \
ROCC_INSTRUCTION_RS1_RS2(XCUSTOM_ACC, ((uint64_t)(scale_t_to_scale_t_bits(scale)) << 32) | ((uint64_t)(block_mvin_stride) << 16) | ((uint64_t)(pixel_repeats) << 8) | ((id) << 3) | ((shrunk) << 2) | CONFIG_LD, stride, k_CONFIG) \
printf("gemmini_config_ld %lu %lu\n",stride ,((uint64_t)(scale_t_to_scale_t_bits(scale)) << 32) | ((uint64_t)(block_mvin_stride) << 16) | ((uint64_t)(pixel_repeats) << 8) | ((id) << 3) | ((shrunk) << 2) | CONFIG_LD);
We're not really expecting people to read through the code in gemmini.h to be honest -- some of it is pretty ugly and not that readable for outsiders. But if you have questions about specific lines in tiled_matmul_ws, I might be able to help.
I do not understand the following code, nor have I found the relevant information. I have used MLIR to support the general instructions and am now supporting the loop instructions. Or you could send me some of the Gemmini implementation code; I am interested in Gemmini, although I am not very familiar with Chisel. Thanks! I hope you can give me some help.
#define partition_rows (BANK_NUM * BANK_ROWS / 2)
#define mats_in_partition (partition_rows / DIM)
#define mats_in_acc (ACC_ROWS / DIM)
#define max_tile_i_j ((size_t)sqrt(mats_in_acc))
#define max_tile_k (mats_in_partition / max_tile_i_j)
// "db_" means "double-buffered"
#define db_partition_rows ((BANK_NUM * BANK_ROWS / 2) / 2)
#define db_mats_in_partition (db_partition_rows / DIM)
#define db_mats_in_acc ((ACC_ROWS / 2) / DIM)
#define db_max_tile_i_j ((size_t)sqrt(db_mats_in_acc))
#define db_max_tile_k (db_mats_in_partition / db_max_tile_i_j)
That code is just trying to determine maximum tiling parameters based on the scratchpad size in Gemmini. It assumes that half the scratchpad will be allocated for matrix A, the other half for matrix B, and the accumulator will be used to store the matmul result (C).
If we're double-buffering, then only half the scratchpad and accumulator will be available for the tiles, since the other half will be used to compute another tile.
On an exciting note, I used MLIR to add some support for the Gemmini software part, and I am now trying to run a TFLite model.
Also, I am now studying your systolic DSL project. I expect to do some research on systolic arrays at some point in the future.
On an exciting note, I used MLIR to add some support for the Gemmini software part, and I am now trying to run a TFLite model.
That's great! I'm excited to see what you come up with.
Also, I am now studying your systolic DSL project. I expect to do some research on systolic arrays at some point in the future.
I hope you find it useful. It was just a small class project though, so that DSL has very few features :)
That's great! I'm excited to see what you come up with.
I don't know if you saw the email I sent you, but I have successfully run the tensorflow model using Gemmini.
Hi, I've been working on Gemmini since January of this year, and I've added support for Gemmini in MLIR. You can see some of the things I've done here. But I seem to be in trouble now. Here's what I want to say.
When Gemmini does conv operations, the format of the conv's input and kernel is fixed in tiled_conv_auto. Because my job is to support Gemmini in MLIR, and MLIR has many conv formats (you can see here), I need to convert to Gemmini's format before running the conv, which causes a big performance loss. Is there a way to eliminate this loss? In fact, can I modify tiled_conv_auto to eliminate it? I realize this may be difficult, but I'm wondering how difficult it is, or whether there is another way to do it.
Thanks!
Hmm, supporting newer formats, as you noticed, can be difficult. However, if the innermost dimension is the same, then changing the format is only a matter of changing the DRAM/L3/L2 access strides in tiled_conv_auto or in LoopConv.scala. If you do want to change the innermost dimension though, then that might end up being a more significant undertaking.
Thank you very much for your reply. I probably won't do any more research on it at the moment, though. I'd actually be happy to share my other current dilemma here: I'm graduating probably next year and looking for a job right now, but it's not easy to find a job as an AI compiler engineer.
I'm pondering whether to stick to this path in the future (even though building compilers is my dream) or go to grad school, which is not easy either way. (Honestly, we're kind of friends, so maybe I'd like to hear your opinion or some words of comfort.)
Haha, why don't you email me directly at hngenc [at] berkeley [dot] edu, and I can give you my personal thoughts about grad school vs industry
Haha, I've emailed you.