Add benches for strided load/store with different strides
I just found an issue on the K230 while doing some auto-vectorization tests with https://github.com/UoB-HPC/TSVC_2. The vectorized s1115 loop looks like this:
```asm
.LBB9_7:                          # %vector.ph
        andi    a6, s6, 256
        vsetvli a2, zero, e32, m2, ta, ma
.LBB9_8:                          # %vector.body
        vl2re32.v v8, (a4)
        vlse32.v  v10, (a5), s11  # s11 = 1024
        vl2re32.v v12, (a2)
        vfmacc.vv v12, v8, v10
        vs2r.v    v12, (a4)
        add  a4, a4, s0
        add  a2, a2, s0
        sub  a3, a3, s9
        add  a5, a5, s2
        bnez a3, .LBB9_8
```
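For context, the s1115 kernel in TSVC_2 is (as far as I remember) essentially the loop below; the transposed access `cc[j][i]` is what the compiler turns into the `vlse32.v` with stride `s11 = 1024` (LEN_2D = 256 floats per row × 4 bytes):

```c
// s1115 from TSVC_2, paraphrased; aa, bb, cc are LEN_2D x LEN_2D float
// arrays with LEN_2D = 256. aa and bb are read row-wise (unit stride),
// while cc[j][i] walks a column, i.e. a 256 * 4 = 1024-byte stride.
for (int i = 0; i < LEN_2D; i++)
    for (int j = 0; j < LEN_2D; j++)
        aa[i][j] = aa[i][j] * cc[j][i] + bb[i][j];
```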
It seems that strided loads/stores with strides in [1024, 4096] perform noticeably worse. Here is a simple probe program:
```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Emit a single strided byte load (vlse8.v) at the given LMUL.
// vsetvli writes t0 and vlse8.v writes v0, so declare both as clobbers.
#define DEFINE_VLSE(LMUL)                                           \
  static inline __attribute__((always_inline)) void vlse_##LMUL(    \
      int *base, int stride) {                                      \
    __asm__("vsetvli t0, zero, e8, " #LMUL ", ta, ma\n"             \
            "vlse8.v v0, (%0), %1"                                  \
            :: "r"(base), "r"(stride)                               \
            : "t0", "v0", "memory");                                \
  }

DEFINE_VLSE(m1)
DEFINE_VLSE(m2)
DEFINE_VLSE(m4)
DEFINE_VLSE(m8)
DEFINE_VLSE(mf2)
DEFINE_VLSE(mf4)
DEFINE_VLSE(mf8)

int main(int argc, char **argv) {
  if (argc < 3) {
    fprintf(stderr, "usage: %s <stride> <times>\n", argv[0]);
    return 1;
  }
  int stride = atoi(argv[1]);
  int times = atoi(argv[2]);
  // __attribute__((aligned(64)))
  int data[64 * stride]; // VLA sized so even the m8 case stays in bounds

// Time `times` back-to-back strided loads at one LMUL.
#define BENCH_VLSE(LMUL)                                            \
  {                                                                 \
    clock_t start = clock();                                        \
    for (int i = 0; i < times; i++)                                 \
      vlse_##LMUL(data, stride);                                    \
    clock_t end = clock();                                          \
    printf("LMUL: " #LMUL "\tstride: %d\t time: %ld\n", stride,     \
           (long)(end - start));                                    \
  }

  BENCH_VLSE(mf8)
  BENCH_VLSE(mf4)
  BENCH_VLSE(mf2)
  BENCH_VLSE(m1)
  BENCH_VLSE(m2)
  BENCH_VLSE(m4)
  BENCH_VLSE(m8)
}
```
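(For reference: the probe needs a vector-enabled toolchain, e.g. something like `gcc -O2 -march=rv64gcv probe.c -o probe`, and is invoked as `./probe <stride> <times>`; the exact flags depend on the toolchain you use for the K230.)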
The results look like this (times are `clock()` ticks; the abnormal results are in bold):
| Stride (bytes) | MF8 | MF4 | MF2 | M1 | M2 | M4 | M8 |
|---|---|---|---|---|---|---|---|
| 4 | 38479 | 51332 | 76931 | 128148 | 230645 | 435399 | 844990 |
| 8 | 38521 | 51333 | 76922 | 128128 | 230579 | 435395 | 844891 |
| 16 | 38530 | 51323 | 76962 | 128129 | 230566 | 435341 | 845195 |
| 32 | 38511 | 51373 | 76932 | 128150 | 230656 | 435388 | 845083 |
| 64 | 38529 | 51322 | 76947 | 128205 | 230624 | 435417 | **23954097** |
| 128 | 38517 | 51338 | 76926 | 128128 | 230608 | **12351222** | **31148420** |
| 256 | 38487 | 51288 | 76945 | 128152 | **5824701** | **15177587** | **34006290** |
| 512 | 38526 | 51292 | 76943 | **2855170** | **7439032** | **16828930** | **35689412** |
| 1024 | 38511 | 51324 | **1152269** | **3424329** | **7957662** | **17053724** | **35144136** |
| 2048 | 38520 | **224200** | **709725** | **1396708** | **4226251** | **8330476** | **16689498** |
| 4096 | 38507 | **317053** | **640199** | **1507778** | **3093916** | **6358825** | **12725241** |
| 8192 | 38499 | 51349 | 76956 | 128285 | **1255252** | **2483829** | **4943195** |
| 16384 | 38525 | 51329 | 76975 | 128337 | **1255245** | **2484334** | **4975494** |
It's weird that performance improves again once the stride grows beyond 4096, so this issue may not be related to crossing cache lines or pages. It may be an issue with the hardware prefetcher.

So, my request is: can we add some benches for this kind of scenario?
I'll look into it; this could be a new load/store benchmark under the instructions folder. I tried adding the load/store instructions to the other instruction measurements, but they didn't really fit into that framework anyway.
The behavior is indeed quite weird, but how could that be a problem with the cache lines or the prefetcher? Shouldn't the CPU easily be able to hold 16 cache lines in the m1 case (e8/m1 on the C908's 128-bit VLEN gives vl = 16, so at most 16 distinct lines are touched)? I mean, it's repeatedly accessing the same few addresses.

IIRC, you could adjust the prefetch mode on the C920, so the C908 might support that as well.
> The behavior is indeed quite weird, but how could that be a problem with the cache lines or the prefetcher? Shouldn't the CPU easily be able to hold 16 cache lines in the m1 case? It's repeatedly accessing the same few addresses.
Currently, this is just a guess (the L1 D-cache misses increase a lot), and I have sent feedback to T-Head.
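For anyone who wants to reproduce the cache-miss measurement, a minimal sketch using Linux `perf_event_open` might look like the following; whether the K230 kernel actually wires up the generic L1D cache event is an assumption on my part, and the workload loop is a stand-in for one of the probe's `BENCH_VLSE` bursts:

```c
// Sketch: count L1D read misses around a load burst via perf_event_open.
// Assumes the kernel exposes the generic L1D cache event on this core --
// untested on the K230, so treat this as a starting point.
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_l1d_read_miss_counter(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CACHE_L1D |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;        // start stopped; enable around the workload
    attr.exclude_kernel = 1;  // count user-space misses only
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int fd = open_l1d_read_miss_counter();
    if (fd < 0) { perror("perf_event_open"); return 1; }

    static int data[1 << 20]; // stand-in buffer; replace the loop below
    volatile int sink = 0;    // with a BENCH_VLSE burst from the probe

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int i = 0; i < (1 << 20); i += 256) // strided reads
        sink += data[i];
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t misses = 0;
    if (read(fd, &misses, sizeof misses) != sizeof misses) return 1;
    printf("L1D read misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```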