XiangShan
Simulation hangs for longer running functions using the vector extension
RVV support was recently merged into the master branch, and I tried running a few of my benchmarks on it, but ran into problems. Only very basic RVV functions worked; the others seem to silently hang the simulation.
For the following I've modified the $AM_HOME/apps/hello example code, and added asm.S to SRCS in the Makefile. I've attached my entire reproducible docker setup at the end of the issue.
Here are two of the programs that hang the simulation indefinitely:
// asm.S
.text
.balign 8
.global ascii_to_utf16
ascii_to_utf16:
1:
vsetvli t0, a2, e8, m1, ta, ma
vle8.v v0, (a1)
vsetvli x0, x0, e16, m2, ta, ma # this originally had a bug, and used mf2
vzext.vf2 v8, v0
vse16.v v8, (a0)
add a1, a1, t0
sub a2, a2, t0
slli t0, t0, 1
add a0, a0, t0
bnez a2, 1b
ret
// hello.c
#include <klib.h>
size_t ascii_to_utf16(uint16_t *dst, uint8_t *src, size_t n);
int main(void) {
static uint8_t src[100] = {1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0};
static uint16_t dst[sizeof src]={};
printf("beg\n");
ascii_to_utf16(dst, src, sizeof src);
printf("end\n");
return 0;
}
# asm.S
.text
.balign 8
.global LUT4
LUT4:
li t0, 16
vsetvli zero, t0, e8, m1, ta, ma
vle8.v v0, (a0)
1:
vsetvli a0, a2, e8, m1, ta, ma
vle8.v v8, (a1)
vand.vi v8, v8, 15
vrgather.vv v16, v0, v8
vse8.v v16, (a1)
sub a2, a2, a0
add a1, a1, a0
bnez a2, 1b
ret
// hello.c
#include <klib.h>
size_t LUT4(uint8_t lut[16], uint8_t *ptr, size_t n);
int main(void) {
static uint8_t mem[100];
static uint8_t lut[16] = { 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 6 };
printf("beg\n");
LUT4((uint8_t *)lut, mem, sizeof mem);
printf("end\n");
return 0;
}
The problems only seem to occur with larger iteration counts, e.g. the ascii_to_utf16 code works fine when processing 80 instead of 100 elements. This seems to indicate that there might be a problem with a scheduler or an internal buffer filling up.
Since I also ran into problems on other implementations, I've got a quick instruction testing script that executes random instructions. However, the ~50 trials of short random instruction streams I've tested didn't run into any problems. That's good and points towards this being a single problem, that seems to only occur with longer runs.
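For runs that do complete, the output can be checked against a scalar model. The following is only a sketch: the `_ref` functions and `first_mismatch_u16` are hypothetical helpers that are not part of the original test programs, and the LUT4 model assumes VLMAX is at least 16 for e8/m1 (VLEN >= 128), so that every index masked with 15 is in range for vrgather.vv.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the ascii_to_utf16 loop: zero-extend each byte to 16 bits
   (what vzext.vf2 does per element). Hypothetical helper for checking. */
static void ascii_to_utf16_ref(uint16_t *dst, const uint8_t *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Scalar model of LUT4: replace each byte in place with lut[byte & 15].
   Assumes the vector version sees all 16 table entries (VLMAX >= 16). */
static void lut4_ref(const uint8_t lut[16], uint8_t *ptr, size_t n) {
    for (size_t i = 0; i < n; i++)
        ptr[i] = lut[ptr[i] & 15];
}

/* Returns the index of the first differing element, or n if all match. */
static size_t first_mismatch_u16(const uint16_t *a, const uint16_t *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return i;
    return n;
}
```

In hello.c, one could run the vector routine into dst, run ascii_to_utf16_ref into a second buffer, and print the result of first_mismatch_u16 after "end".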
Environment Reproduction
I've used the following Dockerfile to build the repository on top of the latest commit to master. It was run when 0c00289 was the latest commit; since then there has been only a single new commit, which doesn't look like it would fix the problem, since it's a tiny adjustment to the LSU.
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y build-essential clang libclang-dev llvm-dev cmake libspdlog-dev vim git libmlpack-dev curl wget time default-jre default-jdk
RUN git clone --recursive https://github.com/OpenXiangShan/xs-env
WORKDIR /xs-env
RUN sed 's/apt\S* install/\0 -y/g;s/source /. /g;s/sudo //g' -i ./*.sh && echo 1
RUN . ./env.sh && sed 's/$/; cd \/xs-env/g' -i ./update-submodule.sh && ./update-submodule.sh
RUN . ./env.sh && ./setup-tools.sh
RUN . ./env.sh && . ./install-verilator.sh
RUN . ./env.sh && sed 's/^git submodule.*$//g;s/env.*$//g' -i ./setup.sh && . ./setup.sh
RUN . ./env.sh && make -C XiangShan init
RUN . ./env.sh && make -C XiangShan emu CONFIG=DefaultConfig MFC=1 -j 8
RUN . ./env.sh && sed 's/unknown-//g;s/rv64gc/rv64gcv/g' -i $AM_HOME/am/arch/isa/riscv64.mk
# Once in the docker environment, I used the following to build and simulate the programs:
# source env.sh; cd $AM_HOME/apps/hello
# make ARCH=riscv64-xs && $NOOP_HOME/build/emu --no-diff -i ./build/hello-riscv64-xs.bin 2>/dev/null
PS: I've also run into problems with rdcycle not working properly with vector instructions: a loop with 10x more iterations took fewer cycles than one with fewer iterations. Is rdcycle supposed to work with vector instructions in the current implementation? I'll have to investigate this further and share reproducible code.
Update:
I tried running it on a few other branches:
adc944d tmp-backend-fixtiming-merge-master: same problem
824af1e vlsu-240315: same problem but worse, even a lower iteration count froze the simulation.
I also ran the current master with the MinimalConfig, instead of DefaultConfig.
This caused ascii_to_utf16 to run flawlessly, but the LUT4 program hit an assertion instead of stalling:
# LUT4 error:
Assertion failed at line 170883.
The simulation stopped. There might be some assertion failed.
Core 0: ABORT at pc = 0x80001248
instrCnt = 278, cycleCnt = 3643, IPC = 0.076311
Seed=0 Guest cycle spent: 3646 (this will be different from cycleCnt if emu loads a snapshot)
Host time spent: 8451ms
[ERROR][time= 3645] TOP.SimTop.l_soc.core_with_l2.core.frontend.ftq:
commit cfi can be non c_commited
Assertion failed
at LogUtils.scala:54 assert(false.B)
Other functions still seem to stall though, e.g.:
# asm.S
.text
.balign 8
# generated by clang, see: https://github.com/camel-cdr/rvv-bench/blob/main/bench/mandelbrot.S
.global mandelbrot_rvv
mandelbrot_rvv:
beqz a0, rvv_13
beqz a1, rvv_9
li a7, 0
fcvt.s.wu fa5, a0
lui a3, 262144
fmv.w.x fa4, a3
fdiv.s fa5, fa4, fa5
lui a3, 785408
fmv.w.x fa4, a3
lui a3, 784384
fmv.w.x fa3, a3
lui a3, 264192
fmv.w.x fa2, a3
slli a6, a0, 2
j rvv_4
rvv_3:
addi a7, a7, 1
add a2, a2, a6
beq a7, a0, rvv_13
rvv_4:
fcvt.s.wu fa1, a7
mv t0, a0
j rvv_6
rvv_5:
slli a3, t0, 2
add a3, a3, a2
vsetvli zero, zero, e32, m1, ta, ma
vse32.v v8, (a3)
beqz t0, rvv_3
rvv_6:
vsetvli t1, t0, e32, m1, ta, ma
sub t0, t0, t1
vmset.m v0
vmv.v.i v8, 0
viota.m v10, v0
vadd.vx v10, v10, t0
vfcvt.f.xu.v v10, v10
vfmv.v.f v12, fa1
vfmul.vf v10, v10, fa5
vfadd.vf v10, v10, fa4
vfmul.vf v12, v12, fa5
vfadd.vf v12, v12, fa3
vmv.v.i v18, 0
li a3, 1
mv a5, a1
vmv.v.i v14, 0
vmv.v.i v16, 0
vmv.v.i v20, 0
rvv_7:
vsetvli zero, t1, e8, mf4, ta, ma
vfirst.m a4, v0
bltz a4, rvv_5
vsetvli zero, zero, e32, m1, ta, ma
vfadd.vv v22, v16, v20
vmflt.vf v0, v22, fa2
vfsub.vv v16, v16, v20
vfadd.vv v18, v18, v18
vfadd.vv v22, v16, v10
vfmadd.vv v14, v18, v12
vfmul.vv v16, v22, v22
vfmul.vv v20, v14, v14
vmerge.vxm v8, v8, a3, v0
addi a5, a5, -1
addi a3, a3, 1
vmv.v.v v18, v22
bnez a5, rvv_7
j rvv_5
rvv_9:
slli a3, a0, 2
rvv_10:
mv a4, a0
rvv_11:
vsetvli a5, a4, e32, m1, ta, ma
sub a4, a4, a5
vmv.v.i v8, 0
slli a5, a4, 2
add a5, a5, a2
vse32.v v8, (a5)
bnez a4, rvv_11
addi a1, a1, 1
add a2, a2, a3
bne a1, a0, rvv_10
rvv_13:
ret
// hello.c
#include <klib.h>
void mandelbrot_rvv(size_t width, size_t maxIter, uint32_t *res);
int main(void) {
#define W 10
static uint32_t img[W*W] = {0.0f};
printf("beg\n");
mandelbrot_rvv(W, 20, img);
printf("end\n");
return 0;
}
Update:
Retested on newer branches:
7fd388cb: all problems persist
78c76c7: all problems persist
7390003: all problems persist
Thank you for your bug report, we are handling this.
The vector extension is still work-in-progress. It may be more stable after Apr. 30.
I just tried running it on the development branches, and while it behaved the same on fp-split and new-csr, the mandelbrot and LUT4 code snippets completed successfully on the vlsu-240315 branch using the MinimalConfig, even when increasing the iteration count. ascii_to_utf16 however still hangs on that branch. It does complete however, when I remove the vzext.vf2 v8, v0 instruction, so that might be the cause of this bug.
I'll now try it again on DefaultConfig, and update this comment once it's done building and I've been able to run the tests.
Update: On the vlsu-240315 branch, DefaultConfig still hangs on the LUT4 and ascii_to_utf16 code, but mandelbrot works fine even with larger inputs.
Edit: Just tried the vlsu-merge-master-0504, which from what I can tell merges the vlsu-240315 branch with master, and the problems are back. Sounds like it was introduced between those commits.
Thank you very much for your attention to the development of XiangShan and sorry for not replying in time. At present, the vector extension of XiangShan is under development, and the support for segment instruction is not perfect yet. Due to the reasons of time and manpower, there are still some problems for the time being. We will conduct the test of rvv-bench in the future.
FYI: none of the code samples above work now, and while memcpy still works, even things like saxpy stopped working on the latest commit:
// asm.S
.text
.balign 8
.global saxpy
saxpy:
vsetvli a4, a0, e32, m8, ta, ma
vle32.v v0, (a1)
sub a0, a0, a4
slli a4, a4, 2
add a1, a1, a4
vle32.v v8, (a2)
vfmacc.vf v8, fa0, v0
vse32.v v8, (a2)
add a2, a2, a4
bnez a0, saxpy
ret
// hello.c
#include <klib.h>
void saxpy(size_t n, float a, float *b, float *c);
int main(void) {
static float src[128] = { 1, 2, 3, 4, 5 }, dst[128] = { 0 };
printf("beg\n");
saxpy(128, 0.3, src, dst);
printf("end\n");
return 0;
}
I'm currently building the DefaultConfig with DRAMsim3.
Thank you for your concerns and questions. I am very sorry for not replying in time. Please re-test on the branch of vlsu-240315, if the problem still exists, please contact us for feedback.
Hi @Anzooooo, thanks for the notice.
I tested it again just now and realized that ascii_to_utf16 had a copy-and-paste bug from my benchmark, which I've now fixed in the first message.
The MinimalConfig on both the vlsu-240315 branch and the master branch ran all the snippets successfully; however, both hang indefinitely in the vectorized saxpy for the DefaultConfig + DRAMSim3. The saxpy works on an array with up to 64 entries, but hangs if it's passed more than that.
I used make -C XiangShan emu CONFIG=DefaultConfig WITH_DRAMSIM3=1 MFC=1 -j 8 for the DefaultConfig build.
Since MinimalConfig + DRAMSim3 works, I'll use this for testing and benchmarking for now, but being able to use the full configuration would be nice.
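To relate the 64-entry threshold to loop iterations, and to check results for the sizes that do complete, here is a rough sketch. All helper names are hypothetical, VLEN = 128 is an assumption about the simulated core rather than something stated in this thread, and the iteration count is only exact when n is a multiple of VLMAX (vsetvli is allowed to split the tail differently otherwise).

```c
#include <stddef.h>

/* VLMAX in elements for an integer LMUL: VLEN / SEW * LMUL.
   vlen_bits is an assumed hardware parameter, e.g. 128. */
static size_t vlmax_elems(size_t vlen_bits, size_t sew_bits, size_t lmul) {
    return vlen_bits / sew_bits * lmul;
}

/* Number of strip-mined loop iterations needed for n elements,
   assuming each vsetvli grants min(remaining, vlmax). */
static size_t strip_mine_iters(size_t n, size_t vlmax) {
    return (n + vlmax - 1) / vlmax;
}

/* Scalar model of the saxpy kernel above: c[i] += a * b[i],
   matching vfmacc.vf v8, fa0, v0 with v0 = b and v8 = c. */
static void saxpy_ref(size_t n, float a, const float *b, float *c) {
    for (size_t i = 0; i < n; i++)
        c[i] += a * b[i];
}

/* Counts elements that differ by more than tol. A tolerance is used
   because the vector version is fused and may round differently. */
static size_t count_mismatches(const float *got, const float *want,
                               size_t n, float tol) {
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        float d = got[i] - want[i];
        if (d < 0)
            d = -d;
        if (d > tol)
            bad++;
    }
    return bad;
}
```

Under the VLEN = 128 assumption, e32/m8 gives 32 elements per iteration, so 64 entries would be two trips through the loop and 128 entries four.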
@Anzooooo
Ok, this is really weird. The DefaultConfig works if I remove any print statements before the RVV code.
So for the test cases commenting out the printf("beg\n"); makes it work. I've got no idea why this would be happening, maybe nexus-am has problems switching to m-mode and back with RVV?
Hi @camel-cdr , thank you for your question. In your question, I noticed that you have ported 'rvv-bench' to 'nexus-am' for testing. Could you please provide your code to help us find the problem?
@Anzooooo I haven't ported that yet, only isolated benchmarks, for testing.
The problem occurs with the code examples I posted here.
When I try executing the saxpy code above, then it hangs after printing "beg". When I remove the printf("beg\n");, then it runs successfully and prints "end".
I figured out that you can also add a __asm volatile("fence.i"); after the printf("beg\n");, and it will run successfully and print "beg\nend\n".
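As a sketch of where the fence goes relative to the print and the RVV call: the wrapper and the dummy kernel below are hypothetical and only for illustration, and stdio.h stands in for klib.h so the same file also builds on a non-RISC-V host, where the fence compiles to nothing. This assumes the toolchain accepts fence.i for the selected -march.

```c
#include <stdio.h>

/* Instruction fence. Guarded so the sketch also compiles natively,
   where it is a no-op. */
static inline void instruction_fence(void) {
#if defined(__riscv)
    __asm__ volatile("fence.i" ::: "memory");
#endif
}

/* Stand-in for an RVV routine such as saxpy; counts calls so the
   wrapper can be exercised without vector hardware. */
static int kernel_calls;
static void dummy_kernel(void) {
    kernel_calls++;
}

/* Hypothetical wrapper mirroring the structure of main() in hello.c:
   print, fence, run the kernel, print again. */
static int run_with_fence(void (*kernel)(void)) {
    printf("beg\n");
    instruction_fence(); /* the workaround: fence between printf and RVV code */
    kernel();
    printf("end\n");
    return 0;
}
```

In the real test case, kernel would be replaced by the call to the assembly routine with its arguments.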
The testing I did with that also ran into a hanging problem, but I used the master branch for that and have yet to test vlsu-240315. But the printf + fence.i things were tested on vlsu-240315 as well.
@camel-cdr Thanks for your question.
We have fixed the issue in commit 1a0c5d77fed8b13a9fa5f70d1f25d808a311dd7b, which is part of this pull request (https://github.com/OpenXiangShan/XiangShan/pull/3140). It will be integrated into master later.
In addition, we will be developing on master later, and the vlsu-240315 branch will be scrapped, if you want to test later, please do it on master.
@Anzooooo Great, I tried the PR, and the code here works now. I still run into other problems when I try to run my benchmark code, but I'll create a new issue with reproduction steps, which I still need to figure out, once the PR has been merged.
The last problem I mentioned was apparently caused by a nexus-am bug; almost all code from rvv-bench runs. The only things that don't run include complex loads/stores, idk if those are supposed to work yet.
It's currently running, I'll update the website with results once it's done, and I'll include build instructions.
Thanks for your contributions to XiangShan, we have merged the PR into nexus-am.
Looks like this is a problem again. I tried running the first example on the latest commit, and it freezes again in the DefaultConfig.
Now any vector instruction seems to cause the freeze, e.g.:
// asm.S
.text
.balign 8
.global foo
foo:
vsetvli t0, x0, e8, m1, ta, ma
ret
// hello.c
#include <klib.h>
void foo(void);
int main(void) {
printf("beg\n"); foo(); printf("end\n");
return 0;
}
I'm surprised this wasn't caught by the new rvv-test CI, maybe it works for MinimalConfig, but not for DefaultConfig.
Idk why the CI works, I just tried with MinimalConfig again, and the above with just vsetvli fails an assertion:
Assertion failed at /xs-env/XiangShan/build/rtl/Rob.sv:37719.
Core 0: ABORT at pc = 0x10
The Dockerfile is still the same as before; I've only enabled DRAMsim3:
FROM ubuntu:23.04
RUN apt-get update && apt-get install -y build-essential clang libclang-dev llvm-dev cmake libspdlog-dev vim git curl wget time default-jre default-jdk
RUN git clone --recursive https://github.com/OpenXiangShan/xs-env
WORKDIR /xs-env
RUN sed 's/apt\S* install/\0 -y/g;s/source /. /g;s/sudo //g' -i ./*.sh
RUN . ./env.sh && sed 's/\/master/\/master/g;s/$/; cd \/xs-env/g' -i ./update-submodule.sh && ./update-submodule.sh
RUN . ./env.sh && ./setup-tools.sh
RUN . ./env.sh && . ./install-verilator.sh
RUN . ./env.sh && sed 's/^git submodule.*$//g;s/env.*$//g' -i ./setup.sh && . ./setup.sh
RUN . ./env.sh && make -C XiangShan init
RUN . ./env.sh && cd DRAMsim3 && mkdir build && cd build && cmake -D COSIM=1 .. && make -j 8
RUN . ./env.sh && make -C XiangShan emu CONFIG=MinimalConfig WITH_DRAMSIM3=1 MFC=1 -j 8
#RUN . ./env.sh && make -C XiangShan emu CONFIG=DefaultConfig WITH_DRAMSIM3=1 MFC=1 -j 8
RUN . ./env.sh && sed 's/unknown-//g;s/rv64gc/rv64gcv/g' -i $AM_HOME/am/arch/isa/riscv64.mk
Edit: still a problem as of 2024-08-15
In XiangShan, if you do not set the VS field in the status CSR, executing any vector instruction will trigger an illegal instruction exception. Please set VS first and then execute the vector instruction.
Ah, thanks. I thought nexus-am already does it, but apparently it just wasn't checked until NewCSR was merged. I'm rerunning the benchmarks now.
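For reference, a minimal sketch of setting VS before the first vector instruction. It assumes the code runs in M-mode and that the CSR in question is mstatus, where the V extension places VS in bits 10:9; the helper names are hypothetical and this is not taken from nexus-am. The CSR access is guarded so the bit manipulation can also be compiled and checked on a non-RISC-V host.

```c
/* VS field of mstatus: bits 10:9 (0 = Off, 1 = Initial, 2 = Clean, 3 = Dirty). */
#define MSTATUS_VS_SHIFT 9
#define MSTATUS_VS_MASK  (3UL << MSTATUS_VS_SHIFT)

enum { VS_OFF = 0, VS_INITIAL = 1, VS_CLEAN = 2, VS_DIRTY = 3 };

/* Returns mstatus with the VS field replaced by vs; other bits unchanged. */
static unsigned long mstatus_with_vs(unsigned long mstatus, unsigned vs) {
    return (mstatus & ~MSTATUS_VS_MASK) |
           ((unsigned long)(vs & 3u) << MSTATUS_VS_SHIFT);
}

/* Hypothetical helper: turn the vector unit on by setting VS = Initial.
   Assumes M-mode; on other hosts this compiles to nothing. */
static void enable_vector_unit(void) {
#if defined(__riscv)
    unsigned long m;
    __asm__ volatile("csrr %0, mstatus" : "=r"(m));
    m = mstatus_with_vs(m, VS_INITIAL);
    __asm__ volatile("csrw mstatus, %0" : : "r"(m));
#endif
}
```

Calling enable_vector_unit() at the top of main(), before foo() or any other RVV routine, would be one way to apply the advice above in the test programs.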