RISC-V Vector 1.0 Support: If and where to start?
Hello,
RISC-V's Vector extension was ratified a few years back, and vector-capable boards have recently come out, many based on the octo-core SpacemiT K1/M1. I have been using one such board for a while now, with GNU Radio and more, but the performance is quite lacking: when profiling with volk_profile, the generic implementation is around an order of magnitude faster than the alternatives.
All that to say: I think RVV 1.0 has many instructions useful for VOLK, and I am willing to help, but I have no idea where to start. If this is in the project's plans, is there a roadmap?
Thanks for your interest in the topic.
We don't have a fixed roadmap, but we're interested in adding support for as many platforms as possible, as long as that support is maintainable. RISC-V checks all those boxes.
We build and test on RISC-V already. Next steps that would be great to tackle are:
- Add another RISC-V machine with the vector extension enabled, something like the -mavx flags for x86.
- Add checks to dynamically detect the presence of the vector extension, preferably via cpu_features.
- Start to add kernels that use RISC-V vector intrinsics.
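For the detection step, a rough sketch of what querying cpu_features could look like (this assumes its RISC-V API, i.e. cpuinfo_riscv.h with GetRiscvInfo() and an RVV feature flag; the exact names would need to be double-checked against the library):

#include <stdio.h>
#include "cpuinfo_riscv.h" /* from google/cpu_features */

int main(void)
{
    /* Assumed API: GetRiscvInfo() fills a RiscvInfo struct whose feature
     * flags include RVV. VOLK would use such a check to pick kernels at runtime. */
    const RiscvInfo info = GetRiscvInfo();
    printf("RVV available: %s\n", info.features.RVV ? "yes" : "no");
    return 0;
}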
In particular, getting the infrastructure in place together with a first kernel for RISC-V would be great. More optimized kernels should be much easier to add afterwards. Depending on the compiler, it may already be very beneficial just to have the extension and a vector machine available (the compiler may auto-vectorize the generic kernels). That needs benchmarking, of course.
What compiler are you using? I'm told GCC 14 has support for RISC-V vector instructions.
FYI, to avoid duplicating work: I'm starting to implement some of the kernels (starting from the alphabetically first).
I haven't worked with the project before, so I'm unfamiliar with the build structure and CI.
Quick update: I'm now about halfway through the kernels. Should they be optimized for smaller input sizes (<1000 elements), or does the GNU Radio use case usually work with very large chunks? I'm using the overloaded v1.0 intrinsics, which are supported in GCC >= 14 and Clang >= 18.
The GR use case is probably mostly in the 1k-10k element range, though obviously this can vary. Further, our default benchmark uses 2^17-1 elements; this is typically too large and is a historical artefact. Since your changes would require a rather recent compiler, I suggest guarding your contributions with ifdefs so that older compilers don't try to compile what they can't.
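Something along these lines, as a minimal sketch of the guarding I mean (the LV_HAVE_RVV macro name follows the LV_HAVE_* convention the existing kernels use for other archs, but its exact spelling and the kernel itself are only illustrative here):

#include <stddef.h>

/* Generic path: always compiles, on every compiler. */
static inline void my_kernel_generic(float* out, const float* in, unsigned int n)
{
    for (unsigned int i = 0; i < n; i++)
        out[i] = 2.0f * in[i];
}

/* RVV path: only compiled when the build enables the arch, so compilers
 * without RVV intrinsic support (GCC < 14, Clang < 18) never see it. */
#ifdef LV_HAVE_RVV
#include <riscv_vector.h>

static inline void my_kernel_rvv(float* out, const float* in, unsigned int n)
{
    size_t avl = n, vl;
    for (; avl > 0; avl -= vl, in += vl, out += vl) {
        vl = __riscv_vsetvl_e32m1(avl);
        vfloat32m1_t v = __riscv_vle32_v_f32m1(in, vl);
        __riscv_vse32_v_f32m1(out, __riscv_vfmul_vf_f32m1(v, 2.0f, vl), vl);
    }
}
#endif /* LV_HAVE_RVV */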
Good to see @camel-cdr here. The DVB-T2 transmitter in GNU Radio uses quite a few kernels with fairly large vectors. I also have "bit perfect" test files for the example flow graphs (although for floating point, you have to compare with some margin).
UPDATE: The DVB-T2 flow graph I'm considering uses pretty big vectors. 32768 * 19 = 622,592 complex elements (1,245,184 floats).
Let me know if you want to use that strategy for testing, and I'll set you up with a set of test files.
Also, there's some discussion about infrastructure in https://github.com/gnuradio/volk/pull/625
DVB-T2 transmitter kernels
volk_32fc_32f_multiply_32fc
volk_32fc_x2_add_32fc
volk_32f_s32f_multiply_32f
volk_32fc_magnitude_32f
volk_32fc_s32fc_multiply2_32fc
volk_32fc_s32fc_multiply_32fc
volk_32f_x2_subtract_32f
volk_32fc_x2_multiply_conjugate_32fc
volk_32f_x2_add_32f
volk_32fc_x2_multiply_32fc
Obviously, there's no one-size-fits-all. @drmpeg, these are quite large, but typical values for DVB. I hope that most kernels perform comparably well across sizes. I suppose testing for short inputs, inputs that roughly fill the L1 cache, etc. makes the most sense.
As it turns out, I was in error. After reviewing what I implemented, the vector size is only 32768 complex elements.
I asked about the input size because I'm writing the kernels to maximize LMUL without causing register spills. This means most things are implicitly unrolled 8 times, and for N != 0 the loop body always executes at least once.
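To make that concrete, here's a minimal sketch of the strip-mining pattern I mean (an illustrative float-add kernel, not one of the actual submissions): with LMUL=8 each iteration handles eight vector registers' worth of elements, and vsetvl returns a nonzero vl whenever elements remain, so the body runs at least once for N != 0.

#include <stddef.h>
#include <riscv_vector.h>

static inline void add_32f_rvv_m8(float* out, const float* a, const float* b, size_t n)
{
    size_t vl;
    for (; n > 0; n -= vl, a += vl, b += vl, out += vl) {
        vl = __riscv_vsetvl_e32m8(n); /* elements handled this pass */
        vfloat32m8_t va = __riscv_vle32_v_f32m8(a, vl);
        vfloat32m8_t vb = __riscv_vle32_v_f32m8(b, vl);
        __riscv_vse32_v_f32m8(out, __riscv_vfadd_vv_f32m8(va, vb, vl), vl);
    }
}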
For benchmarking I was just planning to run volk_profile, but if there is something else I can easily test I'd also be interested.
One annoyance is that the RISC-V toolchain doesn't provide a way to add single extensions with a command-line argument; you can only set -march to a fixed ISA string.
The best way I could think of to solve this is to always make sure the last arch of a machine enables all previous extensions.
Something like this:
<machine name="rv64gcv">
    <archs>generic riscv64 rvv orc|</archs>
</machine>
<!--machine name="rva22v">
    <archs>generic riscv64 rvv rvb rva22v orc|</archs>
</machine>
<machine name="rva23">
    <archs>generic riscv64 rvv rvb rva22v rva23 orc|</archs>
</machine-->
RVA22 and RVA23 are profiles, but google/cpu_features doesn't support them or their extensions currently.
google/cpu_features's RISC-V extension parsing is fundamentally broken at the moment, but this is unlikely to affect anything with just rv64gcv. (It would parse rv64gc_xmycustomextensionwithavsomewhere as rv64gcv)
I've created a fix, but IDK how to sign the CLA, so who knows when this will be fixed: https://github.com/google/cpu_features/pull/368
For x86, we do -mavx, -mavx2, etc. Does that work for RISC-V? I know some of the compiler flags in this realm behave differently depending on the ISA.
Is rva22v strictly < rva23? I'm glad they introduced profiles. Everything else is hard to keep track of.
volk_profile is our long-term tool. Another option would be google/benchmark for implementing micro-benchmarks.
Your machine definitions look sane to me. My gut feeling is that we need to get started with RISC-V kernels, and potentially we'd need to re-organize (or extend) our support code once we realize our approach doesn't work long-term. At the moment, I'd like to encourage you to do what you think makes the most sense.
I'll just add that the rva22u64 profile also includes bit-manipulation instructions, which some of the kernels might be able to use (e.g. there's a popcount instruction).
Yeah, I didn't want to create too many different targets, so I chose base rvv, rva22+v, and rva23, which also includes Zvbb.
I've also created a pseudo-target, rvvseg, that uses segmented loads/stores when dealing with complex numbers; it's a separate target because segmented accesses aren't fast on all current hardware (e.g. the C910). (The regular rvv target instead uses vnsrl to deinterleave the complex number components.)
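Roughly, the two deinterleaving strategies look like this (an illustrative sketch for complex float input, not the actual kernel code; vl here counts complex elements for one strip-mined pass):

#include <stdint.h>
#include <stddef.h>
#include <riscv_vector.h>

/* rvvseg variant: a segmented load splits real/imag parts directly. */
static inline void load_complex_seg(const float* in, size_t vl,
                                    vfloat32m4_t* re, vfloat32m4_t* im)
{
    vfloat32m4x2_t v = __riscv_vlseg2e32_v_f32m4x2(in, vl);
    *re = __riscv_vget_v_f32m4x2_f32m4(v, 0);
    *im = __riscv_vget_v_f32m4x2_f32m4(v, 1);
}

/* plain rvv variant: load each complex pair as one 64-bit element, then use
 * vnsrl to extract the low (real) and high (imaginary) 32-bit halves. */
static inline void load_complex_vnsrl(const float* in, size_t vl,
                                      vfloat32m4_t* re, vfloat32m4_t* im)
{
    vuint64m8_t v = __riscv_vle64_v_u64m8((const uint64_t*)in, vl);
    *re = __riscv_vreinterpret_v_u32m4_f32m4(__riscv_vnsrl_wx_u32m4(v, 0, vl));
    *im = __riscv_vreinterpret_v_u32m4_f32m4(__riscv_vnsrl_wx_u32m4(v, 32, vl));
}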
I'll try to get it ready for a PR this weekend.