volk
volk copied to clipboard
sse2neon on volk_8u_x4_conv_k7_r2_8u
This PR adds a NEON-compatible version of the volk_8u_x4_conv_k7_r2_8u kernel, leading to a very noticeable boost in performance on ARM.
Due to the way Volk is done, I had to copy-paste the entire SSE kernel despite only the headers being swapped out. For good measures, the puppet was also copy-pasted, even though it may not be really necessary. (I can change it back depending on your preference)
First mentioned in https://github.com/gnuradio/volk/issues/243.
I am submitting this as a PR not especially to merge straight away, as sse2neon and everything this method may/will imply may need to be discussed further first.
Not a real machine, but here's what CI had to say about performances.
The fail on gcc-8 is unrelated to this PR (at least it seems like it), but I do not know how formatting sse2neon should be dealt with.
I'd suggest to leave the puppet in. Due to the way our CI and volk_profile work, this is sometimes necessary. Otherwise, we'd miss this kernel in tests.
Could you add a comment section that explains how things are supposed to work?
I suggest to re-format the kernel.
Ok, I will leave it in.
The kernel is not the main formatting issue, sse2neon.h is. Should it just be ignored or reformatted as well?
Comments-wise, I assume you mean about sse2neon and so on?
I will also test on real hardware in a day or two. Don't have anything on hand for that right now.
We might want to update sse2neon.h
. In that case, it'd be a shame if we re-format it and we can't diff anymore.
We can go 2 routes. And I feel like a bit of redundancy might be good here.
First, we can ignore sse2neon.h
in our formatting checks:
https://github.com/gnuradio/volk/blob/0371bf8c50d44db49a86e5c9dc993fd345a51645/.github/workflows/check-pr-formatting.yml#L4-L9
Second, use clang-format off in the whole sse2neon.h
file. Even better with a comment that provides further info why we do this.
That seems to work. I've added both methods.
Done! About run-tests.yml.txt, that wa just an oversight. I ran actions on my local machines and wanted to avoid running all tests to simply check formatting. Unfortunately, I forgot to revert before commiting.
Thanks for accepting this change as well! It made a pretty huge difference in usability in some parts of my software on ARM. This should also give GNU Radio (well, the CC Decoder, as it's the only one calling Volk's SIMD) a nice boost as well, and hopefully for a few other projects using this kernel. As a note, I performed the same tests I used back when it was about fixing the generic implementation of this kernel. BER results of this PR match the SSE implementation very closely.
Also, for reference, qt-dab also supports using a Spiral-generated SSE kernel (though, a different code) with sse2neon, so it's nothing that new either.
One last update, as I tested this on real HW (and not Android).
RUN_VOLK_TESTS: volk_8u_conv_k7_r2puppet_8u(131071,1987)
neonspiral completed in 16217.7 ms
generic completed in 118974 ms
So that's ~7.5x faster on arm64-v8a. (This was tested on an RK3318 @ 1.0Ghz)