volk icon indicating copy to clipboard operation
volk copied to clipboard

sse2neon on volk_8u_x4_conv_k7_r2_8u

Open Aang23 opened this issue 1 year ago • 5 comments

This PR adds a NEON-compatible version of the volk_8u_x4_conv_k7_r2_8u kernel, leading to a very noticeable boost in performance on ARM.

Due to the way Volk is done, I had to copy-paste the entire SSE kernel despite only the headers being swapped out. For good measures, the puppet was also copy-pasted, even though it may not be really necessary. (I can change it back depending on your preference)

First mentioned in https://github.com/gnuradio/volk/issues/243.

I am submitting this as a PR not especially to merge straight away, as sse2neon and everything this method may/will imply may need to be discussed further first.

Aang23 avatar Aug 10 '22 15:08 Aang23

Not a real machine, but here's what CI had to say about performances. Screenshot from 2022-08-10 17-57-33

The fail on gcc-8 is unrelated to this PR (at least it seems like it), but I do not know how formatting sse2neon should be dealt with.

Aang23 avatar Aug 10 '22 16:08 Aang23

I'd suggest to leave the puppet in. Due to the way our CI and volk_profile work, this is sometimes necessary. Otherwise, we'd miss this kernel in tests.

Could you add a comment section that explains how things are supposed to work?

I suggest to re-format the kernel.

jdemel avatar Aug 11 '22 12:08 jdemel

Ok, I will leave it in.

The kernel is not the main formatting issue, sse2neon.h is. Should it just be ignored or reformatted as well?

Comments-wise, I assume you mean about sse2neon and so on?

I will also test on real hardware in a day or two. Don't have anything on hand for that right now.

Aang23 avatar Aug 11 '22 15:08 Aang23

We might want to update sse2neon.h. In that case, it'd be a shame if we re-format it and we can't diff anymore.

We can go 2 routes. And I feel like a bit of redundancy might be good here. First, we can ignore sse2neon.h in our formatting checks: https://github.com/gnuradio/volk/blob/0371bf8c50d44db49a86e5c9dc993fd345a51645/.github/workflows/check-pr-formatting.yml#L4-L9 Second, use clang-format off in the whole sse2neon.h file. Even better with a comment that provides further info why we do this.

jdemel avatar Aug 15 '22 08:08 jdemel

That seems to work. I've added both methods.

Aang23 avatar Aug 15 '22 16:08 Aang23

Done! About run-tests.yml.txt, that wa just an oversight. I ran actions on my local machines and wanted to avoid running all tests to simply check formatting. Unfortunately, I forgot to revert before commiting.

Thanks for accepting this change as well! It made a pretty huge difference in usability in some parts of my software on ARM. This should also give GNU Radio (well, the CC Decoder, as it's the only one calling Volk's SIMD) a nice boost as well, and hopefully for a few other projects using this kernel. As a note, I performed the same tests I used back when it was about fixing the generic implementation of this kernel. BER results of this PR match the SSE implementation very closely.

Also, for reference, qt-dab also supports using a Spiral-generated SSE kernel (though, a different code) with sse2neon, so it's nothing that new either.

Aang23 avatar Aug 16 '22 09:08 Aang23

One last update, as I tested this on real HW (and not Android).

RUN_VOLK_TESTS: volk_8u_conv_k7_r2puppet_8u(131071,1987)
neonspiral completed in 16217.7 ms
generic completed in 118974 ms

So that's ~7.5x faster on arm64-v8a. (This was tested on an RK3318 @ 1.0Ghz)

Aang23 avatar Aug 17 '22 11:08 Aang23