Missing NEON implementations

Open ast opened this issue 5 years ago • 38 comments

Here's a list of kernels missing NEON implementations. Some of these are easy, some are hard, some would not provide any benefit. I'm determined to write as many as I can, so I'm creating this list to keep track.

Benchmarks are rough estimates based on volk_test_all on a Raspberry Pi 3B in aarch64 mode.

My branch is here: https://github.com/TujaSDR/volk/tree/tuja

Queued for improvement

Ideas for new kernels

  • Sum real + imaginary, very useful for SSB demodulation.
  • Full scale int32_t and int16_t to float +-1.0 using hardware instructions for fixed point.
  • Rotator by Pi (probably the fastest way of centering an FFT on DC)
  • Rotator by +-Pi/2 (very efficient way for down conversion by +-Pi/2)
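
To make the arithmetic behind these ideas concrete, here is a minimal scalar sketch in Python. It is not the NEON code from the branch; the function names are hypothetical and chosen only for illustration. Rotating by Pi multiplies sample n by (-1)^n, and rotating by ±Pi/2 multiplies it by (±j)^n, so both need only negations and real/imaginary swaps, no multiplies. The int16 conversion assumes "full scale" means dividing by 2^15; on NEON the fixed-point convert intrinsics such as `vcvtq_n_f32_s32` can presumably fold that scaling into the conversion itself.

```python
def rotate_pi(samples):
    """Multiply sample n by exp(j*pi*n) = (-1)**n: negate odd-indexed samples."""
    return [-s if n % 2 else s for n, s in enumerate(samples)]


def rotate_half_pi(samples, sign=+1):
    """Multiply sample n by (sign*j)**n using only swaps and negations."""
    out = []
    for n, s in enumerate(samples):
        re, im = s.real, s.imag
        k = (sign * n) % 4          # position in the cycle 1, j, -1, -j
        if k == 0:
            out.append(complex(re, im))    # times 1
        elif k == 1:
            out.append(complex(-im, re))   # times j
        elif k == 2:
            out.append(complex(-re, -im))  # times -1
        else:
            out.append(complex(im, -re))   # times -j
    return out


def add_real_imag(samples):
    """Sum of real and imaginary part of each complex sample."""
    return [s.real + s.imag for s in samples]


def int16_to_float(values):
    """Map full-scale int16 to [-1.0, 1.0) by dividing by 2**15 (assumed scaling)."""
    return [v / 32768.0 for v in values]
```

For example, `rotate_half_pi([1+0j] * 4)` walks through 1, j, -1, -j, and `sign=-1` walks the same cycle in the opposite direction.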

New kernels added

  • [x] volk_32fc_add_real_imag_32f
  • [x] volk_32i_convert_32f
  • [x] volk_32f_convert_32i

Kernels where I updated NEON support

  • [x] volk_32fc_convert_16ic_neon (neonv8 x1.32 and cleaner)
  • [x] volk_32f_index_max_32u (x1.18)
  • [x] volk_32fc_magnitude_32f (added neonv8, it's a tie with neonv7)
  • [x] volk_32f_log2_32f (not faster)

Kernels that were missing NEON support

  • [x] PR sent volk_32fc_s32fc_x2_rotator_32fc.h (~14x (!) speedup)
  • [x] PR sent volk_16ic_magnitude_16i.h (neonv8 is ~3.7x faster)
  • [x] PR sent volk_32f_sin_32f.h (~3.5x, managed to push it to ~x4)
  • [x] PR sent volk_32f_cos_32f.h (~3.5x, managed to push it to ~x4)
  • [x] PR sent volk_32f_tan_32f.h (~4.7x, managed to push it to ~x5.6)
  • [x] PR sent volk_32fc_index_max_32u.h (~3.1x faster)
  • [x] PR sent volk_32fc_s32f_power_spectrum_32f.h (~6x speedup)
  • [x] volk_32f_asin_32f.h (~9x)
  • [x] volk_32f_acos_32f.h (~9x)
  • [x] volk_32f_tanh_32f.h (~7.2x speedup)
  • [x] volk_16ic_s32f_magnitude_32f.h (~3.3x faster)
  • [x] volk_32f_s32f_convert_32i.h (~3.3x speedup neonv8 because of rounding)
  • [x] volk_32f_expfast_32f.h (~3.3x speedup)
  • [x] volk_32i_s32f_convert_32f.h (~1.3x speedup)
  • [x] volk_32f_x2_pow_32f.h (~2.7x speedup)
  • [x] volk_32f_sqrt_32f.h (~2x speedup with neonv8, had neon but I added neonv8)
  • [x] volk_32f_s32f_normalize.h (~2x)
  • [x] volk_32f_s32f_stddev_32f.h (only ~10% faster)
  • [x] volk_32f_stddev_and_mean_32f_x2.h (~2x)
  • [x] volk_32f_s32f_convert_16i.h (only ~10% faster, neonv8 because of rounding)
  • [x] volk_32f_s32f_convert_8i.h (6x)
  • [x] volk_32fc_s32f_magnitude_16i.h (3.2x neonv8 because sqrt and rounding)
  • [x] volk_32f_accumulator_s32f.h (~20% faster)
  • [x] volk_32f_atan_32f.h (4x faster)
  • [x] volk_32f_s32f_power_32f.h (~4x speedup, need work on tolerance)
  • [x] volk_32fc_s32f_atan2_32f.h (5x faster but small precision tolerance)
  • [x] volk_32fc_s32f_power_32fc.h (5x same as above)
  • [x] volk_64f_x2_add_64f.h (f64 only in neonv8, neon is slightly faster)
  • [x] volk_64f_x2_multiply_64f.h (~10%)
  • [x] volk_64f_x2_max_64f.h (more or less same as generic)
  • [x] volk_64f_x2_min_64f.h (more or less same as generic)
  • [x] volk_32f_64f_multiply_64f.h (generic is faster; I have done f64 in half points but they can maybe be unrolled to quarter points...)
  • [x] volk_32f_convert_64f.h (generic is faster)
  • [x] volk_64f_convert_32f.h (generic is faster)
  • [x] volk_32u_popcnt.h (not really available for SIMD)
  • [x] volk_32f_s32f_s32f_mod_range_32f.h (~2.5x)
  • [x] volk_32f_s32f_32f_fm_detect_32f.h (~2x a bit variable it seems...)
  • [x] volk_32f_x2_dot_prod_16i.h (more or less a tie)
  • [x] volk_32f_binary_slicer_32i.h (more or less a tie)
  • [x] volk_32f_s32f_calc_spectral_noise_floor_32f.h (need to find better algorithm if possible)
  • [x] volk_32fc_s32f_x2_power_spectral_density_32f.h (no puppet?)
  • [x] volk_16ic_s32f_deinterleave_real_32f.h (neon not faster)
  • [ ] volk_32fc_s32f_deinterleave_real_16i.h
  • [ ] volk_32f_x2_s32f_interleave_16ic.h
  • [ ] volk_8ic_deinterleave_real_16i.h
  • [ ] volk_8u_x4_conv_k7_r2_8u.h
  • [ ] volk_8ic_s32f_deinterleave_real_32f.h
  • [ ] volk_8ic_x2_multiply_conjugate_16ic.h
  • [ ] volk_8ic_s32f_deinterleave_32f_x2.h
  • [ ] volk_32fc_deinterleave_64f_x2.h
  • [ ] volk_32fc_deinterleave_real_64f.h
  • [ ] volk_32f_64f_add_64f.h
  • [ ] volk_16i_permute_and_scalar_add.h
  • [ ] volk_8ic_x2_s32f_multiply_conjugate_32fc.h
  • [ ] volk_16ic_deinterleave_real_16i.h
  • [ ] volk_8ic_deinterleave_16i_x2.h
  • [ ] volk_8u_x2_encodeframepolar_8u.h
  • [ ] volk_16i_branch_4_state_8.h
  • [ ] volk_8u_x3_encodepolar_8u_x2.h
  • [ ] volk_32fc_x2_s32f_square_dist_scalar_mult_32f.h
  • [ ] volk_16ic_deinterleave_16i_x2.h
  • [ ] volk_32f_8u_polarbutterfly_32f.h
  • [ ] volk_32f_index_max_16u.h
  • [ ] volk_32fc_index_max_16u.h

ast avatar Apr 27 '19 15:04 ast

Excellent! If you don't mind me asking, what's your motivation? It usually helps to prioritize kernels that you care about for your application of interest, unless you're just trying to learn more about NEON/SIMD.

n-west avatar Apr 28 '19 23:04 n-west

We're building an SDR "hat" for the raspberry pi, tujasdr.com. I already have some private implementations so might as well try to get them into volk. The others I will probably try to implement as I go along.

Most are quite easy to do because the avx implementations are often very similar.

ast avatar Apr 29 '19 07:04 ast

@n-west are you the maintainer now?

ast avatar May 02 '19 13:05 ast

Perhaps of interest to @lemire ?

vielmetti avatar May 02 '19 17:05 vielmetti

Yes. Following.

lemire avatar May 02 '19 17:05 lemire

@vielmetti @lemire thanks for the support!

ast avatar May 03 '19 07:05 ast

@ast - Andrej (@noc0lour) is nominally maintaining volk right now. This is really cool work - thanks so much for sharing it and keeping us posted on your progress!

bhilburn avatar May 07 '19 19:05 bhilburn

I'm the only one merging PRs, but every one of @gnuradio/gr-officers is entitled to merge things.

noc0lour avatar May 09 '19 12:05 noc0lour

@ast to get your work merged once you are done, please contact [email protected] to get a CLA in place, thanks (:

noc0lour avatar May 17 '19 15:05 noc0lour

Is this list up to date? I'm planning to add some additional NEON support but I'm not sure what has been completed.

dmiralles2009 avatar Aug 24 '19 21:08 dmiralles2009

I'm not sure if @ast is keeping this updated?

bhilburn avatar Aug 26 '19 15:08 bhilburn

@dmiralles2009 @bhilburn yes it's updated!

I would suggest you start at the bottom. Also I had some trouble with atan2 and the functions based on atan2, that might be interesting for you? Ping me if you have any questions!

ast avatar Aug 26 '19 15:08 ast

@ast @bhilburn thanks for the reply. I will start at the bottom then. That atan2 looks interesting, I will try to help there. Thanks again

dmiralles2009 avatar Aug 26 '19 16:08 dmiralles2009

Great! atan works fine but I lose precision in the division required in atan2... Not sure how to solve that.
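
One possible source of that loss, stated as an assumption since the thread does not say how the division is implemented: ARMv7 NEON has no vector divide instruction, so y/x is commonly computed as y times a reciprocal that starts from a coarse hardware estimate (`vrecpeq_f32`) and is refined with Newton-Raphson steps (`vrecpsq_f32`). Each step roughly squares the relative error, so stopping one step early costs a lot of precision. The sketch below uses a hypothetical function name and a deliberately crude starting estimate to show the effect.

```python
def refine_reciprocal(d, estimate, steps):
    """Refine an estimate of 1/d with Newton-Raphson: x <- x * (2 - d * x)."""
    x = estimate
    for _ in range(steps):
        x = x * (2.0 - d * x)   # relative error is squared on every step
    return x


# Starting from a 10% error on 1/3, watch the error shrink per step.
for steps in range(4):
    approx = refine_reciprocal(3.0, 0.3, steps)
    print(steps, abs(approx - 1.0 / 3.0))
```

If this is the cause, adding one more refinement step, or using a true divide where the target has one, would be the first thing to try.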

ast avatar Aug 26 '19 18:08 ast

@dmiralles2009 Did you get a chance to have a look at it?

ast avatar Aug 30 '19 07:08 ast

Hey @ast, yes, I have been looking at it. I think I will be able to put up some code during the weekend. I will keep you posted :)

dmiralles2009 avatar Aug 30 '19 13:08 dmiralles2009

No stress, just tell me if you need some pointers.

I implemented these today:

  • volk_32fc_s32fc_rotate_up_halfpi_32fc
  • volk_32fc_s32f_rotate_pi_32fc
  • volk_32fc_substract_real_imag_32f
  • volk_32fc_add_real_imag_32f

ast avatar Aug 30 '19 13:08 ast

@ast Haha, oh man, you are on fire!! I like the friendly pressure :). This week is crazy with work but I think the weekend will be good. BTW, since you offered: do you cross-compile for the RPI, or do you compile natively on the board? Any info on how to set up the platform will help me speed things up. Thanks

dmiralles2009 avatar Aug 30 '19 13:08 dmiralles2009

I compile Volk natively on the Raspberry Pi 4. Just install the normal build tools. Remember that you need to use the correct toolchain file (included with Volk): cmake -DCMAKE_TOOLCHAIN_FILE=../cmake/Toolchains/arm_cortex_a72_hardfp_native.cmake ..

I normally mount the device using sshfs so I can edit on my laptop.

I have rootfs on an SSD on USB3, SD-cards are painfully slow.

ast avatar Aug 30 '19 16:08 ast

@dmiralles2009 maybe this is helpful https://wiki.gnuradio.org/index.php/Cross_compile_for_Raspberry_Pi

I wrote this (@ast helped too), but still need to link it into a top-level page in the wiki.

This write-up came from this gist

igorauad avatar Aug 31 '19 01:08 igorauad

Thanks @igorauad and @ast. Got 4 proto-kernels done from the list, I should upload the code later today. Running late for work ....:)

dmiralles2009 avatar Sep 03 '19 12:09 dmiralles2009

Well done!! Fork volk and push your changes so we can review!

ast avatar Sep 03 '19 12:09 ast

Hey @ast, I just got a new RPI4 to develop some more Volk kernels and now I am unable to compile. This is odd because it was working on the RPI3+. Any clues?

pi@raspberrypi:~/Documents/pi-volk/build $ cmake -DCMAKE_TOOLCHAIN_FILE=../cmake/Toolchains/arm_cortex_a72_hardfp_native.cmake ..

-- Architecture is not x86 or x86_64, Overruled arch 3dnow
-- Architecture is not x86 or x86_64, Overruled arch mmx
-- Architecture is not x86 or x86_64, Overruled arch sse
-- Architecture is not x86 or x86_64, Overruled arch sse2
-- Architecture is not x86 or x86_64, Overruled arch sse3
-- Architecture is not x86 or x86_64, Overruled arch ssse3
-- Architecture is not x86 or x86_64, Overruled arch sse4_a
-- Architecture is not x86 or x86_64, Overruled arch sse4_1
-- Architecture is not x86 or x86_64, Overruled arch sse4_2
-- Architecture is not x86 or x86_64, Overruled arch avx
-- Architecture is not x86 or x86_64, Overruled arch avx512f
-- Architecture is not x86 or x86_64, Overruled arch avx512cd
-- Performing Test neon_compile_result
-- Performing Test neon_compile_result - Success
-- Performing Test have_neonv7_result
-- Performing Test have_neonv7_result - Success
-- Performing Test have_neonv8_result
-- Performing Test have_neonv8_result - Failed
-- CPU is armv7, Overruled arch neonv8
-- ORC support not found, Overruled arch orc
-- Available architectures: generic;hardfp;neon;neonv7;norc
-- Available machines: generic;neon;neonv7_hardfp
-- BUILD TYPE = RELEASE
-- Base cflags = -O3 -DNDEBUG -march=armv8-a -mtune=cortex-a72 -mfpu=neon-fp-armv8 -mfloat-abi=hard -Wall
-- BUILD INFO ::: generic ::: GNU ::: -O3 -DNDEBUG -march=armv8-a -mtune=cortex-a72 -mfpu=neon-fp-armv8 -mfloat-abi=hard -Wall 
-- BUILD INFO ::: neon ::: GNU ::: -O3 -DNDEBUG -march=armv8-a -mtune=cortex-a72 -mfpu=neon-fp-armv8 -mfloat-abi=hard -Wall -funsafe-math-optimizations
-- BUILD INFO ::: neonv7_hardfp ::: GNU ::: -O3 -DNDEBUG -march=armv8-a -mtune=cortex-a72 -mfpu=neon-fp-armv8 -mfloat-abi=hard -Wall -funsafe-math-optimizations -mfpu=neon -funsafe-math-optimizations -mfloat-abi=hard
-- Compiler Version: gcc (Raspbian 8.3.0-6+rpi1) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
-- ---- Adding ASM files
-- -- Detected neon architecture; enabling ASM
-- Adding source file: /home/pi/Documents/pi-volk/kernels/volk/asm/neon/volk_16i_max_star_horizontal_16i.s
-- Adding source file: /home/pi/Documents/pi-volk/kernels/volk/asm/neon/volk_32f_s32f_multiply_32f_a_neonasm.s
-- Adding source file: /home/pi/Documents/pi-volk/kernels/volk/asm/neon/volk_32f_x2_add_32f_a_neonasm.s
-- Adding source file: /home/pi/Documents/pi-volk/kernels/volk/asm/neon/volk_32f_x2_add_32f_a_neonpipeline.s
-- Adding source file: /home/pi/Documents/pi-volk/kernels/volk/asm/neon/volk_32f_x2_dot_prod_32f_a_neonasm.s
-- Adding source file: /home/pi/Documents/pi-volk/kernels/volk/asm/neon/volk_32f_x2_dot_prod_32f_a_neonasm_opts.s
-- Adding source file: /home/pi/Documents/pi-volk/kernels/volk/asm/neon/volk_32fc_32f_dot_prod_32fc_a_neonasm.s
-- Adding source file: /home/pi/Documents/pi-volk/kernels/volk/asm/neon/volk_32fc_32f_dot_prod_32fc_a_neonasmvmla.s
-- Adding source file: /home/pi/Documents/pi-volk/kernels/volk/asm/neon/volk_32fc_32f_dot_prod_32fc_a_neonpipeline.s
-- Adding source file: /home/pi/Documents/pi-volk/kernels/volk/asm/neon/volk_32fc_32f_dot_prod_32fc_a_unrollasm.s
-- Adding source file: /home/pi/Documents/pi-volk/kernels/volk/asm/neon/volk_32fc_x2_dot_prod_32fc_a_neonasm.s
-- Adding source file: /home/pi/Documents/pi-volk/kernels/volk/asm/neon/volk_32fc_x2_dot_prod_32fc_a_neonasm_opttests.s
-- Adding source file: /home/pi/Documents/pi-volk/kernels/volk/asm/neon/volk_32fc_x2_multiply_32fc_a_neonasm.s
-- The ASM compiler identification is GNU
-- Found assembler: /usr/bin/gcc
-- c flags: -march=armv8-a -mtune=cortex-a72 -mfpu=neon-fp-armv8 -mfloat-abi=hard -Wall;
-- asm flags:  -mfpu=neon -g
-- c flags: -march=armv8-a -mtune=cortex-a72 -mfpu=neon-fp-armv8 -mfloat-abi=hard -Wall;
-- asm flags:  -mfpu=neon -g -mfpu=neon -g
-- Did not find liborc and orcc, disabling orc support...
-- Loading version 2.0 into constants...
-- Using install prefix: /usr/local
-- Configuring done
-- Generating done
-- Build files have been written to: /home/pi/Documents/pi-volk/build
pi@raspberrypi:~/Documents/pi-volk/build $ make
[  2%] Generating volk_machine_neonv7_hardfp.c
[  4%] Generating ../include/volk/volk.h
[  6%] Generating volk.c
[  8%] Generating ../include/volk/volk_typedefs.h
[ 10%] Generating ../include/volk/volk_cpu.h
[ 12%] Generating volk_cpu.c
[ 14%] Generating ../include/volk/volk_config_fixed.h
[ 16%] Generating volk_machines.h
[ 18%] Generating volk_machines.c
[ 20%] Generating volk_machine_generic.c
[ 22%] Generating volk_machine_neon.c
Scanning dependencies of target volk_obj
[ 25%] Building ASM object lib/CMakeFiles/volk_obj.dir/__/kernels/volk/asm/neon/volk_16i_max_star_horizontal_16i.s.o
[ 27%] Building ASM object lib/CMakeFiles/volk_obj.dir/__/kernels/volk/asm/neon/volk_32f_s32f_multiply_32f_a_neonasm.s.o
[ 29%] Building ASM object lib/CMakeFiles/volk_obj.dir/__/kernels/volk/asm/neon/volk_32f_x2_add_32f_a_neonasm.s.o
[ 31%] Building ASM object lib/CMakeFiles/volk_obj.dir/__/kernels/volk/asm/neon/volk_32f_x2_add_32f_a_neonpipeline.s.o
[ 33%] Building ASM object lib/CMakeFiles/volk_obj.dir/__/kernels/volk/asm/neon/volk_32f_x2_dot_prod_32f_a_neonasm.s.o
[ 35%] Building ASM object lib/CMakeFiles/volk_obj.dir/__/kernels/volk/asm/neon/volk_32f_x2_dot_prod_32f_a_neonasm_opts.s.o
/home/pi/Documents/pi-volk/kernels/volk/asm/neon/volk_32f_x2_dot_prod_32f_a_neonasm_opts.s: Assembler messages:
/home/pi/Documents/pi-volk/kernels/volk/asm/neon/volk_32f_x2_dot_prod_32f_a_neonasm_opts.s:46: Error: selected processor does not support `sbfx r11,r1,#2,#1' in ARM mode
make[2]: *** [lib/CMakeFiles/volk_obj.dir/build.make:1639: lib/CMakeFiles/volk_obj.dir/__/kernels/volk/asm/neon/volk_32f_x2_dot_prod_32f_a_neonasm_opts.s.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:159: lib/CMakeFiles/volk_obj.dir/all] Error 2
make: *** [Makefile:141: all] Error 2
pi@raspberrypi:~/Documents/pi-volk/build $

dmiralles2009 avatar Sep 05 '19 02:09 dmiralles2009

Yeah ok, this is a known bug, I thought I had already submitted a change for that..? My toolchain file looks like this. Notice CMAKE_ASM_FLAGS and -mthumb, which is needed to compile the particular asm file that is failing.

set(CMAKE_CXX_COMPILER g++)
set(CMAKE_C_COMPILER  gcc)
set(CMAKE_CXX_FLAGS "-ffast-math -march=armv8-a -mtune=cortex-a72 -mfpu=neon-fp-armv8 -mfloat-abi=hard" CACHE STRING "" FORCE)
set(CMAKE_C_FLAGS ${CMAKE_CXX_FLAGS} CACHE STRING "" FORCE) #same flags for C sources
set(CMAKE_ASM_FLAGS "${CMAKE_CXX_FLAGS} -mthumb -g" CACHE STRING "" FORCE) #same flags for asm sources

You might also have to remove the line set(ARCH_ASM_FLAGS "-mfpu=neon -g") in lib/CMakeLists.txt so the variable is not overwritten.

ast avatar Sep 06 '19 13:09 ast

Hi @ast, that was a good fix. I do not think the changes are available in volk (master). Maybe they are in your repo. I switched development to my RPI 3b+ board; I was unable to get a 64-bit OS working on the RPI 4. Have you been luckier in this regard? I did complete a couple of proto-kernels:

  1. volk_8ic_s32f_deinterleave_real_32f.h
  2. volk_8ic_x2_multiply_conjugate_16ic.h
  3. volk_8ic_s32f_deinterleave_32f_x2.h
  4. volk_32fc_deinterleave_64f_x2.h
  5. volk_32fc_deinterleave_real_64f.h
  6. volk_32f_64f_add_64f.h

but I have been trying to run those on the RPI 4 to validate the results. I should be pushing them soon.

dmiralles2009 avatar Sep 09 '19 01:09 dmiralles2009

I had Ubuntu aarch64 working very well on the Raspberry Pi 3 using the Raspbian bootloader (to correctly load the device tree). If I remember correctly I also added some users/groups that Raspberry Pi software expects.

Just try to edit the CMake and toolchain flags to add the -mthumb flag. Or you can just rename/remove the assembly file that fails on the Raspberry Pi 4. This is the easiest fix, and you don't need that file anyway.

ast avatar Sep 09 '19 09:09 ast

@ast Is this list up to date? If not, can you get it updated? Thanks for your work getting NEON kernels into Volk!

michaelld avatar Nov 14 '19 01:11 michaelld

It's up to date! All the code is in my linked repo. I have limited time to create PRs so everyone is welcome to look at/improve upon my work and create PRs!

ast avatar Nov 15 '19 21:11 ast

I have also completed the proto-kernels I listed, but I am busy with paying jobs. In a couple of days, I will submit a new PR.

@ast are you OK if I submit your code changes on your behalf? Having those in the mainstream repo will be quite helpful.

Best, Damian

dmiralles2009 avatar Nov 16 '19 17:11 dmiralles2009

@dmiralles2009 yes go ahead! That would be great! Please update the list if you do! Or ping me and I'll update it...

ast avatar Nov 16 '19 19:11 ast