ARM: big performance impact with NEON enabled but -mfloat-abi=softfp
I made a small benchmark using QEMU under Ubuntu 18.04, because I want to measure aarch64 (which I do not have) and because small differences are easier to measure when every run is deterministic. Runtime CPU detection does not work with QEMU semihosting, so I enabled it manually:
@@ -142,11 +142,11 @@ opus_uint32 opus_cpu_capabilities(void)
# endif
}
fclose(cpuinfo);
}
- return flags;
+ return flags | OPUS_CPU_ARM_EDSP_FLAG | OPUS_CPU_ARM_MEDIA_FLAG | OPUS_CPU_ARM_NEON_FLAG;
}
#else
/* The feature registers which can tell us what the processor supports are
* accessible in priveleged modes only, so we can't have a general user-space
* detection method like on x86.*/
Both variants, with and without the NEON-optimized functions, were measured.
First, I noticed that when I build with arm-linux-gnueabi-gcc:
CC=arm-linux-gnueabi-gcc CFLAGS="-O2 -mcpu=cortex-a15 -mfpu=neon -mfloat-abi=softfp" ../configure --host=arm-linux-gnueabi --disable-shared
Runtime floating-point functions are heavily used:
Total executed instructions: 9334798638
__aeabi_dadd 1095866862 11.740% <--- libgcc floating-point support
celt_encode_with_ec 745843070 7.990%
opus_fft_impl 712053747 7.628%
tonality_analysis.isra.0 629640675 6.745%
celt_pitch_xcorr_float_neon 607909464 6.512%
op_pvq_search_c 442786713 4.743%
__aeabi_dmul 431374552 4.621%
compute_gru 342358288 3.668%
__subsf3 327752096 3.511%
These functions are not called directly but from math functions like log10, cos, pow, and sqrt. Note that the double variants are used (for example in silk_process_gains_FLP, silk_noise_shape_analysis_FLP.c). Then I tried --enable-float-approx, and it helps somewhat:
Total executed instructions: 8746436647
__aeabi_dadd 784876796 8.974%
celt_encode_with_ec 749657806 8.571%
opus_fft_impl 712053747 8.141%
tonality_analysis.isra.0 629270601 7.195%
celt_pitch_xcorr_float_neon 607909464 6.950%
op_pvq_search_c 442807791 5.063%
compute_gru 340887999 3.897%
__subsf3 327752096 3.747%
__aeabi_dmul 323995896 3.704%
But the math functions (even the double ones) are still heavily used. Then I replaced all double functions, except in initialization code, with their float equivalents (log10f, cosf, etc.). This helps even more:
Total executed instructions: 8462849809
celt_encode_with_ec 749638691 8.858%
opus_fft_impl 712053747 8.414%
tonality_analysis.isra.0 629040555 7.433%
celt_pitch_xcorr_float_neon 607909464 7.183%
__adddf3 601537247 7.108% <---- still here
op_pvq_search_c 442807791 5.232%
compute_gru 340887999 4.028%
__addsf3 294458581 3.479%
__aeabi_dmul 267024861 3.155%
But there is still a big difference compared with -mfloat-abi=hard:
Total executed instructions: 7311795843
celt_encode_with_ec 797797095 10.911%
opus_fft_impl 712338094 9.742%
tonality_analysis.isra.0 645539978 8.829%
celt_pitch_xcorr_float_neon 609168730 8.331%
op_pvq_search_c 441154759 6.033%
compute_gru 382336287 5.229%
pitch_downsample 248975975 3.405%
haar1 246039992 3.365%
clt_mdct_forward_c 233810038 3.198%
main 165168530 2.259%
silk_biquad_float 163545392 2.237%
find_best_pitch 157104294 2.149%
dual_inner_prod_neon 156394521 2.139%
celt_inner_prod_neon 156178844 2.136%
__lrintf 150633156 2.060%
aarch64 and aarch64 with -flto give further speedups:
Total executed instructions: 6078286128
So:
- The NEON optimizations do not give much speedup. None of the optimizations gives much on its own, but all of them together give 42% (arm-noneon vs aarch64-flto with the double functions fix).
- -mfloat-abi=softfp has a big performance penalty, because floating-point math functions are used during the encoding process itself.
- Of course, the QEMU approach is questionable: it measures instructions, not clock ticks. But NEON vs non-NEON should be measured well by instruction counts (I measured nearly 4x on minimp3). And using the libgcc __aeabi* functions instead of NEON/VFP should definitely hurt performance on real hardware too.
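The instruction totals quoted above can be compared with a small helper like the one below. This is only a sketch under the assumption that executed-instruction counts are an acceptable proxy for cost; share_percent and extra_instructions_percent are hypothetical names, not part of the profiling scripts.

```c
#include <stdint.h>

/* Share of one function in the total, as in the third column of the
 * profile listings (hypothetical helper). */
static double share_percent(uint64_t count, uint64_t total)
{
    return 100.0 * (double)count / (double)total;
}

/* Percentage by which the slower build executes more instructions than
 * the faster one (hypothetical helper). Assumes slower >= faster. */
static double extra_instructions_percent(uint64_t slower, uint64_t faster)
{
    return 100.0 * (double)(slower - faster) / (double)faster;
}
```

With the first softfp total (9334798638) and the -mfloat-abi=hard total (7311795843), extra_instructions_percent gives roughly 27.7, so the softfp build executes about 27.7% more instructions in this measurement.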
Here is all the profiling info: profile.zip
Ehrm... Just for the record, did you also try forcing -mfpu=vfp or -mfpu=neon in the softfp scenario?
@mirh Yes, I tried -mfpu=neon:
CC=arm-linux-gnueabi-gcc CFLAGS="-O2 -mcpu=cortex-a15 -mfpu=neon -mfloat-abi=softfp" ../configure --host=arm-linux-gnueabi --disable-shared
The Android NDK works fine with -mfloat-abi=softfp and -mfpu=neon, but with regular gcc it seems the slow runtime is enabled, which ruins performance. For now I do not know how to work around this.
Isn't the NDK using LLVM now? Clang has different conventions than gcc, AFAIR. Also, gcc doesn't always use NEON everywhere it could, in order to retain IEEE 754 compliance. And maybe softfp isn't generating VFP instructions in that case? (That would explain why the __adddf3 path is hit.)
At the time of testing, both gcc and clang were available in the Android NDK, and both were fine.
Yes, these runtime functions look like software-only emulation (https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html) without any fallback (https://github.com/lattera/glibc/blob/master/soft-fp/addsf3.c).
But -mfloat-abi=softfp should only tell the compiler how to pass parameters to functions (which affects the ABI), not turn NEON instructions into runtime calls (which does not affect any ABI). So it looks like a bug.
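One way to check which floating-point configuration a given translation unit is really compiled for is to look at the compiler's predefined macros. The sketch below assumes the usual GCC/Clang and ACLE macros (__arm__, __ARM_PCS_VFP, __SOFTFP__); arm_float_abi is a hypothetical helper name, and this only describes the code being compiled, not how the libc or libgcc that gets linked in was built.

```c
/* Hypothetical helper: names the float ABI this file is compiled for,
 * based on predefined compiler macros. */
static const char *arm_float_abi(void)
{
#if !defined(__arm__)
    return "not 32-bit ARM";
#elif defined(__ARM_PCS_VFP)
    return "hard";    /* FP arguments are passed in VFP registers */
#elif defined(__SOFTFP__)
    return "soft";    /* no FP instructions, library calls for FP math */
#else
    return "softfp";  /* FP instructions allowed, soft-float calling convention */
#endif
}
```

Printing the result from a test program built with the same CC and CFLAGS as the library would show whether the compiler itself treats the build as softfp or as fully soft.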
I do not remember now; I think I tested -O3, which should include unsafe math. It may be worth re-testing.
I tested -O3 which should include unsafe math
-O3 doesn't automatically include unsafe math, you have to manually specify it, e.g. with -ffast-math.
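A quick way to confirm this for a particular set of CFLAGS is the __FAST_MATH__ macro, which GCC documents as being defined when -ffast-math is in effect. The helper name fast_math_enabled below is hypothetical; this is just a sketch of the check.

```c
/* Hypothetical helper: returns 1 if this file was compiled with
 * -ffast-math (or an option that implies it), 0 otherwise. */
static int fast_math_enabled(void)
{
#ifdef __FAST_MATH__
    return 1;
#else
    return 0;
#endif
}
```

Built with plain -O3 this should return 0, and with -O3 -ffast-math it should return 1.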
Re-tested with CC=arm-linux-gnueabi-gcc CFLAGS="-O3 -mcpu=cortex-a15 -mfpu=neon -mfloat-abi=softfp -ffast-math" ../configure --host=arm-linux-gnueabi --disable-shared --enable-float-approx, gcc-9 and latest git:
Total executed instructions: 16372422266
silk_NSQ_del_dec_c 4974089295 30.381%
silk_warped_autocorrelation_FLP 1405601164 8.585%
__aeabi_fadd 956778274 5.844%
opus_fft_impl 615728072 3.761%
silk_inner_product_FLP 579407231 3.539%
silk_NLSF_del_dec_quant 511846946 3.126%
tonality_analysis.isra.0 470646965 2.875%
lrintf32 412739270 2.521%
celt_encode_with_ec 394644592 2.410%
op_pvq_search_c 353554551 2.159%
silk_resampler_private_down_FIR 348786294 2.130%
celt_pitch_xcorr_float_neon 289573663 1.769%
compute_gru 281796348 1.721%
clt_mdct_forward_c 243016580 1.484%
opus_encode_native 236672971 1.446%
memcpy 221032868 1.350%
__aeabi_dsub 217997193 1.331%
silk_noise_shape_quantizer_short_prediction_neon 217803520 1.330%
silk_burg_modified_FLP 208238985 1.272%
__aeabi_dmul 190411584 1.163%
silk_A2NLSF 181044753 1.106%
downmix_and_resample 154881928 0.946%
silk_LPC_analysis_filter_FLP 143910090 0.879%
silk_LPC_inverse_pred_gain_neon 116848424 0.714%
haar1 106113071 0.648%
celt_inner_prod_neon 104608845 0.639%
__subsf3 103312867 0.631%
silk_schur_FLP 99005093 0.605%
silk_resampler_private_AR2 95528863 0.583%
compute_dense 84251784 0.515%
silk_NLSF_encode 81895880 0.500%
The latest code is slower; silk_NSQ_del_dec_c is now at the top. But the __aeabi math functions are still called, with and without -ffast-math.
Can you do a run with fixed point enabled?
Yep, here it is for CC=arm-linux-gnueabi-gcc CFLAGS="-O3 -mcpu=cortex-a15 -mfpu=neon -mfloat-abi=softfp -ffast-math" ../configure --host=arm-linux-gnueabi --disable-shared --enable-fixed-point:
Total executed instructions: 13012163821
silk_NSQ_del_dec_neon 2982368100 22.920%
opus_fft_impl 923474548 7.097%
silk_LPC_analysis_filter 861135948 6.618%
silk_warped_autocorrelation_FIX_neon 635535450 4.884%
tonality_analysis.isra.0 532551793 4.093%
op_pvq_search_c 506238980 3.891%
silk_NLSF_del_dec_quant 502276622 3.860%
celt_encode_with_ec 483959329 3.719%
silk_biquad_alt_stride1 346087880 2.660%
__aeabi_frsub 322691329 2.480%
silk_burg_modified_c 309520962 2.379%
clt_mdct_forward_c 308241366 2.369%
compute_gru 278675724 2.142%
xcorr_kernel_neon 231527283 1.779%
silk_resampler_private_down_FIR 196245915 1.508%
pitch_search 169016229 1.299%
silk_resampler_down2_hp 163515380 1.257%
haar1 133898209 1.029%
lrintf 129599952 0.996%
silk_A2NLSF 121118929 0.931%
silk_schur64 119365290 0.917%
pitch_downsample 103556926 0.796%
silk_LPC_inverse_pred_gain_neon 94522636 0.726%
encode_pulses 88056188 0.677%
find_best_pitch 82874813 0.637%
compute_dense 82862079 0.637%
silk_sum_sqr_shift 79678600 0.612%
compute_band_energies 77882956 0.599%
__aeabi_dsub 74824474 0.575%
exp_rotation1.constprop.0 73685266 0.566%
normalise_bands 72065118 0.554%
opus_encode_native 71532445 0.550%
main 69646414 0.535%
celt_preemphasis 67835676 0.521%
silk_NLSF_encode 65654040 0.505%
Much better.
And I guess you can use --disable-intrinsics to see whether the performance gain comes from more NEON optimizations in fixed point or whether the fixed-point implementation is simply more efficient on this architecture. (At least on x86 there is more SIMD for fixed point than for float.)
Here it is for CC=arm-linux-gnueabi-gcc CFLAGS="-O3 -mcpu=cortex-a15 -mfpu=neon -mfloat-abi=softfp -ffast-math" ../configure --host=arm-linux-gnueabi --disable-shared --enable-fixed-point --disable-intrinsics:
Total executed instructions: 15237908565
silk_NSQ_del_dec_c 3193163430 20.955%
silk_warped_autocorrelation_FIX_c 2142968714 14.063%
opus_fft_impl 923474548 6.060%
silk_LPC_analysis_filter 861135948 5.651%
tonality_analysis.isra.0 532551793 3.495%
op_pvq_search_c 506238980 3.322%
silk_NLSF_del_dec_quant 502276622 3.296%
celt_encode_with_ec 483959329 3.176%
silk_biquad_alt_stride1 346087880 2.271%
remove_doubling 310240195 2.036%
silk_burg_modified_c 309520962 2.031%
clt_mdct_forward_c 308241366 2.023%
__addsf3 289224637 1.898%
compute_gru 278675724 1.829%
xcorr_kernel_neon 231527283 1.519%
memcpy 202760657 1.331%
silk_resampler_private_down_FIR 196245915 1.288%
pitch_search 169016229 1.109%
silk_resampler_down2_hp 163515380 1.073%
silk_LPC_inverse_pred_gain_c 162644313 1.067%
haar1 133898209 0.879%
lrintf 129599952 0.851%
silk_A2NLSF 121118929 0.795%
silk_schur64 119365290 0.783%
pitch_downsample 103556926 0.680%
encode_pulses 88056188 0.578%
find_best_pitch 82874813 0.544%
compute_dense 82862079 0.544%
silk_sum_sqr_shift 79678600 0.523%
compute_band_energies 77882956 0.511%
Then run floating point with intrinsics disabled. Are you also running complexity 0? Does the test include both the encoder and the decoder?
I've updated the scripts and results:
opus_prof.zip
The default complexity (10) is used, and only the encoder is run: opus_demo -e voip 48000 1 MEANDR_PHASE0.raw out.opus
Here are the floating-point results with and without intrinsics/softfp:
CC=arm-linux-gnueabi-gcc CFLAGS="-O3 -mcpu=cortex-a15 -mfpu=neon -mfloat-abi=softfp -ffast-math" \
../configure --host=arm-linux-gnueabi --disable-shared --enable-float-approx:
Total executed instructions: 16372422276
silk_NSQ_del_dec_c 4974089295 30.381%
silk_warped_autocorrelation_FLP 1405601164 8.585%
__aeabi_fadd 956778274 5.844%
opus_fft_impl 615728072 3.761%
silk_inner_product_FLP 579407231 3.539%
silk_NLSF_del_dec_quant 511846946 3.126%
tonality_analysis.isra.0 470646965 2.875%
lrintf32 412739270 2.521%
celt_encode_with_ec 394644592 2.410%
op_pvq_search_c 353554551 2.159%
silk_resampler_private_down_FIR 348786294 2.130%
celt_pitch_xcorr_float_neon 289573663 1.769%
compute_gru 281796348 1.721%
clt_mdct_forward_c 243016580 1.484%
opus_encode_native 236672971 1.446%
memcpy 221032868 1.350%
__aeabi_dsub 217997193 1.331%
silk_noise_shape_quantizer_short_prediction_neon 217803520 1.330%
silk_burg_modified_FLP 208238985 1.272%
__aeabi_dmul 190411584 1.163%
silk_A2NLSF 181044753 1.106%
downmix_and_resample 154881928 0.946%
silk_LPC_analysis_filter_FLP 143910090 0.879%
silk_LPC_inverse_pred_gain_neon 116848424 0.714%
haar1 106113071 0.648%
celt_inner_prod_neon 104608845 0.639%
__subsf3 103312867 0.631%
silk_schur_FLP 99005093 0.605%
silk_resampler_private_AR2 95528863 0.583%
compute_dense 84251784 0.515%
silk_NLSF_encode 81895880 0.500%
CC=arm-linux-gnueabi-gcc CFLAGS="-O3 -mcpu=cortex-a15 -mfpu=neon -mfloat-abi=softfp -ffast-math" \
../configure --host=arm-linux-gnueabi --disable-shared --enable-float-approx --disable-intrinsics:
Total executed instructions: 17666024223
silk_NSQ_del_dec_c 6128708172 34.692%
silk_warped_autocorrelation_FLP 1405601164 7.957%
__subsf3 1060091141 6.001%
celt_pitch_xcorr_c 635004022 3.594%
opus_fft_impl 615728072 3.485%
silk_inner_product_FLP 579407231 3.280%
silk_NLSF_del_dec_quant 511846946 2.897%
tonality_analysis.isra.0 470646965 2.664%
lrintf 412739270 2.336%
celt_encode_with_ec 394440923 2.233%
op_pvq_search_c 353554551 2.001%
silk_resampler_private_down_FIR 348786294 1.974%
compute_gru 281796348 1.595%
opus_encode_native 246681190 1.396%
clt_mdct_forward_c 243016580 1.376%
memcpy 221032868 1.251%
__aeabi_dadd 211227705 1.196%
silk_burg_modified_FLP 208238985 1.179%
__aeabi_dmul 190411584 1.078%
silk_LPC_inverse_pred_gain_c 189218048 1.071%
silk_A2NLSF 181044753 1.025%
downmix_and_resample 154881928 0.877%
silk_LPC_analysis_filter_FLP 143910090 0.815%
haar1 106113071 0.601%
silk_schur_FLP 99005093 0.560%
remove_doubling 97133184 0.550%
silk_resampler_private_AR2 95528863 0.541%
CC=arm-linux-gnueabihf-gcc CFLAGS="-O3 -mcpu=cortex-a15 -mfpu=neon -mfloat-abi=hard -ffast-math" \
../configure --host=arm-linux-gnueabihf --disable-shared --enable-float-approx:
Total executed instructions: 15245827537
silk_NSQ_del_dec_c 4999249968 32.791%
silk_warped_autocorrelation_FLP 1405798768 9.221%
opus_fft_impl 636159440 4.173%
silk_inner_product_FLP 579407231 3.800%
silk_NLSF_del_dec_quant 521500930 3.421%
tonality_analysis.isra.0 486599604 3.192%
__lrintf 469151111 3.077%
celt_encode_with_ec 405671331 2.661%
op_pvq_search_c 356456242 2.338%
silk_resampler_private_down_FIR 345330522 2.265%
compute_gru 330536094 2.168%
__memcpy_neon 320319026 2.101%
celt_pitch_xcorr_float_neon 289558675 1.899%
clt_mdct_forward_c 243249764 1.596%
opus_encode_native 242917018 1.593%
silk_noise_shape_quantizer_short_prediction_neon 224829440 1.475%
silk_burg_modified_FLP 205952949 1.351%
silk_A2NLSF 186933662 1.226%
downmix_and_resample 154981968 1.017%
silk_LPC_analysis_filter_FLP 141897774 0.931%
silk_LPC_inverse_pred_gain_neon 127314024 0.835%
haar1 105766803 0.694%
celt_inner_prod_neon 103437959 0.678%
silk_schur_FLP 99114873 0.650%
encode_pulses 93921441 0.616%
compute_dense 93150057 0.611%
silk_resampler_private_AR2 90204225 0.592%
silk_NLSF_encode 81824523 0.537%
pitch_downsample 80682054 0.529%
silk_ana_filt_bank_1 79275930 0.520%
CC=arm-linux-gnueabihf-gcc CFLAGS="-O3 -mcpu=cortex-a15 -mfpu=neon -mfloat-abi=hard -ffast-math" \
../configure --host=arm-linux-gnueabihf --disable-shared --enable-float-approx --disable-intrinsics:
Total executed instructions: 16538673928
silk_NSQ_del_dec_c 6146607935 37.165%
silk_warped_autocorrelation_FLP 1405798768 8.500%
opus_fft_impl 636159440 3.846%
celt_pitch_xcorr_c 635385974 3.842%
silk_inner_product_FLP 579407231 3.503%
silk_NLSF_del_dec_quant 521500930 3.153%
tonality_analysis.isra.0 486589602 2.942%
lrintf32 469151111 2.837%
celt_encode_with_ec 405671331 2.453%
op_pvq_search_c 356456242 2.155%
silk_resampler_private_down_FIR 345330522 2.088%
compute_gru 330536094 1.999%
__memcpy_neon 320319026 1.937%
opus_encode_native 252848603 1.529%
clt_mdct_forward_c 243249764 1.471%
silk_LPC_inverse_pred_gain_c 210567872 1.273%
silk_burg_modified_FLP 205952949 1.245%
silk_A2NLSF 186933662 1.130%
downmix_and_resample 154981968 0.937%
silk_LPC_analysis_filter_FLP 141897774 0.858%
haar1 105766803 0.640%
remove_doubling 100027556 0.605%
silk_schur_FLP 99114873 0.599%
encode_pulses 93921441 0.568%
compute_dense 93150057 0.563%
silk_resampler_private_AR2 90204225 0.545%
So the softfp impact is now greatly reduced. But maybe the complexity should be reduced to get back to the 7-9B instructions of the earlier runs instead of 15-17B? What are the recommended settings?