optimize NEON functions required for libjpeg-turbo
@kleisauke is trying to get libjpeg-turbo working on WASM using SIMDe. Here is a list of functions which aren't implemented yet:
- [x] vaddhn_s32
- [x] vld1q_dup_s16
- [x] vld1q_lane_s16
- [x] vld2_u8
- [x] vmlal_lane_s16
- [x] vmlal_lane_u16
- [x] vmlsl_lane_s16
- [x] vmlsl_lane_u16
- [x] vmull_lane_s16
- [x] vmull_lane_u16
- [x] vqdmulh_lane_s16
- [x] vqdmulhq_lane_s16
- [x] vqrdmulhq_lane_s16
- [x] vqrshrn_n_s16
- [x] vqshluq_n_s16
- [x] vqshrn_n_s16
- [x] vrshrn_n_s32
- [x] vrshrn_n_u16
- [x] vrshrn_n_u32
- [x] vshll_n_s16
- [x] vshrn_n_s32
- [x] vshrn_n_u16
- [x] vshrn_n_u32
- [x] vsriq_n_u16
- [x] vst2_lane_u16
- [x] vst2q_u8
- [x] vst3_lane_u8
- [x] vst4_lane_u16
- [x] vst4_lane_u8
- [x] vsubhn_s32
Here's a list of completed functions with their corresponding commits:
- [X] vaddhn_s32 (commit https://github.com/simd-everywhere/simde/commit/e9ee0666356a60f28f5be248cf4de37be24e4a95)
- [X] vld1q_dup_s16 (commit https://github.com/simd-everywhere/simde/commit/650d5310baec682d9c5545d668554b8791b93a96)
- [X] vld1q_lane_s16 (commit https://github.com/simd-everywhere/simde/commit/9051a51c20c077f9a76be1ddf3c217e9bb9ad845)
- [X] vld2_u8 (commit https://github.com/simd-everywhere/simde/commit/85d2ed2449992c5897bb9c01977fc7f060bbcd7c)
- [X] vmlal_lane_s16 (commit https://github.com/simd-everywhere/simde/commit/82e36eda0774c7384f19edf2220374ca23eadeca)
- [X] vmlal_lane_u16 (commit https://github.com/simd-everywhere/simde/commit/82e36eda0774c7384f19edf2220374ca23eadeca)
- [X] vmlsl_lane_s16 (commit https://github.com/simd-everywhere/simde/commit/de78ae9f1562dfa0c1922c8dff3a5974143acb10)
- [X] vmlsl_lane_u16 (commit https://github.com/simd-everywhere/simde/commit/de78ae9f1562dfa0c1922c8dff3a5974143acb10)
- [X] vmull_lane_s16 (commit https://github.com/simd-everywhere/simde/commit/4dd488d3dc5da2e4e89b5489935df6c1c415d9de)
- [X] vmull_lane_u16 (commit https://github.com/simd-everywhere/simde/commit/4dd488d3dc5da2e4e89b5489935df6c1c415d9de)
- [X] vqdmulh_lane_s16 (commit https://github.com/simd-everywhere/simde/commit/79dc1eec5f3c6bd57a02d29636180da22c62a228)
- [X] vqdmulhq_lane_s16 (commit https://github.com/simd-everywhere/simde/commit/79dc1eec5f3c6bd57a02d29636180da22c62a228)
- [X] vqrdmulhq_lane_s16 (commit https://github.com/simd-everywhere/simde/commit/dc2ea7500c21f8167a2c8191a1556cd3bb819ab7)
- [X] vqrshrn_n_s16 (commit https://github.com/simd-everywhere/simde/commit/2595b3e46627e58356e4a7c1d61ccbeadd7edb58)
- [X] vqshluq_n_s16 (commit https://github.com/simd-everywhere/simde/commit/77af9f12e93eacd5cf107faaf7c244d46c5c167f)
- [X] vqshrn_n_s16 (commit https://github.com/simd-everywhere/simde/commit/d9260dc441b487f80db4b1b58dd49cee5ba1cfa1)
- [X] vrshrn_n_s32 (commit https://github.com/simd-everywhere/simde/commit/a70371126cf1fd2f31dfd50e487b5b2c21a742d2)
- [X] vrshrn_n_u16 (commit https://github.com/simd-everywhere/simde/commit/a70371126cf1fd2f31dfd50e487b5b2c21a742d2)
- [X] vrshrn_n_u32 (commit https://github.com/simd-everywhere/simde/commit/a70371126cf1fd2f31dfd50e487b5b2c21a742d2)
- [X] vshll_n_s16 (commit https://github.com/simd-everywhere/simde/commit/98ac861a48e1ed2a440e55465413fc91e5cabee0)
- [X] vshrn_n_s32 (commit https://github.com/simd-everywhere/simde/commit/8810cdd6445dd2b04df3b5033ac6b5d0d8d68f2d)
- [X] vshrn_n_u16 (commit https://github.com/simd-everywhere/simde/commit/8810cdd6445dd2b04df3b5033ac6b5d0d8d68f2d)
- [X] vshrn_n_u32 (commit https://github.com/simd-everywhere/simde/commit/8810cdd6445dd2b04df3b5033ac6b5d0d8d68f2d)
- [X] vsriq_n_u16 (commit https://github.com/simd-everywhere/simde/commit/aa832e1ec9146cdede6b4df2146fa0b5138ec41c)
- [X] vst2_lane_u16 (commit https://github.com/simd-everywhere/simde/commit/8ee1eb412fbed783d0cc4ef80c1b8a75b1208baa)
- [X] vst2q_u8 (commit https://github.com/simd-everywhere/simde/commit/1e38dcbc63d748b303055f086118ca2bd6cf84ac)
- [X] vst3_lane_u8 (commit https://github.com/simd-everywhere/simde/commit/ae308b20867bc7d0f5761f816d448a9f48ad5ad1)
- [X] vst4_lane_u16 (commit https://github.com/simd-everywhere/simde/commit/b231820afcdaf8858b726a04476410c363090ed0)
- [X] vst4_lane_u8 (commit https://github.com/simd-everywhere/simde/commit/b231820afcdaf8858b726a04476410c363090ed0)
- [X] vsubhn_s32 (commit https://github.com/simd-everywhere/simde/commit/ca6275412df3471690285f4758e2031b93537df7)
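For readers unfamiliar with the naming, here is a minimal scalar sketch of what two of the functions in the list compute, `vaddhn_s32` (add, keep the high half) and `vrshrn_n_s32` (rounding shift right, then narrow). The `ref_` names and the `i32x4`/`i16x4` struct types are hypothetical stand-ins made up for this illustration; this is not the code from the commits above, only the lane-wise behaviour a portable fallback would need to reproduce.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4-lane types standing in for int32x4_t / int16x4_t. */
typedef struct { int32_t v[4]; } i32x4;
typedef struct { int16_t v[4]; } i16x4;

/* vaddhn_s32: add lane-wise, keep the high 16 bits of each 32-bit sum. */
static i16x4 ref_vaddhn_s32(i32x4 a, i32x4 b) {
  i16x4 r;
  for (int i = 0; i < 4; i++) {
    /* Add as unsigned so overflow wraps instead of being undefined. */
    uint32_t sum = (uint32_t)a.v[i] + (uint32_t)b.v[i];
    /* Assumes the usual two's complement wrap when converting to int16_t. */
    r.v[i] = (int16_t)(uint16_t)(sum >> 16);
  }
  return r;
}

/* vrshrn_n_s32: add the rounding constant 2^(n-1), shift right by n,
 * then keep the low 16 bits of each lane (no saturation). */
static i16x4 ref_vrshrn_n_s32(i32x4 a, int n) {
  assert(n >= 1 && n <= 16); /* valid immediate range for this intrinsic */
  i16x4 r;
  for (int i = 0; i < 4; i++) {
    /* Widen to 64 bits so adding the rounding constant cannot overflow. */
    int64_t t = (int64_t)a.v[i] + ((int64_t)1 << (n - 1));
    t >>= n; /* assumes >> on a negative value is an arithmetic shift */
    r.v[i] = (int16_t)(uint16_t)t;
  }
  return r;
}
```

A scalar reference like this is mainly useful for checking a vectorised implementation against; it says nothing about how fast the WASM code path ends up being.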
Thanks for the reminder! I added some more earlier today, and we'll try to get that last one done soon; I think @Glitch18 is planning to take care of it.
Yes. Will be pushing the commit soon!
BTW, once this is done I'd be very interested in any performance data which could point us to something we might be able to optimize in SIMDe. See https://github.com/simd-everywhere/simde/wiki/Performance-Tuning#finding-performance-problems
Great, thanks! I'll re-run the benchmark in test/bench within Chrome/Firefox once this is done. For Node.js, this requires an update of V8 to 9.1 (https://github.com/nodejs/node/pull/38273) to match the renumbered/finalized WASM SIMD opcodes.
vqshluq_n_s16 was implemented with commit https://github.com/simd-everywhere/simde/commit/77af9f12e93eacd5cf107faaf7c244d46c5c167f, which makes it possible to compile libjpeg-turbo for WebAssembly with SIMD support (by reusing the Arm Neon intrinsics, see commit https://github.com/kleisauke/wasm-vips/commit/acd4c8128bcb195fed8724e82c41f93014aea30d). :tada:
I'll re-run the benchmarks and post the results soon; feel free to close this issue.
First set of benchmarking/profiling results can be found here: test/bench/README-simde.md.
It seems that reusing the Arm Neon intrinsics for WASM made it ~3.5x slower than its C implementation (on this benchmark). The functions with the highest tick counts (>= 10), ordered from high to low, are:
- simde_vshlq_u16
- simde_vld3_u8
- simde_vld4q_s16
- simde_vclzq_s16
- simde_vld1q_lane_s16
- simde_vtrn1q_s32
- simde_vtrn2q_s32
- simde_vtrn1q_s16
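As an illustration of why the top entry may be costly to emulate, here is a scalar sketch of the lane-wise semantics of `vshlq_u16` as documented for Arm: each lane has its own signed shift count (taken from the low byte of the second operand), and a negative count shifts right. WASM SIMD shifts take a single scalar count for all lanes, so there is no one-instruction equivalent. Whether that is the actual cause of the tick counts above is an assumption, not something measured here; the name `ref_vshlq_u16` and the array-based signature are hypothetical and not SIMDe's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for vshlq_u16: r[i] = a[i] shifted by a per-lane count. */
static void ref_vshlq_u16(const uint16_t a[8], const int16_t b[8],
                          uint16_t r[8]) {
  assert(a != NULL && b != NULL && r != NULL);
  for (int i = 0; i < 8; i++) {
    /* Only the low byte of each count is used, interpreted as signed. */
    int low = b[i] & 0xFF;
    int shift = (low < 128) ? low : low - 256;

    if (shift >= 16 || shift <= -16) {
      r[i] = 0; /* every bit is shifted out */
    } else if (shift >= 0) {
      /* Left shift; bits above bit 15 are discarded. */
      r[i] = (uint16_t)((uint32_t)a[i] << shift);
    } else {
      /* Negative count: logical right shift by its magnitude. */
      r[i] = (uint16_t)(a[i] >> -shift);
    }
  }
}
```

The per-lane branch on the sign of the count is the part a vectorised fallback has to express with selects or by computing both shift directions, which gives a rough idea of the extra work involved.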
Note that libjpeg-turbo is considering a whole new SIMD implementation just for WASM, so please don't spend too much time on this.