optimize NEON functions required for libjpeg-turbo

Open nemequ opened this issue 3 years ago • 7 comments

@kleisauke is trying to get libjpeg-turbo working on WASM using SIMDe. Here is a list of functions which aren't implemented yet:

  • [x] vaddhn_s32
  • [x] vld1q_dup_s16
  • [x] vld1q_lane_s16
  • [x] vld2_u8
  • [x] vmlal_lane_s16
  • [x] vmlal_lane_u16
  • [x] vmlsl_lane_s16
  • [x] vmlsl_lane_u16
  • [x] vmull_lane_s16
  • [x] vmull_lane_u16
  • [x] vqdmulh_lane_s16
  • [x] vqdmulhq_lane_s16
  • [x] vqrdmulhq_lane_s16
  • [x] vqrshrn_n_s16
  • [x] vqshluq_n_s16
  • [x] vqshrn_n_s16
  • [x] vrshrn_n_s32
  • [x] vrshrn_n_u16
  • [x] vrshrn_n_u32
  • [x] vshll_n_s16
  • [x] vshrn_n_s32
  • [x] vshrn_n_u16
  • [x] vshrn_n_u32
  • [x] vsriq_n_u16
  • [x] vst2_lane_u16
  • [x] vst2q_u8
  • [x] vst3_lane_u8
  • [x] vst4_lane_u16
  • [x] vst4_lane_u8
  • [x] vsubhn_s32
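
For readers unfamiliar with the naming, here is a rough scalar sketch of what two of the intrinsics in the list compute, per lane. The `ref_` function names are hypothetical and this is not SIMDe's actual code; it only models the documented Neon behaviour: `vaddhn_s32` adds with wraparound and keeps the high half of each sum, and `vrshrn_n_s32` does a rounding right shift and then narrows by truncation. It assumes the compiler uses an arithmetic right shift for negative values and wrapping conversions to `int16_t`, as common compilers do.

```c
#include <stdint.h>

/* Hypothetical scalar stand-ins; the names are illustrative, not SIMDe's. */

/* vaddhn_s32: add lanes with wraparound, keep the high 16 bits of each sum. */
static void ref_vaddhn_s32(const int32_t a[4], const int32_t b[4], int16_t r[4]) {
  for (int i = 0; i < 4; i++) {
    /* Add as unsigned so overflow wraps instead of being undefined. */
    uint32_t sum = (uint32_t)a[i] + (uint32_t)b[i];
    r[i] = (int16_t)(uint16_t)(sum >> 16);
  }
}

/* vrshrn_n_s32: rounding shift right by n (1..16), then narrow by truncation. */
static void ref_vrshrn_n_s32(const int32_t a[4], int n, int16_t r[4]) {
  for (int i = 0; i < 4; i++) {
    /* Widen first so adding the rounding constant cannot overflow.
       Assumes >> on a negative int64_t is an arithmetic shift. */
    int64_t rounded = ((int64_t)a[i] + ((int64_t)1 << (n - 1))) >> n;
    r[i] = (int16_t)(uint16_t)rounded;
  }
}
```

A loop like this is roughly what a portable fallback has to do when the target has no matching instruction, which is why missing or unoptimized intrinsics matter for performance.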

nemequ avatar Nov 23 '20 00:11 nemequ

Here's a list of completed functions with their corresponding commits:

  • [X] vaddhn_s32 (commit https://github.com/simd-everywhere/simde/commit/e9ee0666356a60f28f5be248cf4de37be24e4a95)
  • [X] vld1q_dup_s16 (commit https://github.com/simd-everywhere/simde/commit/650d5310baec682d9c5545d668554b8791b93a96)
  • [X] vld1q_lane_s16 (commit https://github.com/simd-everywhere/simde/commit/9051a51c20c077f9a76be1ddf3c217e9bb9ad845)
  • [X] vld2_u8 (commit https://github.com/simd-everywhere/simde/commit/85d2ed2449992c5897bb9c01977fc7f060bbcd7c)
  • [X] vmlal_lane_s16 (commit https://github.com/simd-everywhere/simde/commit/82e36eda0774c7384f19edf2220374ca23eadeca)
  • [X] vmlal_lane_u16 (commit https://github.com/simd-everywhere/simde/commit/82e36eda0774c7384f19edf2220374ca23eadeca)
  • [X] vmlsl_lane_s16 (commit https://github.com/simd-everywhere/simde/commit/de78ae9f1562dfa0c1922c8dff3a5974143acb10)
  • [X] vmlsl_lane_u16 (commit https://github.com/simd-everywhere/simde/commit/de78ae9f1562dfa0c1922c8dff3a5974143acb10)
  • [X] vmull_lane_s16 (commit https://github.com/simd-everywhere/simde/commit/4dd488d3dc5da2e4e89b5489935df6c1c415d9de)
  • [X] vmull_lane_u16 (commit https://github.com/simd-everywhere/simde/commit/4dd488d3dc5da2e4e89b5489935df6c1c415d9de)
  • [X] vqdmulh_lane_s16 (commit https://github.com/simd-everywhere/simde/commit/79dc1eec5f3c6bd57a02d29636180da22c62a228)
  • [X] vqdmulhq_lane_s16 (commit https://github.com/simd-everywhere/simde/commit/79dc1eec5f3c6bd57a02d29636180da22c62a228)
  • [X] vqrdmulhq_lane_s16 (commit https://github.com/simd-everywhere/simde/commit/dc2ea7500c21f8167a2c8191a1556cd3bb819ab7)
  • [X] vqrshrn_n_s16 (commit https://github.com/simd-everywhere/simde/commit/2595b3e46627e58356e4a7c1d61ccbeadd7edb58)
  • [X] vqshluq_n_s16 (commit https://github.com/simd-everywhere/simde/commit/77af9f12e93eacd5cf107faaf7c244d46c5c167f)
  • [X] vqshrn_n_s16 (commit https://github.com/simd-everywhere/simde/commit/d9260dc441b487f80db4b1b58dd49cee5ba1cfa1)
  • [X] vrshrn_n_s32 (commit https://github.com/simd-everywhere/simde/commit/a70371126cf1fd2f31dfd50e487b5b2c21a742d2)
  • [X] vrshrn_n_u16 (commit https://github.com/simd-everywhere/simde/commit/a70371126cf1fd2f31dfd50e487b5b2c21a742d2)
  • [X] vrshrn_n_u32 (commit https://github.com/simd-everywhere/simde/commit/a70371126cf1fd2f31dfd50e487b5b2c21a742d2)
  • [X] vshll_n_s16 (commit https://github.com/simd-everywhere/simde/commit/98ac861a48e1ed2a440e55465413fc91e5cabee0)
  • [X] vshrn_n_s32 (commit https://github.com/simd-everywhere/simde/commit/8810cdd6445dd2b04df3b5033ac6b5d0d8d68f2d)
  • [X] vshrn_n_u16 (commit https://github.com/simd-everywhere/simde/commit/8810cdd6445dd2b04df3b5033ac6b5d0d8d68f2d)
  • [X] vshrn_n_u32 (commit https://github.com/simd-everywhere/simde/commit/8810cdd6445dd2b04df3b5033ac6b5d0d8d68f2d)
  • [X] vsriq_n_u16 (commit https://github.com/simd-everywhere/simde/commit/aa832e1ec9146cdede6b4df2146fa0b5138ec41c)
  • [X] vst2_lane_u16 (commit https://github.com/simd-everywhere/simde/commit/8ee1eb412fbed783d0cc4ef80c1b8a75b1208baa)
  • [X] vst2q_u8 (commit https://github.com/simd-everywhere/simde/commit/1e38dcbc63d748b303055f086118ca2bd6cf84ac)
  • [X] vst3_lane_u8 (commit https://github.com/simd-everywhere/simde/commit/ae308b20867bc7d0f5761f816d448a9f48ad5ad1)
  • [X] vst4_lane_u16 (commit https://github.com/simd-everywhere/simde/commit/b231820afcdaf8858b726a04476410c363090ed0)
  • [X] vst4_lane_u8 (commit https://github.com/simd-everywhere/simde/commit/b231820afcdaf8858b726a04476410c363090ed0)
  • [X] vsubhn_s32 (commit https://github.com/simd-everywhere/simde/commit/ca6275412df3471690285f4758e2031b93537df7)

kleisauke avatar Jun 03 '21 09:06 kleisauke

Thanks for the reminder! I added some more earlier today, and we'll try to get that last one done soon; I think @Glitch18 is planning to take care of it.

nemequ avatar Jun 03 '21 19:06 nemequ

Yes. Will be pushing the commit soon!

Glitch18 avatar Jun 03 '21 19:06 Glitch18

BTW, once this is done I'd be very interested in any performance data which could point us to something we might be able to optimize in SIMDe. See https://github.com/simd-everywhere/simde/wiki/Performance-Tuning#finding-performance-problems

nemequ avatar Jun 03 '21 19:06 nemequ

Great, thanks! I'll re-run the benchmark in test/bench within Chrome/Firefox once this is done. For Node.js, this requires an update of V8 to 9.1 (https://github.com/nodejs/node/pull/38273) to match the renumbered/finalized WASM SIMD opcodes.

kleisauke avatar Jun 03 '21 20:06 kleisauke

vqshluq_n_s16 was implemented with commit https://github.com/simd-everywhere/simde/commit/77af9f12e93eacd5cf107faaf7c244d46c5c167f, which makes it possible to compile libjpeg-turbo for WebAssembly with SIMD support (by reusing the Arm Neon intrinsics, see commit https://github.com/kleisauke/wasm-vips/commit/acd4c8128bcb195fed8724e82c41f93014aea30d). :tada:

I'll re-run the benchmarks and post the results soon. Feel free to close this issue.

kleisauke avatar Jun 11 '21 16:06 kleisauke

First set of benchmarking/profiling results can be found here: test/bench/README-simde.md.

It seems that reusing the Arm Neon intrinsics for WASM made it ~3.5x slower than the plain C implementation (on this benchmark). The functions with the most ticks (>= 10) are, ordered from high to low:

  • simde_vshlq_u16
  • simde_vld3_u8
  • simde_vld4q_s16
  • simde_vclzq_s16
  • simde_vld1q_lane_s16
  • simde_vtrn1q_s32
  • simde_vtrn2q_s32
  • simde_vtrn1q_s16
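
One plausible reason `simde_vshlq_u16` ranks first (an assumption, not something the profile itself shows): Neon's `vshlq_u16` takes a separate signed shift count for every lane, with negative counts meaning a right shift, whereas WASM SIMD shifts apply a single scalar count to all lanes. A scalar sketch of the per-lane semantics, using the hypothetical name `ref_vshlq_u16` (not SIMDe's code), could look like this:

```c
#include <stdint.h>

/* Hypothetical scalar model of vshlq_u16; not SIMDe's actual implementation. */
static void ref_vshlq_u16(const uint16_t a[8], const int16_t b[8], uint16_t r[8]) {
  for (int i = 0; i < 8; i++) {
    /* Only the low byte of each count is used, interpreted as signed. */
    int shift = b[i] & 0xFF;
    if (shift > 127) shift -= 256;

    if (shift >= 16 || shift <= -16) {
      r[i] = 0;                           /* every bit is shifted out */
    } else if (shift >= 0) {
      r[i] = (uint16_t)(a[i] << shift);   /* positive count: shift left */
    } else {
      r[i] = (uint16_t)(a[i] >> -shift);  /* negative count: shift right */
    }
  }
}
```

The branch on the sign of each lane's count is the part that has no direct single-instruction equivalent when all lanes must share one shift amount.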

Note that libjpeg-turbo is considering a whole new SIMD implementation just for WASM, so please don't spend too much time on this.

kleisauke avatar Jun 12 '21 11:06 kleisauke