
WASM SIMD implementations of SSE

Open · nemequ opened this issue 4 years ago · 19 comments

We should add WASM simd128 implementations of as many SSE/SSE2/etc. functions as possible.

If a function has a WASM implementation, please check it off the list.

Not all functions will have reasonable WASM equivalents. If you find a function which you don't think can be implemented any better than the portable fallback, please just change it from a checklist item to a regular list item and move it to the end of the list for that particular ISA extension.

Note that GitHub doesn't tend to handle massive lists very well; it will probably be better to edit this comment and just put an x inside of the square brackets for list items to check them off.

If you don't have permission to edit this comment but would like to help out, please just add a comment and I'll add you to the project.

There are already a lot of functions in SIMDe which have WASM implementations but haven't been checked off of this list, so the first step is probably to go through the code and check off the functions which already have WASM implementations. It shouldn't be too hard: just grep for "wasm_" and check off the relevant function(s) for each result. When you do this for an ISA extension, please check it off of this first list:

  • [ ] SSE
  • [ ] SSE2
  • [ ] SSE3
  • [ ] SSSE3
  • [ ] SSE4.1

SSE:

  • [ ] _mm_max_pi16
  • [ ] _m_pmaxsw
  • [ ] _mm_max_pu8
  • [ ] _m_pmaxub
  • [ ] _mm_min_pi16
  • [ ] _m_pminsw
  • [ ] _mm_min_pu8
  • [ ] _m_pminub
  • [ ] _mm_mulhi_pu16
  • [ ] _m_pmulhuw
  • [ ] _mm_avg_pu8
  • [ ] _m_pavgb
  • [ ] _mm_avg_pu16
  • [ ] _m_pavgw
  • [ ] _mm_sad_pu8
  • [ ] _m_psadbw
  • [ ] _mm_cvtsi32_ss
  • [ ] _mm_cvt_si2ss
  • [ ] _mm_cvtsi64_ss
  • [ ] _mm_cvtpi32_ps
  • [ ] _mm_cvt_pi2ps
  • [ ] _mm_cvtpi16_ps
  • [ ] _mm_cvtpu16_ps
  • [ ] _mm_cvtpi8_ps
  • [ ] _mm_cvtpu8_ps
  • [ ] _mm_cvtpi32x2_ps
  • [ ] _mm_stream_pi
  • [ ] _mm_maskmove_si64
  • [ ] _m_maskmovq
  • [ ] _mm_extract_pi16
  • [ ] _m_pextrw
  • [ ] _mm_insert_pi16
  • [ ] _m_pinsrw
  • [ ] _mm_movemask_pi8
  • [ ] _m_pmovmskb
  • [ ] _mm_shuffle_pi16
  • [ ] _m_pshufw
  • [x] _mm_add_ps
  • [x] _mm_sub_ss
  • [x] _mm_sub_ps
  • [x] _mm_mul_ss
  • [x] _mm_mul_ps
  • [x] _mm_div_ss
  • [x] _mm_div_ps
  • [x] _mm_sqrt_ss
  • [x] _mm_sqrt_ps
  • [x] _mm_rcp_ss
  • [x] _mm_rcp_ps
  • [ ] _mm_rsqrt_ss
  • [x] _mm_rsqrt_ps
  • [x] _mm_min_ss
  • [x] _mm_min_ps
  • [x] _mm_max_ss
  • [x] _mm_max_ps
  • [x] _mm_and_ps
  • [x] _mm_andnot_ps
  • [x] _mm_or_ps
  • [x] _mm_xor_ps
  • [x] _mm_cmpeq_ss
  • [x] _mm_cmpeq_ps
  • [ ] _mm_cmplt_ss
  • [x] _mm_cmplt_ps
  • [x] _mm_cmple_ss
  • [x] _mm_cmple_ps
  • [x] _mm_cmpgt_ss
  • [x] _mm_cmpgt_ps
  • [x] _mm_cmpge_ss
  • [x] _mm_cmpge_ps
  • [x] _mm_cmpneq_ss
  • [x] _mm_cmpneq_ps
  • [ ] _mm_cmpnlt_ss
  • [ ] _mm_cmpnlt_ps
  • [ ] _mm_cmpnle_ss
  • [ ] _mm_cmpnle_ps
  • [ ] _mm_cmpngt_ss
  • [ ] _mm_cmpngt_ps
  • [ ] _mm_cmpnge_ss
  • [ ] _mm_cmpnge_ps
  • [x] _mm_cmpord_ss
  • [x] _mm_cmpord_ps
  • [x] _mm_cmpunord_ss
  • [x] _mm_cmpunord_ps
  • [x] _mm_comieq_ss
  • [x] _mm_comilt_ss
  • [x] _mm_comile_ss
  • [x] _mm_comigt_ss
  • [x] _mm_comige_ss
  • [x] _mm_comineq_ss
  • [x] _mm_ucomieq_ss
  • [x] _mm_ucomilt_ss
  • [x] _mm_ucomile_ss
  • [x] _mm_ucomigt_ss
  • [x] _mm_ucomige_ss
  • [x] _mm_ucomineq_ss
  • [ ] _mm_cvtss_si32
  • [ ] _mm_cvt_ss2si
  • [ ] _mm_cvtss_si64
  • [ ] _mm_cvtss_f32
  • [ ] _mm_cvtps_pi32
  • [ ] _mm_cvt_ps2pi
  • [ ] _mm_cvttss_si32
  • [ ] _mm_cvtt_ss2si
  • [ ] _mm_cvttss_si64
  • [ ] _mm_cvttps_pi32
  • [ ] _mm_cvtt_ps2pi
  • [ ] _mm_cvtps_pi16
  • [ ] _mm_cvtps_pi8
  • [ ] _mm_set_ss
  • [x] _mm_set1_ps
  • [x] _mm_set_ps1
  • [x] _mm_set_ps
  • [ ] _mm_setr_ps
  • [x] _mm_setzero_ps
  • [x] _mm_loadh_pi
  • [x] _mm_loadl_pi
  • [x] _mm_load_ss
  • [x] _mm_load1_ps
  • [x] _mm_load_ps1
  • [x] _mm_load_ps
  • [x] _mm_loadu_ps
  • [ ] _mm_loadr_ps
  • [ ] _mm_stream_ps
  • [x] _mm_storeh_pi
  • [x] _mm_storel_pi
  • [x] _mm_store_ss
  • [ ] _mm_store1_ps
  • [x] _mm_store_ps1
  • [x] _mm_store_ps
  • [x] _mm_storeu_ps
  • [ ] _mm_storer_ps
  • [x] _mm_move_ss
  • [ ] _mm_shuffle_ps
  • [ ] _mm_unpackhi_ps
  • [ ] _mm_unpacklo_ps
  • [ ] _mm_movehl_ps
  • [ ] _mm_movelh_ps
  • [x] _mm_movemask_ps
  • [ ] _mm_malloc
  • [ ] _mm_free
  • [ ] _mm_undefined_ps
  • [ ] _mm_storeu_si16
  • [ ] _mm_loadu_si64
  • [ ] _mm_storeu_si64
  • [ ] _mm_loadu_si16
  • [x] _mm_add_ss
  • _mm_getcsr
  • _mm_setcsr
  • _mm_prefetch
  • _mm_sfence

SSE2:

  • [ ] _mm_undefined_pd
  • [ ] _mm_undefined_si128
  • [x] _mm_loadu_si32
  • [x] _mm_storeu_si32
  • [x] _mm_pause
  • [ ] _mm_clflush
  • [ ] _mm_lfence
  • [ ] _mm_mfence
  • [x] _mm_add_epi8
  • [x] _mm_add_epi16
  • [x] _mm_add_epi32
  • [ ] _mm_add_si64
  • [x] _mm_add_epi64
  • [x] _mm_adds_epi8
  • [x] _mm_adds_epi16
  • [x] _mm_adds_epu8
  • [x] _mm_adds_epu16
  • [x] _mm_avg_epu8
  • [x] _mm_avg_epu16
  • [x] _mm_madd_epi16
  • [x] _mm_max_epi16
  • [x] _mm_max_epu8
  • [x] _mm_min_epi16
  • [x] _mm_min_epu8
  • [x] _mm_mulhi_epi16
  • [x] _mm_mulhi_epu16
  • [x] _mm_mullo_epi16
  • [ ] _mm_mul_su32
  • [x] _mm_mul_epu32
  • [ ] _mm_sad_epu8
  • [x] _mm_sub_epi8
  • [x] _mm_sub_epi16
  • [x] _mm_sub_epi32
  • [ ] _mm_sub_si64
  • [x] _mm_sub_epi64
  • [x] _mm_subs_epi8
  • [x] _mm_subs_epi16
  • [x] _mm_subs_epu8
  • [x] _mm_subs_epu16
  • [ ] _mm_slli_si128
  • [x] _mm_bslli_si128
  • [ ] _mm_bsrli_si128
  • [x] _mm_slli_epi16
  • [x] _mm_sll_epi16
  • [x] _mm_slli_epi32
  • [x] _mm_sll_epi32
  • [x] _mm_slli_epi64
  • [x] _mm_sll_epi64
  • [x] _mm_srai_epi16
  • [x] _mm_sra_epi16
  • [x] _mm_srai_epi32
  • [x] _mm_sra_epi32
  • [ ] _mm_srli_si128
  • [x] _mm_srli_epi16
  • [ ] _mm_srl_epi16
  • [x] _mm_srli_epi32
  • [x] _mm_srl_epi32
  • [x] _mm_srli_epi64
  • [x] _mm_srl_epi64
  • [x] _mm_and_si128
  • [x] _mm_andnot_si128
  • [x] _mm_or_si128
  • [x] _mm_xor_si128
  • [x] _mm_cmpeq_epi8
  • [x] _mm_cmpeq_epi16
  • [x] _mm_cmpeq_epi32
  • [x] _mm_cmpgt_epi8
  • [x] _mm_cmpgt_epi16
  • [x] _mm_cmpgt_epi32
  • [x] _mm_cmplt_epi8
  • [x] _mm_cmplt_epi16
  • [x] _mm_cmplt_epi32
  • [x] _mm_cvtepi32_pd
  • [ ] _mm_cvtsi32_sd
  • [ ] _mm_cvtsi64_sd
  • [ ] _mm_cvtsi64x_sd
  • [x] _mm_cvtepi32_ps
  • [ ] _mm_cvtpi32_pd
  • [x] _mm_cvtsi32_si128
  • [x] _mm_cvtsi64_si128
  • [ ] _mm_cvtsi64x_si128
  • [x] _mm_cvtsi128_si32
  • [x] _mm_cvtsi128_si64
  • [ ] _mm_cvtsi128_si64x
  • [ ] _mm_set_epi64
  • [x] _mm_set_epi64x
  • [x] _mm_set_epi32
  • [x] _mm_set_epi16
  • [x] _mm_set_epi8
  • [x] _mm_set1_epi64
  • [x] _mm_set1_epi64x
  • [x] _mm_set1_epi32
  • [x] _mm_set1_epi16
  • [x] _mm_set1_epi8
  • [ ] _mm_setr_epi64
  • [ ] _mm_setr_epi32
  • [ ] _mm_setr_epi16
  • [ ] _mm_setr_epi8
  • [x] _mm_setzero_si128
  • [ ] _mm_loadl_epi64
  • [ ] _mm_load_si128
  • [ ] _mm_loadu_si128
  • [ ] _mm_maskmoveu_si128
  • [x] _mm_store_si128
  • [ ] _mm_storeu_si128
  • [ ] _mm_storel_epi64
  • [ ] _mm_stream_si128
  • [ ] _mm_stream_si32
  • [ ] _mm_stream_si64
  • [ ] _mm_movepi64_pi64
  • [ ] _mm_movpi64_epi64
  • [x] _mm_move_epi64
  • [x] _mm_packs_epi16
  • [x] _mm_packs_epi32
  • [x] _mm_packus_epi16
  • [x] _mm_extract_epi16
  • [x] _mm_insert_epi16
  • [x] _mm_movemask_epi8
  • [ ] _mm_shuffle_epi32
  • [ ] _mm_shufflehi_epi16
  • [ ] _mm_shufflelo_epi16
  • [ ] _mm_unpackhi_epi8
  • [ ] _mm_unpackhi_epi16
  • [ ] _mm_unpackhi_epi32
  • [ ] _mm_unpackhi_epi64
  • [ ] _mm_unpacklo_epi8
  • [ ] _mm_unpacklo_epi16
  • [ ] _mm_unpacklo_epi32
  • [ ] _mm_unpacklo_epi64
  • [x] _mm_add_sd
  • [x] _mm_add_pd
  • [ ] _mm_div_sd
  • [x] _mm_div_pd
  • [ ] _mm_max_sd
  • [x] _mm_max_pd
  • [ ] _mm_min_sd
  • [x] _mm_min_pd
  • [ ] _mm_mul_sd
  • [x] _mm_mul_pd
  • [ ] _mm_sqrt_sd
  • [x] _mm_sqrt_pd
  • [ ] _mm_sub_sd
  • [x] _mm_sub_pd
  • [x] _mm_and_pd
  • [x] _mm_andnot_pd
  • [x] _mm_or_pd
  • [x] _mm_xor_pd
  • [x] _mm_cmpeq_sd
  • [x] _mm_cmplt_sd
  • [x] _mm_cmple_sd
  • [ ] _mm_cmpgt_sd
  • [ ] _mm_cmpge_sd
  • [ ] _mm_cmpord_sd
  • [ ] _mm_cmpunord_sd
  • [x] _mm_cmpneq_sd
  • [ ] _mm_cmpnlt_sd
  • [ ] _mm_cmpnle_sd
  • [ ] _mm_cmpngt_sd
  • [ ] _mm_cmpnge_sd
  • [x] _mm_cmpeq_pd
  • [x] _mm_cmplt_pd
  • [x] _mm_cmple_pd
  • [x] _mm_cmpgt_pd
  • [x] _mm_cmpge_pd
  • [x] _mm_cmpord_pd
  • [x] _mm_cmpunord_pd
  • [x] _mm_cmpneq_pd
  • [ ] _mm_cmpnlt_pd
  • [ ] _mm_cmpnle_pd
  • [ ] _mm_cmpngt_pd
  • [ ] _mm_cmpnge_pd
  • [x] _mm_comieq_sd
  • [x] _mm_comilt_sd
  • [x] _mm_comile_sd
  • [x] _mm_comigt_sd
  • [x] _mm_comige_sd
  • [x] _mm_comineq_sd
  • [x] _mm_ucomieq_sd
  • [x] _mm_ucomilt_sd
  • [x] _mm_ucomile_sd
  • [x] _mm_ucomigt_sd
  • [x] _mm_ucomige_sd
  • [x] _mm_ucomineq_sd
  • [x] _mm_cvtpd_ps
  • [x] _mm_cvtps_pd
  • [ ] _mm_cvtpd_epi32
  • [ ] _mm_cvtsd_si32
  • [ ] _mm_cvtsd_si64
  • [ ] _mm_cvtsd_si64x
  • [ ] _mm_cvtsd_ss
  • [x] _mm_cvtsd_f64
  • [ ] _mm_cvtss_sd
  • [ ] _mm_cvttpd_epi32
  • [ ] _mm_cvttsd_si32
  • [ ] _mm_cvttsd_si64
  • [ ] _mm_cvttsd_si64x
  • [ ] _mm_cvtps_epi32
  • [ ] _mm_cvttps_epi32
  • [ ] _mm_cvtpd_pi32
  • [ ] _mm_cvttpd_pi32
  • [x] _mm_set_sd
  • [x] _mm_set1_pd
  • [x] _mm_set_pd1
  • [x] _mm_set_pd
  • [ ] _mm_setr_pd
  • [x] _mm_setzero_pd
  • [x] _mm_load_pd
  • [x] _mm_load1_pd
  • [x] _mm_load_pd1
  • [ ] _mm_loadr_pd
  • [ ] _mm_loadu_pd
  • [x] _mm_load_sd
  • [x] _mm_loadh_pd
  • [x] _mm_loadl_pd
  • [ ] _mm_stream_pd
  • [x] _mm_store_sd
  • [x] _mm_store1_pd
  • [x] _mm_store_pd1
  • [ ] _mm_store_pd
  • [ ] _mm_storeu_pd
  • [ ] _mm_storer_pd
  • [ ] _mm_storeh_pd
  • [ ] _mm_storel_pd
  • [x] _mm_unpackhi_pd
  • [ ] _mm_unpacklo_pd
  • [x] _mm_movemask_pd
  • [ ] _mm_shuffle_pd
  • [x] _mm_move_sd
  • [ ] _mm_castpd_ps
  • [ ] _mm_castpd_si128
  • [ ] _mm_castps_pd
  • [ ] _mm_castps_si128
  • [ ] _mm_castsi128_pd
  • [ ] _mm_castsi128_ps

SSE3:

  • [ ] _mm_addsub_ps
  • [ ] _mm_addsub_pd
  • [ ] _mm_hadd_pd
  • [ ] _mm_hadd_ps
  • [ ] _mm_hsub_pd
  • [ ] _mm_hsub_ps
  • [ ] _mm_lddqu_si128
  • [x] _mm_movedup_pd
  • [ ] _mm_loaddup_pd
  • [x] _mm_movehdup_ps
  • [x] _mm_moveldup_ps

SSSE3:

  • [ ] _mm_abs_pi8
  • [x] _mm_abs_epi8
  • [ ] _mm_abs_pi16
  • [x] _mm_abs_epi16
  • [ ] _mm_abs_pi32
  • [x] _mm_abs_epi32
  • [x] _mm_shuffle_epi8
  • [ ] _mm_shuffle_pi8
  • [ ] _mm_alignr_epi8
  • [ ] _mm_alignr_pi8
  • [ ] _mm_hadd_epi16
  • [ ] _mm_hadds_epi16
  • [ ] _mm_hadd_epi32
  • [ ] _mm_hadd_pi16
  • [ ] _mm_hadd_pi32
  • [ ] _mm_hadds_pi16
  • [ ] _mm_hsub_epi16
  • [ ] _mm_hsubs_epi16
  • [ ] _mm_hsub_epi32
  • [ ] _mm_hsub_pi16
  • [ ] _mm_hsub_pi32
  • [ ] _mm_hsubs_pi16
  • [ ] _mm_maddubs_epi16
  • [ ] _mm_maddubs_pi16
  • [x] _mm_mulhrs_epi16
  • [ ] _mm_mulhrs_pi16
  • [ ] _mm_sign_epi8
  • [ ] _mm_sign_epi16
  • [ ] _mm_sign_epi32
  • [ ] _mm_sign_pi8
  • [ ] _mm_sign_pi16
  • [ ] _mm_sign_pi32

SSE4.1:

  • [ ] _mm_blend_pd
  • [ ] _mm_blend_ps
  • [x] _mm_blendv_pd
  • [x] _mm_blendv_ps
  • [x] _mm_blendv_epi8
  • [ ] _mm_blend_epi16
  • [ ] _mm_dp_pd
  • [ ] _mm_dp_ps
  • [ ] _mm_extract_ps
  • [x] _mm_extract_epi8
  • [x] _mm_extract_epi32
  • [x] _mm_extract_epi64
  • [ ] _mm_insert_ps
  • [x] _mm_insert_epi8
  • [x] _mm_insert_epi32
  • [x] _mm_insert_epi64
  • [x] _mm_max_epi8
  • [x] _mm_max_epi32
  • [x] _mm_max_epu32
  • [x] _mm_max_epu16
  • [x] _mm_min_epi8
  • [x] _mm_min_epi32
  • [x] _mm_min_epu32
  • [x] _mm_min_epu16
  • [x] _mm_packus_epi32
  • [ ] _mm_cmpeq_epi64
  • [x] _mm_cvtepi8_epi16
  • [x] _mm_cvtepi8_epi32
  • [ ] _mm_cvtepi8_epi64
  • [x] _mm_cvtepi16_epi32
  • [ ] _mm_cvtepi16_epi64
  • [ ] _mm_cvtepi32_epi64
  • [x] _mm_cvtepu8_epi16
  • [x] _mm_cvtepu8_epi32
  • [ ] _mm_cvtepu8_epi64
  • [x] _mm_cvtepu16_epi32
  • [ ] _mm_cvtepu16_epi64
  • [ ] _mm_cvtepu32_epi64
  • [x] _mm_mul_epi32
  • [x] _mm_mullo_epi32
  • [x] _mm_testz_si128
  • [x] _mm_testc_si128
  • [x] _mm_testnzc_si128
  • [x] _mm_test_all_zeros
  • [x] _mm_test_mix_ones_zeros
  • [x] _mm_test_all_ones
  • [x] _mm_round_pd
  • [x] _mm_floor_pd
  • [x] _mm_ceil_pd
  • [x] _mm_round_ps
  • [x] _mm_floor_ps
  • [x] _mm_ceil_ps
  • [ ] _mm_round_sd
  • [ ] _mm_floor_sd
  • [ ] _mm_ceil_sd
  • [ ] _mm_round_ss
  • [ ] _mm_floor_ss
  • [ ] _mm_ceil_ss
  • [ ] _mm_minpos_epu16
  • [ ] _mm_mpsadbw_epu8
  • [ ] _mm_stream_load_si128

Some of the functions won't see much, if any, improvement since we already have GCC-style vector extensions and OpenMP SIMD support. The real benefit will be for the functions that can't use GCC-style vectors: saturated operations, min/max, etc. And of course there will be a lot of cases where the compiler may or may not autovectorize (especially at -O2 instead of -O3), and explicitly calling the simd128 versions is more likely to result in vectorized code.
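For example, a saturating unsigned byte add maps to a single simd128 instruction, while the portable fallback needs a per-lane clamp that compilers don't always vectorize. A rough sketch of the difference (not SIMDe's actual code; the intrinsic is spelled wasm_u8x16_add_sat in current wasm_simd128.h and wasm_u8x16_add_saturate in older toolchains):

#include <stdint.h>
#if defined(__wasm_simd128__)
  #include <wasm_simd128.h>
#endif

/* Saturating unsigned add of 16 bytes, roughly what _mm_adds_epu8 does. */
void adds_epu8(const uint8_t a[16], const uint8_t b[16], uint8_t r[16]) {
#if defined(__wasm_simd128__)
  /* One WASM SIMD instruction: i8x16.add_sat_u. */
  wasm_v128_store(r, wasm_u8x16_add_sat(wasm_v128_load(a), wasm_v128_load(b)));
#else
  /* Portable fallback: per-lane add and clamp. */
  for (int i = 0; i < 16; i++) {
    unsigned int s = (unsigned int) a[i] + b[i];
    r[i] = (s > UINT8_MAX) ? UINT8_MAX : (uint8_t) s;
  }
#endif
}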

~The implementation shouldn't be too difficult. We'll need to add a v128_t entry to the simde__m128_private, simde__m128i_private, and simde__m128d_private unions, and of course properly detect WASM SIMD in simde-arch.h, then use that in the headers like sse.h and sse2.h. Other than that it should generally just be a matter of adding an #elif defined(SIMDE_ARCH_WASM_SIMD128) and an implementation.~

~Right now we don't have an emscripten build in CI; we'll have to figure out how to add it back in. I removed it because emscripten was running out of memory (it works fine on my desktop with 24 GiB RAM, but kept getting killed on the little VM workers). Adding some optimization flags may help reduce the memory requirement; I hadn't done that because emscripten was generating an internal compiler error, but that may be fixed now (at least using tot; hopefully there will be a release soon). If not, I should report the errors to emscripten so they can be fixed.~

nemequ · Feb 21 '20

I just pushed a commit (fb6dca9804bef6861fe3a7f82be3d2a6d88fb509) which adds the necessary bits to SSE and implements one function (_mm_add_ps) as an example.
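The pattern looks roughly like this. It is a simplified sketch that uses a stand-in union instead of simde__m128_private and the compiler's __wasm_simd128__ macro instead of SIMDe's own guards, so don't treat it as the exact code from that commit:

#include <stddef.h>
#if defined(__wasm_simd128__)
  #include <wasm_simd128.h>
#endif

/* Stand-in for simde__m128_private: scalar storage plus a v128_t view when
   WASM SIMD is available. */
typedef union {
  float f32[4];
#if defined(__wasm_simd128__)
  v128_t wasm_v128;
#endif
} m128_private;

static m128_private add_ps(m128_private a, m128_private b) {
  m128_private r;
#if defined(__wasm_simd128__)
  /* Direct mapping: SSE addps <-> WASM f32x4.add */
  r.wasm_v128 = wasm_f32x4_add(a.wasm_v128, b.wasm_v128);
#else
  /* Portable fallback used when there is no native implementation. */
  for (size_t i = 0; i < sizeof(r.f32) / sizeof(r.f32[0]); i++) {
    r.f32[i] = a.f32[i] + b.f32[i];
  }
#endif
  return r;
}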

nemequ · Feb 22 '20

Emscripten is working again in CI; I think it was just a buggy release (1.39.7 probably) that was causing the memory issues.

Unfortunately attempting to enable WASM SIMD support (with -msimd128 -s SIMD=1) causes an error when running the code. I've reported the issue to the emscripten people.

nemequ · Feb 22 '20

The error has been fixed in more recent versions of V8 than Node.js uses. Using the latest d8 instead of Node.js to run the tests fixed the problem, and -msimd128 -s SIMD=1 is now used on Travis for the emscripten build.

Now that they'll be properly tested I guess it's time to start adding implementations.

nemequ · Feb 25 '20

Awesome work Evan, thanks for kicking this off!

ngzhian · Feb 27 '20

I have added implementations for two functions and created a PR. I was wondering how to write some of the harder WASM implementations; I tried, but I couldn't find any documentation for the wasm_xx_xx functions. Also, how should I test whether my implementations are correct for this issue? Other than that, please tell me if I did the rebasing correctly this time.

masterchef2209 · Mar 07 '20

@masterchef2209 Not sure if I understand your question right. If you're looking for documentation of Wasm SIMD functions, you can find them at https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md

ngzhian · Mar 09 '20

This might be a fun place to test this work: https://github.com/gabrielcuvillier/d3wasm

SIMD code

Based on profiling, almost 25% of the game time is done in low level Vector/Maths operations. These functions have various SIMD backends (for SSE/SSE2/SSE3/etc.), but none of these backends are supported by WebAssembly. As so, the generic C++ backend code is used.

http://www.continuation-labs.com/projects/d3wasm/

mr-c · Mar 30 '20

@juj, sorry to pull you into this without warning. I see you've been working on adding SSE support back into emscripten… It seems like a waste to duplicate efforts, so I'm wondering if you would be interested in some sort of collaboration on this? At the very least, hopefully we can figure out how everyone sees the projects fitting together so we can all work with that in mind.

I haven't looked into your work in too much detail, but it looks excellent. I really like the documentation, which has given me some ideas to think about for SIMDe's (severely lacking) documentation, especially in combination with the plan for nemequ/simde#310.

My first thought when I saw your work was that maybe we should just drop WASM SIMD from SIMDe but after a bit more thought I decided keeping it around would make sense, at least for now, even though your work in emscripten would definitely reduce the need for it.

I don't know how extensive you intend to make the SSE headers in emscripten, but SIMDe's support for Intel functions using portable fallbacks is quite extensive. We have full support for SSE, SSE2, SSE3, SSSE3, SSE4.1, AVX, FMA, and GFNI, and work is underway on several others. You probably already have more explicit WASM implementations than we do, but a lot of the portable implementations should be just as fast with autovectorization. And, of course, we implement functions that it seems like you want to just skip, such as MMX and the 64-bit functions from SSE/SSE2.

The portable implementations are also nice because they can compile to WASM without SIMD as well. Given the limited support for WASM SIMD right now, this is a nice way to let people target WebAssembly today and trivially switch on SIMD later, either unconditionally or based on runtime detection.

SIMDe also isn't limited to emscripten. You can also use the WASM support in clang, especially now that wasm_simd128.h has moved over to clang. Of course the SSE headers could be moved over to clang, too, but hopefully in the future we'll have more compilers targeting WASM.

We have a pretty extensive test suite which could make development a lot easier. It's not perfect; there are cases we don't cover (yet), but it's definitely useful. If we can integrate the projects a bit better you might be able to make use of them.

Several options for how to proceed come to mind:

Obviously we could just ignore one another. This seems like a terrible waste, and is definitely my least-favorite option. IIRC there are ~5k vector functions across Intel's APIs; that's a lot of wasted effort. Let's not do that :)

We could just drop WASM SIMD from SIMDe and basically treat Emscripten as if it supported SSE natively. This may be the easiest option for SIMDe, though we'll need a lot of ifdefs to handle cases where emscripten doesn't yet implement the function in question, and unless the headers move from emscripten to clang this would be a regression for clang.

If you're willing to allow us to use your implementations under an MIT license, we could merge them into SIMDe. That gives us fast implementations on emscripten and clang, but even if development effort is mostly not duplicated the code would be. Unfortunately switching the license of SIMDe would be tough since we've merged a bunch of code from SSE2NEON and I'm not sure I can get a hold of all the authors.

We could also reverse that and have the initial implementations in SIMDe, then copy them to emscripten after they've been tested. Same problem with code duplication, of course, and the emscripten people may still (quite reasonably) insist on tests of the emscripten versions, so I'm not sure how beneficial this would really be.

If you would prefer that this code live outside of emscripten/LLVM, I would certainly be happy to have it in SIMDe. Until they accepted your PR I actually thought this was the Emscripten developer's preference; I seem to remember someone making the point that it makes it clearer that these functions are emulated instead of just silently accepting code that may end up being quite slow, and I know they have dropped an SSE emulation layer once before. This is actually my preference, but I'm obviously biased and can't be trusted :)

I think keeping compatibility layers like SIMDe outside of the compiler makes more sense. SIMDe's support for providing APIs other than Intel's (e.g., implementing NEON with WASM) is extremely limited, but we're working on it. I'm guessing emscripten doesn't want to provide compatibility layers for other architectures as well, so there is a bit of an inconsistency there, but maybe I'm the only one bothered by it.

This should also reduce the effort required to create WASM implementations of additional functions, which I know is a huge time sink. Like I mentioned earlier, we already have a pretty decent test suite, so all you would really have to do is add an #elif defined(SIMDE_WASM_SIMD128_NATIVE) and the implementation. I don't think it would be a problem to give you commit access to SIMDe, so there shouldn't really be any bottlenecks other than our extensive, rather slow, and pedantic CI, but too much testing is a nice problem to have.

So, how does everyone think we should handle this? While I do have an opinion I'm not insistent. Mainly I just want to make sure we're all on the same page.

nemequ · May 30 '20

@juj and @tlively will probably respond here with thoughts about working together - which we should definitely do as much as possible, I agree! - but just one quick point on this (as an emscripten dev but one that doesn't work on simd):

If you're willing to allow us to use your implementations under an MIT license, we could merge them into SIMDe.

That's already the case, Emscripten is MIT licensed.

kripken · May 30 '20

If you're willing to allow us to use your implementations under an MIT license, we could merge them into SIMDe. That gives us fast implementations on emscripten and clang, but even if development effort is mostly not duplicated the code would be.

This is my preferred direction. Going even further, it would be nice if we made using SIMDe in Emscripten as easy and transparent to users as using the current Emscripten-native SSE headers @juj is working on. Then we could replace the Emscripten-native headers with SIMDe's implementation. But it would be great if SIMDe leans heavily on @juj's work to minimize duplicated effort.

I think it makes the most sense in the immediate future for @juj to continue his work on the headers upstream in Emscripten and for that work to be imported into SIMDe, at least until Emscripten gains transparent SIMDe integration. That way the SSE intrinsics will be ready to use transparently with Emscripten sooner.

How does that plan sound to everyone? @ngzhian @seanptmaher

tlively · May 30 '20

Wow, I sprang this on everyone on a Friday night, I really wasn't expecting such a quick response. Thanks to both of you :)

That's already the case, Emscripten is MIT licensed.

Nice! I thought it was that MIT/BSD hybrid LLVM uses, but normal MIT makes stealing code from each other trivial.

I think it makes the most sense in the immediate future for @juj to continue his work on the headers upstream in Emscripten and for that work to be imported into SIMDe

Since we don't need to worry about the license, I can copy all the implementations from emscripten right away. @juj, would that be okay with you? Even if it's legal, I'm not comfortable doing that without your blessing…

Going even further, it would be nice if we made using SIMDe in Emscripten as easy and transparent to users as using the current Emscripten-native SSE headers @juj is working on. Then we could replace the Emscripten-native headers with SIMDe's implementation. But it would be great if SIMDe leans heavily on @juj's work to minimize duplicated effort.

This sounds good to me, but I'm not sure what it would look like.

Using SIMDe should already be pretty trivial. It should be a matter of just #define SIMDE_ENABLE_NATIVE_ALIASES (assuming you want to be able to use, for example, _mm_add_ps instead of simde_mm_add_ps) and including the relevant header(s). SIMDe only uses a build system for the tests.
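For example, a consumer that just wants the SSE functions under their usual names needs nothing more than something like this (the include path depends on where the headers end up):

#define SIMDE_ENABLE_NATIVE_ALIASES
#include "simde/x86/sse.h"

/* _mm_loadu_ps, _mm_add_ps, and _mm_storeu_ps resolve to their
   simde_mm_* equivalents. */
void add4(const float *a, const float *b, float *out) {
  _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}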

If you're talking about somehow distributing SIMDe with Emscripten, there are a couple of options:

We have amalgamated headers available if you want to just drop them in (note that, for example, sse2.h includes sse.h and mmx.h, so you only need one).

A submodule or subtree is also an option; the SIMDe repo is actually pretty big thanks to all the tests, but if you want something lighter I've started playing with a separate repo without the tests. ~I need to work on it some more (I'd like to preserve the git blame; I think git filter-branch can do the trick, but I haven't used it before), but that can happen sooner rather than later if you want me to prioritize it.~ (edit: done)

As far as making it completely transparent, I guess the most straightforward solution would be to have the headers in emscripten with the (horrible) Intel names just define SIMDE_ENABLE_NATIVE_ALIASES and include SIMDe. For example, xmmintrin.h could just be

#if !defined(SIMDE_ENABLE_NATIVE_ALIASES)
#define SIMDE_ENABLE_NATIVE_ALIASES
#endif
#include "simde/x86/sse2.h"

With this kind of integration I'd want to add some more emscripten builds to CI to make sure we don't break anything (and add -Weverything), including running the tests with a few different interpreters (right now we only test with v8). We already use emscripten tot in our CI, but it might also be a good idea to set up a cron job to make sure it runs at least once a day (not really a problem for SIMDe lately given all the activity, but still…). I'd also want to add at least one or two emscripten devs as collaborators on the SIMDe project in case of emergency.

Of course, all this only makes sense if @juj would be comfortable working within SIMDe instead of directly in emscripten. Maybe we should set up a call sometime to go over the project a bit, answer any questions, discuss any concerns, etc.?

nemequ · May 30 '20

I'm wondering if you would be interested in some sort of collaboration on this?

Yeah, certainly, whatever coordination we can find, definitely open for it!

I don't know how extensive you intend to make the SSE headers in emscripten, but SIMDe's support for Intel functions using portable fallbacks is quite extensive. We have full support for SSE, SSE2, SSE3, SSSE3, SSE4.1, AVX, FMA, and GFNI, and work is underway on several others. You probably already have more explicit WASM implementations than we do, but a lot of the portable implementations should be just as fast with autovectorization. And, of course, we implement functions that it seems like you want to just skip, such as MMX and the 64-bit functions from SSE/SSE2.

Yeah, SSE just recently landed, SSE2 is coming in this PR, and SSE3 + SSSE3 are in this PR. After those land, I am still adding SSE4.1 and FMA. I do have implementations of these from when I wrote support for them to Emscripten during the SIMD.js era, so this is not new work, but basically restoring old code and renaming/adapting the original SIMD.js intrinsic calls over to new Wasm SIMD.

MMX, AVX, AVX2, SSE4.2 and GFNI are out of my scope. Adding support for those in Emscripten though would be really great.

Also having support to target NEON intrinsics from Emscripten would be great to have - that is beyond my cycles though.

We have a pretty extensive test suite which could make development a lot easier.

The approach I chose for testing Emscripten SSEx support is to dual-compile the instructions against native vs Emscripten, print the results, and text-diff to verify that both sides agree. That gives quite a simple test suite, although its weakness/limitation is that it hinges on accurately enumerating the input data to test.
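Roughly, the shape of such a test is the following (an illustration, not the actual test_sse.h code): the same source is built once with a native compiler and once with Emscripten, both print results in a fixed text format, and the two outputs are diffed.

#include <stdio.h>
#include <xmmintrin.h>

static void print_m128(const char *name, __m128 v) {
  float f[4];
  _mm_storeu_ps(f, v);
  printf("%s = [%g, %g, %g, %g]\n", name, f[0], f[1], f[2], f[3]);
}

int main(void) {
  /* Built once natively and once with emcc; the two text outputs are
     diffed to check that the emulated intrinsics agree with hardware. */
  print_m128("_mm_add_ps",
             _mm_add_ps(_mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f),
                        _mm_set_ps(5.0f, 6.0f, 7.0f, 8.0f)));
  print_m128("_mm_min_ps",
             _mm_min_ps(_mm_set_ps(1.0f, -2.0f, 3.0f, -4.0f),
                        _mm_set_ps(0.5f, 0.5f, 0.5f, 0.5f)));
  return 0;
}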

If you're willing to allow us to use your implementations under an MIT license, we could merge them into SIMDe.

Definitely, feel free to use any of the work. If you find that any of the implementations I wrote are sloppy and can be improved, would be great to focus attention to them in a PR or an issue, and let's look at wasm-dises or v8 generated disassembly together.

I seem to remember someone making the point that it makes it clearer that these functions are emulated instead of just silently accepting code that may end up being quite slow

The intent is that Emscripten will not allow targeting SSE instruction sets unless one explicitly passes -msse, -msse2, etc. on the command line. I.e. unlike native compilers that default to enabling SSE, with Emscripten one has to explicitly pass -msse manually on the command line.

, and I know they have dropped an SSE emulation layer once before. This is actually my preference, but I'm obviously biased and can't be trusted :)

That was tlively dropping the SSE code on the basis that he thought it was somehow fake/running in scalar via the ecmascript-simd.js polyfill, to make room to ease the implementation of the upcoming Wasm SIMD work (though he did not delete any of the actual SIMD.js stuff back then; that was eventually removed in https://github.com/emscripten-core/emscripten/pull/11180).

Emscripten should definitely support SSE (and NEON!) out of the box, by passing appropriate -m* flags to target the respective archs. E.g. if one wants to target Wasm SIMD without enabling any SSE/NEON paths, one can pass -msimd128, and if one wants to go via SSE, one can pass -msse, and so on.

@juj, would that be okay with you? Even if it's legal, I'm not comfortable doing that without your blessing…

Yeah, definitely, go for it!

Ultimately I have the following requirements for Emscripten SSE support:

  • it should work without changing user code, i.e. users include xmmintrin.h et al. and use the intrinsic names as they are, the only exception being things that are hard-unsupported (e.g. rounding mode control),
  • it should work without having to install external packages, i.e. Emscripten should bundle SIMDe in it if we go that route,
  • the performance landscape documentation is first-tier critical: (our) SIMD developers will be blind without that, so any changes to any of the implementations will need to track that doc. I am hoping to expand that doc with either native V8 disassembly or synthetic benchmark results (once they stabilize); we need "white box" information on how the intrinsics will be (soft-)guaranteed/expected to perform, to keep away from "try random things for what is fastest" land (which already riddles much of native SIMD development in practice).
  • easy way to find & verify the implementation of a particular intrinsic (to debug/double-check something)

So whether Emscripten eventually uses the headers that I contributed or uses SIMDe for the support that is brought into Emscripten is not something I am too particular about, as long as we can retain the above.

juj · May 31 '20

MMX, AVX, AVX2, SSE4.2 and GFNI are out of my scope. Adding support for those in Emscripten though would be really great.

Without AVX, I guess FMA will be limited to the 128-bit functions? What about the 128-bit AVX-512 functions (AVX-512VL and later extensions)?

The approach I chose for testing Emscripten SSEx support is to dual compile the instructions against native vs Emscripten, and then print the results, and text-diff to verify that both sides agree. That gives quite a simple test suite, although its weakness/limitation is it hinging on accurately enumerating the input data to test.

That makes sense; we can't really do that in SIMDe because we target other architectures, but you get to just run your tests on x86. I used to have some of these in SIMDe (in addition to the pre-generated vectors), but the API for creating them was horrible. I'd like to try to get some going again, but I need to think through the API a bit better this time.

How do you deal with functions which require immediate-mode arguments?

Emscripten should definitely support SSE (and NEON!) out of the box, by passing appropriate -m* flags to target the respective archs. E.g. if one wants to target Wasm SIMD without enabling any SSE/NEON paths, one can pass -msimd128, and if one wants to go via SSE, one can pass -msse, and so on.

Obviously a lot of that logic would need to go on the emscripten side. With SIMDe, if you include the relevant header you have access to those functions (for example, SSE2 is defined in simde/x86/sse2.h), so emscripten's emmintrin.h could include SIMDe's sse2.h only if -msse2 has been passed, and maybe emit some sort of diagnostic otherwise?

The only real complications here are functions which we have pulled into earlier headers since they are useful for implementing other functions. This mostly happens with AVX-512VL and AVX-512BW functions going in our avx512f.h; for earlier extensions there are probably less than a handful of functions we do this for. If that's something you're not willing to live with I think we can probably come up with a solution.

That said, are you sure you want to require passing -msse*? It seems like an unnecessary complication to me. When targeting x86 -msse* makes sense because using those functions adds a dependency on hardware which supports it, but that's not true for emscripten. Emitting a diagnostic if -msimd128 isn't passed seems reasonable to me (not the default for SIMDe, but you can do that in *mmintrin.h), but I don't really get -msse. This is really an emscripten question, but I think something like:

#if !defined(__XMMINTRIN_H)
#define __XMMINTRIN_H

#if !defined(__wasm_simd128__) && !defined(SHUT_UP_AND_DO_IT)
#  warning SSE2 likely to be very slow without passing -msimd128
#endif

#if !defined(SIMDE_ENABLE_NATIVE_ALIASES)
#  define SIMDE_ENABLE_NATIVE_ALIASES
#endif
#include "simde/x86/sse2.h"

#endif /* !defined(__XMMINTRIN_H) */

Would be very reasonable (after choosing a different name for a certain macro, of course).

  • it should work without changing user code, i.e. use include xmmintrin.h et al., and use the intrinsics names as they are, the only exception being things that are hard-unsupported (e.g. rounding mode control),

We define everything in the simde_* namespace; i.e., simde_mm_add_ps instead of _mm_add_ps, but if you define SIMDE_ENABLE_NATIVE_ALIASES prior to including the header we will also #define _mm_add_ps(a, b) simde_mm_add_ps(a, b).

We do use fixed-size types, so for example simde_mm_set1_epi32 takes an int32_t instead of an int. I don't think this should be a problem for emscripten since the types match what Intel uses, but in cases where there might be an issue we add explicit casts in the native alias macros.
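As a hypothetical illustration of that pattern (not an excerpt from the real headers, which may or may not cast this particular function):

#include <stdint.h>
#include "simde/x86/sse2.h" /* provides simde__m128i and simde_mm_set1_epi32 */

#if !defined(_mm_set1_epi32)
  /* What the native-alias macro boils down to: simde_mm_set1_epi32 takes an
     int32_t, and the cast lets callers keep passing a plain int. */
  #define _mm_set1_epi32(a) simde_mm_set1_epi32((int32_t)(a))
#endif

/* Existing code that uses the Intel name keeps compiling unchanged. */
static simde__m128i four_sevens(void) {
  return _mm_set1_epi32(7);
}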

On CI we actually have a script to strip the simde_ prefix so all our tests run against the native aliases, and it works even with -Weverything. I don't think I've actually tried running that test on emscripten, but we could definitely do so, and add it to CI.

So I think we're good on this point as long as you're not bothered by a bunch of extra stuff in simde_*.

  • the performance landscape documentation is first tier critical: (our) SIMD developers will be blind without that, so any changes to any of the implementation will need to track that doc. I am hoping to expand that doc either with native V8 disassembly side, or synthetic benchmark results (once they stabilize); we need "white box" information on how the intrinsics will be (soft-)guaranteed/expected to perform, to keep away from "try random things for what is fastest" land (which already riddles much of native SIMD development in practice).
  • easy way to find & verify the implementation of a particular intrinsic (to debug/double-check something)

This all sounds good, but I feel like I should mention that this is a little more complicated for SIMDe since one of our targets is straight C99. We decorate it with OpenMP SIMD, clang loop pragmas, GCC ivdep, etc., when possible, but obviously we can't specify exactly what native instructions the code compiles to when we can't specify what we're running on. That doesn't preclude us from doing this, it just means we have to think about cases other than WASM.

I'd love to have some tooling that verifies code compiles to the instructions we're expecting (when we're targeting that extension). And, if we can generate that type of information we could probably also use much of the same code to run llvm-mca on the output, which would be a great way for us to spot optimization opportunities (e.g., "why is clang's output for xxx so much slower than GCC's?", or vice versa).

As for verification (I assume you're talking about verifying that the generated code is what we expect, not proving correctness), I usually just do something like https://godbolt.org/z/FVYW6F. It's not automatic, but it's pretty easy… https://simde.netlify.com/godbolt/demo is a bit more fun to play with, but https://godbolt.org/z/REeDLV is probably more useful in this context.

I think automation is going to be crucial here; if we can't enforce this in CI it is pretty much guaranteed to go out of sync without anyone noticing until a user complains that something is slow. I really like the idea of automatically generating a wrapper function which we run through llvm-mca (which could be exported for documentation), coupled with a regression check in CI. It would require some new code, but I bet it's something that a lot of performance-critical projects would be interested in using.

So I suppose whether Emscripten eventually uses the headers that I contributed, or it uses SIMDe for the support that is brought in to Emscripten, that is not that particular, as long as we can retain the above.

Either way it will be using the WASM implementations you contributed. Right now for WASM we mostly rely on the compiler to auto-vectorize the portable (+ OpenMP) implementations, which obviously has limitations. Since it seems like everyone is good with us copying your implementations I think that's definitely the right way to go. That should give us parity on those functions pretty quickly.

I don't see any reason why the documentation would be any more difficult to maintain in SIMDe than emscripten, so that's probably a wash right now.

If you want to distribute SIMDe with emscripten, my opinion is that it would make sense to do it right after copying @juj's implementations so the code isn't duplicated and we don't have to worry about it getting out of sync. So maybe we should talk about how everyone sees it happening. Submodule/subtree? Just copying the files? Some sort of script in the SDK to pull in the latest release (which would require me to actually make a release, which I know would make some other people happy…)? Also, what do you want to see on the SIMDe side of things first (additional CI builds, fixing certain issues, etc)? I'd really like to make sure this all goes smoothly and doesn't turn into a huge headache for everyone involved…

nemequ · May 31 '20

Emscripten should definitely support SSE (and NEON!) out of the box, by passing appropriate -m* flags to target the respective archs. E.g. if one wants to target Wasm SIMD without enabling any SSE/NEON paths, one can pass -msimd128, and if one wants to go via SSE, one can pass -msse, and so on.

Obviously a lot of that logic would need to go on the emscripten side. With SIMDe, if you include the relevant header you have access to those functions (for example, SSE2 is defined in simde/x86/sse2.h), so emscripten's emmintrin.h could include SIMDe's sse2.h only if -msse2 has been passed, and maybe emit some sort of diagnostic otherwise?

Agreed. I filed https://github.com/emscripten-core/emscripten/issues/11311 to discuss the Emscripten-specific parts of this.

So I think we're good on this point as long as you're not bothered by a bunch of extra stuff in simde_*.

Yes, this sounds fine. I think having some functions in earlier headers than they're supposed to be in is ok by the same logic, too.

I'd love to have some tooling that verifies code compiles to the instructions we're expecting (when we're targeting that extension).

This would be super cool. I'd love to have this even just for the regular wasm_simd128.h implementation. Probably something as simple as dumping the assembly for a file that uses all the intrinsics would be enough.

tlively · Jun 01 '20

Without AVX, I guess FMA will be limited to the 128-bit functions? What about the 128-bit AVX-512 functions (AVX-512VL and later extensions)?

256-bit and 512-bit wide instructions are out of scope, since Wasm SIMD does not support them. Also AVX-512 is out of scope due to limited hardware availability.

How do you deal with functions which require immediate-mode arguments?

See test_sse.h with the *_Tint test mode macro; the immediate input is unrolled over possible integer values.

That said, are you sure you want to require passing -msse*?

It looks like people are complaining even about that, i.e. a user reported a bug where Emscripten was targeting SSE when their build system passed -msse to it. We settled on Emscripten requiring -msse -msimd128 to enable SSE.

In general we cannot target Wasm SIMD without some form of cmdline flag, since that affects the set of browsers that one can run the build output with.

We define everything in the simde_* namespace; i.e., simde_mm_add_ps instead of _mm_add_ps, but if you define SIMDE_ENABLE_NATIVE_ALIASES prior to including the header we will also #define _mm_add_ps(a, b) simde_mm_add_ps(a, b).

We do use fixed-size types, so for example simde_mm_set1_epi32 takes an int32_t instead of an int. I don't think this should be a problem for emscripten since the types match what Intel uses, but in cases where there might be an issue we add explicit casts in the native alias macros.

On CI we actually have a script to strip the simde_ prefix so all our tests run against the native aliases, and it works even with -Weverything. I don't think I've actually tried running that test on emscripten, but we could definitely do so, and add it to CI.

So I think we're good on this point as long as you're not bothered by a bunch of extra stuff in simde_*.

Hmm, what you describe seems a bit complex for Emscripten's needs. Ideally the code would be straightforward to read without excess indirections/decorations. Being able to open up e.g. emmintrin.h and search the implementation of a particular function is very helpful for programmers, since that will quickly help people figure out the amount of emulation that particular functions take on top of Wasm SIMD, i.e. which are direct mappings, which take a SIMD path, and which take a scalarized fallback.

If you want to distribute SIMDe with emscripten, my opinion is that it would make sense to do it right after copying @juj's implementations so the code isn't duplicated and we don't have to worry about it getting out of sync. So maybe we should talk about how everyone sees it happening. Submodule/subtree? Just copying the files?

I would be in favor of directly copying the files, as submodules incur an amount of git wrangling that people sometimes do not like. Submodules have been proposed in Emscripten before on other targets (html-minifier, closure, terser come to mind), but have been turned down each time.

juj · Jun 03 '20

In general we cannot target Wasm SIMD without some form of cmdline flag, since that affects the set of browsers that one can run the build output with.

I get that, I just don't see the advantage of requiring people to pass -msse -msimd128 to use SSE instead of just -msimd128.

Hmm, what you describe seems a bit complex for Emscripten's needs. Ideally the code would be straightforward to read without excess indirections/decorations. Being able to open up e.g. emmintrin.h and search the implementation of a particular function is very helpful for programmers, since that will quickly help people figure out the amount of emulation that particular functions take on top of Wasm SIMD, i.e. which are direct mappings, which take a SIMD path, and which take a scalarized fallback.

It's definitely more complicated than what emscripten needs, but it really doesn't require too much extra from the user. Instead of opening up emmintrin.h they just open up sse2.h. The code isn't really that complicated: there is just a #define _mm_foo_bar(baz) simde_mm_foo_bar(baz) macro immediately following the simde_mm_foo_bar function; it's really pretty easy to follow.

If people really have trouble with it we could probably add some sort of script to preprocess the sources and remove some of the cruft that emscripten doesn't need. Honestly, I'm not sure it's worth the effort, but it's an option.

I would be in favor of directly copying the files, as submodules incur an amount of git wrangling that people sometimes do not like. Submodules have been proposed in Emscripten before on other targets (html-minifier, closure, terser come to mind), but have been turned down each time.

It sounds like that's what everyone on the emscripten side wants to do, and I don't really have a preference.

nemequ · Jun 04 '20

@nemequ Do we have a way to track progress on this? Would it be possible to generate a checklist for @seanptmaher and others to check off as they go?

mr-c · Jul 16 '20

Good point. I just added several lists to the original comment, as well as some directions for how I think we can handle this.

nemequ · Jul 17 '20

@nemequ is it good enough to check the corresponding box above if there is a WASM implementation of the _ps version of an intrinsic and the _ss version calls simde_mm_move_ss + the _ps version?

Example for simde_mm_sub_ss: https://github.com/simd-everywhere/simde/blob/e090746b7079bf7c6b85b71d05a8732f84779436/simde/x86/sse.h#L3916
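i.e. the shape is roughly this (simplified from the line linked above):

#include "simde/x86/sse.h"

/* The _ss form reuses the packed _ps implementation, then keeps lane 0 of
   that result and lanes 1-3 of a, which is what simde_mm_move_ss does. */
static simde__m128 sub_ss_via_ps(simde__m128 a, simde__m128 b) {
  return simde_mm_move_ss(a, simde_mm_sub_ps(a, b));
}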

If not, I may have been too enthusiastic and I'll go uncheck some boxes :sheep:

mr-c · Dec 29 '20