
How will native code port on top of JS-SIMD?

Open juj opened this issue 10 years ago • 39 comments

With Emscripten, we have the capacity to port native C and C++ code to the web. When/if people read tweets along the lines of "JS has SIMD", it will invariably result in a stream of Emscripten developers attempting to port their MMX/SSE1/SSE2/... -based codebases over to JS-SIMD. We need to have an answer for these developers about what mapping these constructs over to JS-SIMD looks like.

In the Emscripten compiler, we already have small bits of such SIMD support available. To chart what this mapping would look like for SSE1 in particular (focusing on just one instruction set spec to start with, and SSE1 is the most interesting one) when completed, I wrote up this spreadsheet: https://docs.google.com/spreadsheets/d/1QAGGf2M2IA6l4cvh8eTXdXGEUcPjdmTe_BLKGn5YCB4/edit?usp=sharing

As one can imagine, comparing the current spec and the set of SSE1 intrinsics listed in the above spreadsheet, there is a large gap. I wonder how this could be resolved?

juj avatar Sep 15 '14 08:09 juj

@juj , thanks for the spreadsheet. It is very informative! I happen to have worked on JS-SIMD in emscripten a bit. As I understand it, emscripten generates SIMD.js code from 1) LLVM vector types (<4 x i32>, <4 x f32> and <2 x f64>) and operations; 2) emscripten builtins, e.g. emscripten_float32x4_min. So to fill the SSE1 intrinsics gap, there could be two ways:

  1. map SSE1 intrinsics to LLVM vector type operations; where there is no direct mapping, some helper code might be needed
  2. expose SIMD.js via emscripten builtins, then map SSE1 intrinsics to these emscripten builtins.

If there is a need to add new SIMD.js API, we need to weigh it against the JavaScript API design principles (cross-architecture portability, for example) and open a specific issue here for discussion.

Your thoughts?

I found your filed issues (https://github.com/kripken/emscripten/labels/SIMD) in emscripten repo. I am willing to work with you to fill the gap. Let's see how far we can go. :)

huningxin avatar Sep 15 '14 21:09 huningxin

Looking at the current code, I am very worried about the potential need for such "helper code". Also, while assembling the SSE1 support spreadsheet, I could not imagine how to support that API without performance cliffs. The front page talks about a "straw man proposal", so let me try to attack that here, somewhat boldly if that's ok:

Has it been considered that the JS-SIMD spec would directly add the SIMD intrinsics as-is from the instruction set to the spec? That is, after adding the new SIMD types (including int64), we would have SIMD.SSE1.load_ps, SIMD.SSE1.load_ss, SIMD.SSE1.loadh_pi, and so on (or simply via SIMD.load_ps without the extra .SSE1.), and the same for other intrinsic sets and NEON? Then a common "mapping" of overlapping functions would be layered on top, e.g. in the namespace SIMD.common.xxx (or simply document which of the SIMD.xxx are common), for people who want to write one SIMD code to work on both SSE and NEON. This would have the following advantages:

  • it would provide the functionality as-is like the hardware has it, unchanged.
  • there would be a clear contract of explicit performance: when calling a function, one knows what one will be getting, just like with native SIMD. There would be no hidden instructions generated under the hood.
  • developers can reuse existing documentation from Intel and ARM for SIMD intrinsics, since the API is identical.
  • the API would match 1:1 with features provided by hardware (e.g. SIMD.SSE3.xxx will be available iff the target hardware supports SSE3)
  • developers will have the power to choose what kind of compatibility to target: if they want their code to run across SSE and NEON, they can limit to the documented common overlapping feature set, or if they want to target a specific SSSE3 or SSE4.1 functionality, they can also choose to do so themselves.
  • also developers can reuse existing code samples, and transform existing native SIMD code algorithms and snippets over to JS systematically.
  • the web working group does not have to invent a new SIMD instruction set, and in turn, developers will not need to learn a new SIMD instruction set and the surface area of new required documentation will be much smaller.
  • it would offer a suitable compile target for Emscripten, and be directly usable for existing native code that uses SIMD, like Vorbis and Theora, or games and game engines like Unity3D or Unreal Engine 4.

For developers who want to use e.g. SSE2/3 but still have their code work on NEON without breaking, we could offer an API like SIMD.allowSoftwareEmulation(), which enables all functionality but implements it in terms of another SIMD instruction set, or in software, in the absence of the real thing. Or alternatively (and perhaps simpler), offer a JS polyfill library that implements those functions.
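
As a rough sketch of how user code could look with such an API (all of the names below are hypothetical; nothing here is in the current spec):

// Hypothetical sketch only: the SIMD.SSE1 / SIMD.NEON namespaces and their
// functions do not exist in the current spec; signatures are invented here.
// 'heap' is assumed to be a Float32Array over the application's memory.
function addOneToFour(heap, i) {
  if (SIMD.SSE1) {
    var a = SIMD.SSE1.loadu_ps(heap, i);                    // 1:1 with movups
    a = SIMD.SSE1.add_ps(a, SIMD.SSE1.set1_ps(1.0));        // 1:1 with addps (set1 is a helper)
    SIMD.SSE1.storeu_ps(heap, i, a);                        // 1:1 with movups
  } else if (SIMD.NEON) {
    var b = SIMD.NEON.vld1q_f32(heap, i);                   // 1:1 with vld1.32
    b = SIMD.NEON.vaddq_f32(b, SIMD.NEON.vdupq_n_f32(1.0)); // 1:1 with vadd.f32
    SIMD.NEON.vst1q_f32(heap, i, b);                        // 1:1 with vst1.32
  } else {
    for (var k = 0; k < 4; ++k) heap[i + k] += 1.0;         // scalar fallback
  }
}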

There are hundreds of different domain areas that utilize SIMD in different forms. I worry that if, for example, some of the SSE1 intrinsics (or MMX, SSE2, or NEON) are left out from the spec, we will need to soft-emulate those functions in code when compiling with Emscripten, which can easily become catastrophically slow. That will force developers to adapt their code so that their SIMD only uses the "native" JS-SIMD feature set, which in turn will create a big need to author new "JS-SIMD porting guide and SSE1/SSE2/... emulation tips" documentation on how existing SSE algorithms should be rewritten for JS-SIMD, and what is supported and what is not. If instead JS-SIMD emulated an operation with a sequence of one or more instructions that the browser runs under the hood, it will lead to failure when the performance is not what the developer expected.

Currently I see that there is already pressure for JS-SIMD to jump over the fence to cater to domain-specific areas like #58, and if the spec offered the direct hardware instructions, solving a problem like #58 would be easy for the developer to do himself, like he does in the native world.

Reading the issues in the tracker, I see that a Mandelbrot code sample has been used as a test, but I think that the real test should be to use different applications, which means that in addition to simple amplified for-loop processing (parallel for, autovectorization), it should stress audio/video decoding (Vorbis/Theora et al.), image processing (RGB<->YUV/RGB888<->RGB565, color->grayscale, gamma adjust, ...), string and block ops, raytracing and games (micro-interleaved SIMD ops), to name a few.

The approach that was taken with WebGL was to "just copy GLES2, and make sure it's safe", and it was very successful. I think the same should be done with SIMD: just copy the intrinsic APIs over, and make sure it's safe. That will give us comfort that all of the above-mentioned domain areas will be catered for, since the native world has already proven that, and that the performance will be equally good, since the hardware mapping is explicit. The only purpose of SIMD is performance, and I think that we will fail unless the spec can deliver that uncompromised, in explicitly written down guarantees like "this function will compile down to this SSE/NEON instruction".

This turned out to be a much longer writeup than I expected, and I'm sure this discussion has already been had elsewhere, so thanks if you managed to read it all the way to the end! Emscripten will be one of the heavy users of SIMD, and we already have more than a dozen codebases that mostly use MMX, SSE and SSE2, and they would happily flip the switch if they could compile over to JS-SIMD, so that's good to keep in mind as the spec evolves!

juj avatar Sep 16 '14 10:09 juj

@juj - thanks for writing that up.

If one believes (as I do) in the extensible web manifesto, then there is a tension between exposing native, hardware-specific capabilities on the one hand; and trying to unify hardware under a common API on the other.

Unification looks good at first blush, but if the portable path is the intersection of divergent hardware architectures' low-level APIs or ISAs, then the result will not compete with "Native code", and so it will not advance the Web -- more likely it will hold back the Web vs. Native via indirect (opportunity) and even direct (intersection implementation) costs.

Unification via "union" rather than "intersection" means performance cliffs for low-level interfaces such as SIMD, which are worse than the alternative you outline. No one wants cliffs.

This leaves hardware- or ISA-specific APIs. Developers can then adapt higher-level libraries based on what is available (good old "object detection"). Just as native code developers have always done. And similar to how JS developers have coped with browsers and hardware across time and space.

This is where I land, too. Comments from others more than welcome. We probably need an es-discuss thread or three to really thrash this out, but I'm happy to start here.

/be

BrendanEich avatar Sep 17 '14 00:09 BrendanEich

I'm open to the idea of ISA-specific APIs. That's an interesting conversation to have. However, it still makes sense to have an "intersection"-ish API to serve as a common shared base, which is roughly the current SIMD spec that's in progress here today. I can see both styles of APIs coexisting, and even complementing each other. Developers could choose to use the portable API when they want to run well everywhere and don't need platform-specific features, and the hardware-specific API when they feel that's appropriate, or mix the two to make their own tradeoffs.

And so, I'd also like to continue to make progress on this "intersection"-ish API we have here, regardless of the direction of the "union"-ish API conversations.

sunfishcode avatar Sep 17 '14 18:09 sunfishcode

I agree the intersection, of stuff that we want to guarantee runs well across all major SIMD implementations, is most important here. But I also see the motivation for something like SIMD.SSE1 etc. So what about this as a possible "compromise":

  1. We spec and implement the intersection (what we are already doing). This is going to be fast on all major CPUs, in as guaranteed a way as we can do on the web.
  2. We implement a semi-official JS "polyfill" for SIMD.SSE1. It uses the specced SIMD API under the hood (where possible). This means that it works in all browsers, as it is "just" a polyfill. However, by having this be a semi-official way to represent SSE1 etc. operations, browsers might take care to optimize it well. In the limit, a browser could make sure that those patterns are actually optimized down to the relevant SSE1 operation, when SSE1 is available, because semantically those patterns are identical to an SSE1 operation.
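
As a rough illustration, a couple of entries in such a polyfill might look like this (the portable operation names are assumed from the current draft and might not match it exactly):

// Sketch of hypothetical SIMD.SSE1 polyfill entries layered on the portable API.
// Ops with no exact portable counterpart would be spelled out as a fixed
// sequence of portable ops, which an engine could pattern-match back down to
// the single SSE1 instruction when running on x86.
var SSE1 = {
  // Exact 1:1 case: addps has a direct portable counterpart.
  _mm_add_ps: function (a, b) { return SIMD.float32x4.add(a, b); },
  // Helper intrinsic: set1 is just a splat.
  _mm_set1_ps: function (x) { return SIMD.float32x4.splat(x); }
};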

This does lack a guarantee of actually getting SSE1 when you ask for it, but then you wouldn't get it if your website happens to run on an ARM phone either...

kripken avatar Sep 17 '14 18:09 kripken

@kripken: the problem with 2 is that object detection, even without fallback, is better than a perf cliff. The app that doesn't also code for NEON will just fail to start (it should have arranged to be Intel-only before running, anyway).

I like common portable APIs, don't get me wrong -- so I agree with @sunfishcode that where hardware has a viable intersection semantics, we should have a generically namespaced API. That has no perf cliff problem.

Is there a non-cliff, a "perf hill", that you think could be tolerable enough to be preferable to no-service on the "wrong" arch?

/be

BrendanEich avatar Sep 17 '14 20:09 BrendanEich

Great discussion all, and thanks to @juj for the initial post.

Since processors will evolve over time, in the future we might see SIMD capabilities (some evolutionary, some more radical) that do not exist on any of the processors today. Inevitably, SIMD.JS needs to evolve as well. One option would be to bring in a group of new capabilities in each generation of SIMD.JS. For instance, now we may start with a set drawn from the existing common capabilities of the processors, as well as judiciously selected instructions/capabilities that are justified by their dramatic performance impact for certain application domains (we do not want to miss them, and they do not pose performance cliffs). This would be analogous to the first version of SSE in the native world. Of course, for the first generation of SIMD.JS, we are not restricted to SSE, SSE2, etc. In other words, at each new generation of SIMD.JS, through the collective agreement of the community, we bring in new capabilities that are considered necessary, very helpful, etc.

Now, we are bringing in the SIMD object. Later on, we can add a SIMD2 object, and so on. Ideally, it would be required to have backward compatibility: SIMD_n implies availability of SIMD_m for m < n. That way, object detection would also be very practical. Again, there may not be any connection between SIMD_n and SSE_n.
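
For example, object detection could be as simple as this (SIMD2 being purely hypothetical, as above):

// If SIMD_n implies SIMD_m for m < n, one check for the newest level suffices.
var simdLevel = (typeof SIMD2 !== "undefined") ? 2
              : (typeof SIMD  !== "undefined") ? 1
              : 0;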

This is approximately the way the native world of each CPU platform works today and it seems a plausible approach for the web.

So, now we should decide what should come in at the first stage. This seems best driven by application domains. I don't consider all SIMD instructions equally important, some are more equal ;)

-moh

mhaghigh avatar Sep 18 '14 03:09 mhaghigh

@BrendanEich , not sure I follow you? There is always going to be a perf cliff in some case here. If someone writes code specifically for one CPU's SIMD, and it runs on another, the polyfill or the browser will have to implement the right semantics in a likely slower manner. For that reason it seems risky to put one CPU's specific operations in an official spec. But a semi-official polyfill that browsers are free to implement or not, is within the realm of normal optimization unpredictability on the web. Or do you mean a different type of perf cliff here?

kripken avatar Sep 19 '14 00:09 kripken

@kripken: It would help if I defined "perf cliff": I mean the code works everywhere, but terribly slowly on some platforms, untenable slowness (4x slowdown counts? I think so).

The cases I see, to repeat in case I was unclear (quite possible!) are:

  1. Portable intersection semantics (no perf cliffs, but intersection could be too small a set).
  2. Portable union semantics (emulations with perf cliffs).
  3. Non-portable union among top desktop+mobile SIMD ISAs (no perf cliffs, see below).

Obviously combinations are possible, and good. As with WebGL = OpenGL ES2 (currently), a strong enough (1) wins for many cases.

But SIMD and desktop/mobile divergence make me think (1) + (3) is strictly better, and worth the risk of non-portable JS being written. Let the github hordes help us discover the future (1), rolling up what wins from (3) and co-evolving with the hardware.

I'm assuming hardware vendors pay attention to what developers do with (1)+(3). I'm also giving devs the advantage, since web devs number ~10M vs. ~500K native devs. Check my numbers!

/be

BrendanEich avatar Sep 19 '14 01:09 BrendanEich

It's possible the perf cliffs would be small if enough people make sure to support top SIMD ISAs, yes. That leaves new ISAs, but as you say, hardware vendors are likely aware of this stuff. But, the main concern is if people just write to one CPU. If it's a github library, then collaboration can fill in the holes, but in a specific app, they may well just focus on their main market (one CPU/browser/OS maybe).

This seems unavoidable anyhow, though. Safari's FTL uses LLVM which can autovectorize, and that approach may increase in parallel to the SIMD.js API. Autovectorization will always have such perf cliffs. So for that reason I am not too worried about adding non-portable things. However, I do feel that putting such non-portable things in a spec is troubling - for that reason I was suggesting it be in a semi-official library on the side. A library or autovectorization can also lead to perf cliffs, but are less problematic from a standards perspective.

Overall I think there is little difference between our positions. Perhaps I am focusing too much on small details.

kripken avatar Sep 19 '14 22:09 kripken

@kripken: it's true, unless a particular instruction available only in one arch were on a super-critical path, the cliff might be much less than 4x for the macro-benchmark. Hard to say without concrete instruction, emulation, and macro-benchmark.

It would be helpful to me at least to see the NEON version of

https://docs.google.com/spreadsheets/d/1QAGGf2M2IA6l4cvh8eTXdXGEUcPjdmTe_BLKGn5YCB4/edit#gid=0

and then the union and intersection, or at least their sizes. Is anyone doing that?

/be

BrendanEich avatar Sep 19 '14 22:09 BrendanEich

Thanks for all the discussion here!

I filled out the spreadsheet on the SSE1 support page to add a new column on how those SSE1 instructions map to NEON.

If one is looking for a strict set intersection at the intrinsics API level between NEON and SSE1 only, where the semantics are exactly identical (ignoring flush-denormals-to-zero and hardware fp exceptions) then, if I got it right, it is equal to the following functions:

_mm_loadu_ps = vld1q_f32
_mm_set1_ps = vdupq_n_f32
_mm_storeu_ps = vst1q_f32
_mm_add_ps = vaddq_f32
_mm_mul_ps = vmulq_f32
_mm_sub_ps = vsubq_f32
_mm_and_ps = vandq_u32 + vreinterpret_q
_mm_or_ps = vorrq_u32 + vreinterpret_q
_mm_xor_ps = veorq_u32 + vreinterpret_q
_mm_cmpeq_ps = vceqq_f32
_mm_cmpge_ps = vcgeq_f32
_mm_cmpgt_ps = vcgtq_f32
_mm_cmple_ps = vcleq_f32
_mm_cmplt_ps = vcltq_f32

If I missed something, please help complete the chart in the SSE1 spreadsheet page. As one can see following the spreadsheet, the overlap is very small.

That set intersection is perhaps barely suitable for autovectorization, which I see as a very small and uninteresting part of SIMD that at most applies to problems that one could call "embarrassingly SIMDable". Most applications of SIMD are outside that scope, so I think any kind of core minimal required set intersection vs optional SSE/NEON extensions approach will not work.

Dan rather excellently provides an example of my greatest fears in #67 . Thanks Dan for the research there! In #67, we are asking whether we should spec the min() function according to how NEON works (and then x86 suffers a perf hit) or how SSE works (and then NEON suffers a perf hit), so it becomes a question of which one to favor at the expense of the other platform. This is the situation I would like to avoid at all costs: with that kind of interface, the compiler has to insert instructions under the hood to satisfy the requirements put forth by our JS-SIMD specification. One might think that a few extra instructions is not bad, but in that example, running three instructions instead of just one is a +200% slowdown impact. But it gets even worse: since Dan picked the NEON max instruction there, how would we implement the semantics of _mm_max_ps on top of that API? This would require us to doubly emulate the semantics: Emscripten would emulate SSE max on top of the NEON-style JS-SIMD max, which the browser would in turn emulate on top of the SSE max instruction. The slowdown can easily be 10x or more for a single instruction.

Native developers enjoy the following advantages when writing SIMD:

  1. The developer can choose which SIMD set to target by choosing the intrinsic functions to use.
  2. The intrinsics are strictly documented to specify which hardware instruction they will run. (the few that don't, like _mm_set_ps, are helper ones for predictable instruction patterns)
  3. The developer can (and will!) verify what he got by investigating the disassembly of the generated code.

In the native world, the only reason that developers accepted intrinsics, and everyone doesn't still write SIMD by hand in assembly, is the combination of 2 and 3. These provide a way for developers to understand where their performance is going. I would argue that in order to deliver on par with native, we need 2 and 3 as well. Currently, the web does not have any kind of history with 3, which would be especially important to have if the compiler has to do compatibility emulation like #67 under the hood. Otherwise we might be providing developers with an unpredictable black box one can't reason about, and a "trust us, we picked the fastest sequence for you" argument is outright patronizing. Also, I'm a bit worried about 1. Dan's fastest options for #67 are when AVX or SSE 4.1 is available, and require a fallback on older SSE sets - but the max function was an SSE1 operation to start with!

I agree that it is critical that we have consistently computed results across platforms. The more I think about this, the more I think we should abort any attempt to merge the SSEx and NEON instruction sets together into a new overarching JS-SIMD API. Instead we should offer all the intrinsics as-is, without trying to come up with a merged API, especially if that would mean compromises like #67. To solve #67, we would have SIMD.SSE1._mm_max_ps and SIMD.NEON.vmaxq_f32, which would compute the maximum using the full semantics of either. This way we would not favor or disfavor either x86 or ARM by choosing an official instruction, and we would give the guarantee of a direct mapping where available. It would also solve the double-emulation problem from above. The results would be consistent: you can run SIMD.SSE1._mm_max_ps on ARM devices as well, and it will use the best SSE-over-NEON sequence we know of (either as a polyfill, or as implemented by the browser), using the exact semantics delivered by SSE1. As a bonus, we would provide an API which allows querying which SIMD instruction sets the current hardware directly supports, so that user code can choose which path to take.

I see that would be the perfect performance + cross-platform compatibility solution. What kind of arguments are there against this kind of approach?

juj avatar Sep 22 '14 20:09 juj

On the topic of _mm_max_ps in particular:

We won't be mapping _mm_max_ps onto the JS-SIMD max function. _mm_max_ps has defined behavior on NaN, and no matter what we do in #67, it won't make sense to do max + extra stuff when we can just do select(greaterThan(x, y), x, y) (or something very close to that). JITs can even pattern-match that down to a single maxps instruction on x86 if they wish, and even if they don't it's still only about 4 instructions or so (and fewer with SSE4.1+). On ARM this will just be a compare and select, 2 instructions if I'm not missing something, which isn't terrible. There won't be any 10x slowdowns or double emulation for min or max.
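
To spell out why the compare+select form preserves the x86 semantics, here is a small illustration (I'm writing select and greaterThan with float32x4 namespacing for concreteness; the exact spelling per the draft may differ, and the lane values are chosen to hit the special cases):

// maxps returns its second operand whenever either input is NaN (or both
// inputs are zero); greaterThan is false in exactly those lanes, so select
// picks y there, matching _mm_max_ps(x, y) lane for lane.
var x = SIMD.float32x4(NaN, 1.0, -0.0, 3.0);
var y = SIMD.float32x4(2.0, NaN, +0.0, 4.0);
var m = SIMD.float32x4.greaterThan(x, y);  // false in every lane of this example
var r = SIMD.float32x4.select(m, x, y);    // (2, NaN, +0, 4), same as _mm_max_ps(x, y)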

On the topic of intersection versus platform-specific API approaches:

As I said above, I'm open to discussing platform-specific APIs. It's bold, and it's good for us implementers to hear from this perspective, and it's a great conversation to have.

However, even if we do do that, there is still significant utility in a common intersection API, which should include all the stuff in your strict intersection list above (thanks for compiling that list, btw!), and also several things that are "pretty close", which I would say includes shuffles, swizzles, min/max, and perhaps some other things. In any kind of real code that can stay within this intersection, I expect average overhead will usually be lower than 200%, because the most common things are all still single-instruction.

I'm aware that there is a class of developers who define success in terms of the percentage of some theoretical peak of the hardware they have pre-selected for the software to run on, and they may feel that they cannot possibly be successful with this API. However there are also developers who would be happy to write code that simply runs several times faster than scalar code on any decent SIMD-capable CPU, present or foreseeable future, and they will find they can do a lot with this API. I'm even hopeful that we can do a good enough job in the intersection to appeal to a fair number of people in the middle of that spectrum as well.

I also don't want to live in a "trust us we picked the best instruction" world. I think part of the answer here is that we should ideally improve our tools for allowing developers to inspect the assembly code generated by the JIT. Part of the answer may be that we have platform-specific SIMD APIs alongside the portable API. Part of the answer may be that there will hopefully be some JS-SIMD benchmarks that we can compare across implementations.

sunfishcode avatar Sep 22 '14 22:09 sunfishcode

As a follow-up, I just added implementations of _mm_max_ps and _mm_min_ps to Emscripten's xmmintrin.h using compare+select as described above. This makes the NaN and -0.0 handling exactly match that of x86, which is what the <xmmintrin.h> API wants, and it avoids the double-emulation problem.

sunfishcode avatar Sep 23 '14 01:09 sunfishcode

For reference, here is the commit mentioned above: https://github.com/kripken/emscripten/commit/8c8c7fd3ac716f20c21a8edee9e2010d672d76d5 . The select(greaterThan(x, y), x, y) set of instructions would directly map to

movaps mask, x
cmpps mask, y, GT
andps x, mask
andnotps mask, y
orps x, y

which is five instructions, and requires one extra temporary register compared to maxps x, y. I don't think that would be good performance in any scenario. The proposed solution that "JITs can even pattern-match that down to a single maxps instruction on x86 if they wish" feels like the wrong direction for the spec, because:

  • that is the opposite of the "explicit performance guarantees" notion.
  • Firefox might do the pattern-matching but other browsers might not, leading to a source of performance differences across browsers.
  • pattern-matching might complicate JS VM development and slow down runtime JIT compilation times.
  • this might be source material for "arcane magic"-like performance tips guides for JS-SIMD, where people would document paragraphs like "If you don't care about NaN handling, for best performance, avoid calling SIMD.max() when running on x86, since it emulates an ARM NEON max instruction. Instead, prefer the sequence SIMD.select(SIMD.greaterThan(x, y), x, y) to match SSE semantics and have the JIT generate the actual SSE max instruction."
  • A developer will need a JS-SIMD disassembler (point 3 from the earlier post) to be able to confirm if he got the intended sequence or not.

I hope that we would need as little pattern-matching in the JIT as possible to deliver performance (beyond the usual register allocation that takes place in the compiler). Such patterns feel like fixing up, after the fact, an interface that was not expressive enough. Strictly for Emscripten purposes I think it might work, since we control both sides of the fence and can make sure that they evolve hand in hand, but for the general web, I think that would be a disservice.

Would it be possible to assemble a spreadsheet, where all JS-SIMD API instructions are listed in one column, then in another the assembly sequence that they compile down to on x86, and in a third column, the assembly sequence that they compile down to on ARM? I think that would be very important to see, even if it wouldn't end up being an official part of the spec.

juj avatar Sep 23 '14 09:09 juj

I contend that portable but 2x or slower is a non-starter. SIMD-based C/C++ code can't tolerate it when cross-compiled if the resulting JS is to compete with native and provided the slowdown dominates total runtime, or merely is bad enough that users notice and object or seek native code.

Sorry if I'm missing something -- @sunfishcode, please help me see why architecture-independent API with 2x or greater slowdown is worth doing, in a competitive case analysis. I can see that it's better than no SIMD, but I then argue it's not competitive with native.

Telling devs who can't take the slowdown to try the architecture-specific APIs is risky: you probably lure some devs into wasting human coding cycles, finding perf loss unacceptable, and then rewriting. In general "make it right, then make it fast" -- but we are not in a general code regime, we're dealing with (a) SIMD intrinsics in Emscripten source, and (b) winning over low-level hackers who use C/C++ to use JS as well and with the same guarantees.

Of course, a slow portable API could be good enough if the slowdown hits only a small part of the total schedule. But then how important was SIMD to such a program in the native case?

/be

P.S. JIT pattern-matching in competitive regimes works: engines level up to tie or win at benchmarketing and/or "design wins" in sales ($0 but still) settings. But JIT pattern-matching is a sideshow if our goal is to compete with native, where hackers hand-select SIMD instructions to get best performance.

BrendanEich avatar Sep 23 '14 19:09 BrendanEich

I expect we'll beat 2x in many cases with the portable API. Even though they have dominated the discussion here, min and max are a sideshow compared to add and mul.

That said, I expect I'm not going to be your main challenge to convince about doing a "union" API. I am talking with people I know to learn what people think about the idea, and I encourage everyone interested in this to do the same.

sunfishcode avatar Sep 23 '14 22:09 sunfishcode

My two cents: unless we choose a truly union API (exposing every SIMD op up to the bleeding edge), there will always exist cases where the best JS-SIMD can do is 4x slower than the best native can do. I think the important thing isn't so much % coverage of the instruction set but % coverage of real world use cases.

Also, I think it makes sense to avoid hidden performance cliffs by not supporting automatic translation for 100% of mmintrin.h but, rather, having an mmintrin.h-derived "emscriptintrin.h" that contained only the ops in JS-SIMD. I think "write new SIMD code for new platforms" is part of the usual porting story for applications and so requiring a rewrite for emscriptintrin.h doesn't seem unreasonable. Also, this will help make it clear what JS-SIMD supports and help collect feedback for future iterations.

I do assume, though, that pretty quickly we'll want ops that are only fast on one arch. In that case, I think we should expose this fact through feature testing. Rather than separating by instruction set, I was thinking perhaps we could use a scheme:

  1. SIMD.{float32x4, ...}.* : ops that are optimal on both SSE/NEON
  2. SIMD.arch.{float32x4, ...}.* : ops that are optimal only on the current device
  3. SIMD.simulated.{float32x4, ...}.* : the union of all SIMD.arch.* ops; not necessarily optimized

Thus, 1 is the intersection, 2 describes the current device and 3 is the union. Only ops in 2 need feature testing and a good portable implementation would start by feature testing SIMD.arch before falling back on an implementation using a mix of 1 and 3. The point of 3 is that, with the full instruction set at its disposal, the JIT should be able to do a better job at simulating ops than JS could in terms of JS-SIMD (still achieving a speedup over plain scalar code).

Applying this to the current situation with min/max, we could consider:

  1. SIMD.float32x4.minNoNaN - undefined what happens with NaN (fast on SSE/NEON)
  2. SIMD.arch.float32x4.{min, minAsymmetric} - the former available on NEON, the latter on SSE
  3. SIMD.simulated.{min, minAsymmetric} - call whichever you want, it's slower on the other arch.

This gives the programmer maximum control over saying what they want.
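
In rough code, the intended usage pattern would be something like this (SIMD.arch and SIMD.simulated are hypothetical namespaces, and the splat helper's exact name is assumed):

// Hypothetical usage of the tiers above: prefer the device-native op, fall
// back to the simulated union (correct but possibly slower) elsewhere.
function clampToOne(v) {
  var one = SIMD.float32x4.splat(1.0);
  if (SIMD.arch && SIMD.arch.float32x4.minAsymmetric) {
    return SIMD.arch.float32x4.minAsymmetric(v, one);     // 1:1 with minps on SSE
  }
  return SIMD.simulated.float32x4.minAsymmetric(v, one);  // same semantics, emulated
}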

But maybe this is overkill? It's definitely overkill if min/max are the only use cases; I think we need more iteration with what we have now, in the intersection API, to know what the situation really is.

There is also the issue that I would expect an intersection API to be much easier to initially get into the standard and implemented (as long as it was shown to have enough ops to be generally useful). Once we have this foot in the door, it seems like we'd be able to iterate quickly on SIMD.arch ops. With feature-testing, browsers would start getting the fast paths as soon as they implemented the new ops.

ghost avatar Sep 24 '14 02:09 ghost

One other thought I had, regarding the "WebGL was a success by exactly modeling the underlying API" line of reasoning:

WebGL emulated OpenGL, which itself had already done the work of providing a device/manufacturer-independent graphics hardware abstraction. If WebGL had followed the pattern we're discussing here with SIMD.SSE.x/SIMD.NEON.y, we'd have two similar-but-different interfaces for OpenGL and DirectX and Microsoft would have never optimized for GL.

Similarly, if we can provide the developer a way to test which operations are efficient (1:1 with machine insns) (analogous to WebGL's extension-testing, I think?), then it stands to reason that both Intel and ARM might, in the future, evolve to support the other's optimized ops. With feature-testing, we'll just enable these ops after doing cpuid testing and existing code will just run faster. Over time, JS-SIMD could end up being an OpenGL-like force that promotes SIMD feature convergence.

ghost avatar Sep 25 '14 14:09 ghost

@juj, @andhow, @kripken and I discussed this earlier today. The conclusion was that if we're going to embark on a bold new strategy here, we'll need some compelling arguments to motivate it, and the best argument for this kind of thing is data. So, when Emscripten+OdinMonkey are ready to rock some intersection-style SIMD together (and this is coming soon!), we'll compile some code and do some hopefully realistic benchmarking and, in general, collect some real data. What works, what doesn't work, what's fast, what's slow, what's easy to fix, and what's a lost cause. Then we'll be able to make a more informed decision, and if we need to do something bold, we'll be able to explain our choices to others with data to back them up.

sunfishcode avatar Sep 26 '14 00:09 sunfishcode

So, when Emscripten+OdinMonkey are ready to rock some intersection-style SIMD together (and this is coming soon!), we'll compile some code and do some hopefully realistic benchmarking

So excited about that!

huningxin avatar Sep 26 '14 02:09 huningxin

This discussion is excellent. Thank you all. @sunfishcode and I have also argued on this topic in #asm.js. Allow me to make my case here too.

The only reason to use SIMD is performance.

The challenge for a great many SIMD algorithms is arranging data: gathering, mixing, and splatting into the appropriate lanes, doing a tiny bit of SIMD work, and scattering the register lanes back out into memory.

Sometimes, after taking a scalar algorithm and applying SIMD, you might merely see a 2x performance increase. Perhaps even less. Rarely will you see the maximum of a 4x increase.

Thus, it's likely that any additional instructions emitted for the sake of consistency-under-NaN across ISAs will ENTIRELY offset the gain of going SIMD in the first place.

My recommendation is to leave SIMD semantics under NaN unspecified or implementation-specified for maximum performance.

chadaustin avatar Sep 26 '14 22:09 chadaustin

Hi @chadaustin. I see that you're passionate about this issue, which is great, because we would benefit from some help :-).

One thing that would help would be testcases, preferably code we can run, but pseudo-code or just a description of an algorithm can also be useful. The stronger connection to a real-world use case the better.

The NaN consistency issue is disproportionately represented in min/max, so I'm likely to face someone claiming that the concern is overstated because a real-world testcase would do things other than just min/max. How should I respond to that? A testcase demonstrating a real use case where the NaN consistency issue causes significant slowdown would be a powerful motivator. Thanks!

sunfishcode avatar Sep 29 '14 14:09 sunfishcode

@andhow: I think the important thing isn't so much % coverage of the instruction set but % coverage of real world use cases.

I find that statement a bit objectionable. In the recent meeting, @sunfishcode asked me to come up with real world use cases to motivate why a union API is needed, or why direct SSE support should be added, and while I'll do my best to provide such data, I think it would be presumptuous or outright arrogant if that data were later used to separate SSE into "this is the important part" and "this is the part we don't need to care so much about" categories. If we were designing a new specification, I would agree, but here we are bringing over a feature from the native world that has already proved successful there. Also, since the real world use cases are built on the native world specifications, we are guaranteed that if we can match the native instruction sets, we will also cover the real world use cases. The number of instructions in the set is very small compared to the number of applications that have been written on top of the instruction sets.

@andhow: Also, I think it makes sense to avoid hidden performance cliffs by not supporting automatic translation for 100% of mmintrin.h but, rather, having an mmintrin.h-derived "emscriptintrin.h" that contained only the ops in JS-SIMD. I think "write new SIMD code for new platforms" is part of the usual porting story for applications and so requiring a rewrite for emscriptintrin.h doesn't seem unreasonable.

In the Emscripten community, I am one of the big proponents of adding Emscripten-specific APIs, and I do a lot of the work involved in designing and implementing those (to which @kripken likes to object :), but SIMD is not one of those areas where that makes sense. Asking users to rewrite their SIMD code would make sense if we were dealing with a new platform that actually had new SIMD hardware in place, but we don't. If we have an application that is written to talk SSE, and it is being run on a processor that talks SSE, it will be a very hard sell to tell a developer that on the web these can't connect directly, that he must rewrite the code (assuming that is even possible if the JS-SIMD instruction set is too limited), and that the end result he will get won't be as good as direct SSE-to-SSE in native.

The title of this issue is specifically "How will native code port on top of JS-SIMD?", and by this, you are proposing that native code should not. I don't see that reasonable. If we do tell developers that their SSE code (or NEON code for that matter) will not apply to the web, we have conceded that JS-SIMD does not support the native code porting use case.

In the webkit mailing list, there was an argument that JS-SIMD should not even exist because SIMD is not performance-portable. The bit about performance-portability is absolutely true. But I see it as a fact that native SIMD developers routinely deal with, and there are no real problems associated with it. For native developers, it is not a problem that different hardware has different performance characteristics, since the developer has direct access to each hardware platform and has the tools in his toolbox to design for this:

  1. Native developer recognizes that the problem he is solving is representable in the SSE and NEON intersection (for the set of problem input values he cares about), so he simply aliases the operations under a common interface (SSE: https://github.com/juj/MathGeoLib/blob/master/src/Math/simd.h#L57 , NEON: https://github.com/juj/MathGeoLib/blob/master/src/Math/simd.h#L231), and then writes one algorithm using that common interface, that works on both: https://github.com/juj/MathGeoLib/blob/master/src/Geometry/AABB.cpp#L475 . He will get the 100% native performance on both SIMD instruction sets, since the instructions map 1:1 to the underlying hardware instructions.

  2. Native developer recognizes that the problem requires a different approach for each SIMD instruction set. He conditions the code with a branch in a cold part of the code, which jumps into the appropriate hot path:

    function SolveProblem(input)
    {
      if (SupportsSSE())
        return RunHotAlgorithmWithSSE(input);
      else if (SupportsNEON())
        return RunHotAlgorithmWithNEON(input);
      else
        return RunHotAlgorithmWithScalar(input);
    }
    
  3. Native developer recognizes that there are too many paths where different approaches are needed for SSE and NEON and that such runtime if-else branches are not feasible to maintain without performance loss. He recompiles the code for both platforms:

    function SolveProblem(input)
    {
    #ifdef SUPPORTS_SSE
        return RunHotAlgorithmWithSSE(input);
    #elif defined(SUPPORTS_NEON)
        return RunHotAlgorithmWithNEON(input);
    #else
        return RunHotAlgorithmWithScalar(input);
    #endif
    }
    

JS-SIMD is currently trying to specify a set intersection of instructions, along with an emulation layer to make the set intersection absolutely consistent across ARM and x86 hardware. That goes partway toward enabling the first category, but without a direct 100% instruction-to-instruction performance mapping guarantee. We have a native world full of code that already solves the performance-portability challenge via the second or third category, but with the current JS-SIMD, we would not be able to reuse those solutions.

If the JS-SIMD spec gave direct access to the instruction sets, web developers would be able to reuse the same design tools that native developers have, and the performance-portability problem would be an equally manageable problem for the web as it is for native developers today. It might even be easier, because there is a fourth option: 4. with the help of polyfills, web developers would have the extra ability to write SSE code that actually runs on NEON, and vice versa, in which case the browser could emulate it with the closest thing it has. In that case, developers would be happy instead of angry, since they understand that emulation is reasonable: if I wrote an application that talks only SSE, and I'm running it on a NEON chip, of course I can expect a performance loss.

With a direct intrinsics-level API, the web developer can choose from options 1-4 to decide which will give the best performance.

@andhow: Similarly, if we can provide the developer a way to test which operations are efficient (1:1 with machine insns) (analogous to WebGL's extension-testing I think?), then it stands to reason that both Intel and ARM might, in the future, evolve to support the others' optimized ops.

This is an outcome that I am not capable of predicting, but I think it's fair to agree that we should design the spec for the real world of today if we want it to have a practical impact now. I would rather we solved this problem in the spec ourselves, instead of waiting to see whether the hardware industry changes around us and removes the problem.

@chadaustin: My recommendation is to leave SIMD semantics under NaN unspecified or implementation-specified for maximum performance.

If we modelled direct SIMD intrinsics access, we would not have unspecified or implementation-specified behavior. I think that would be better for the web, as both SIMD.SSE1.max and SIMD.NEON.max would be strongly specified, each with semantics of its own and no uncertainties, but still with maximum performance. That way the user would explicitly know that if he is getting different results on x86 and ARM, it must be due to one or more of the if (SupportsSSE()) vs if (SupportsNEON()) paths he wrote himself, which gives a stronger clue for tracking down the origin than unspecified behavior in the spec (which he might not have read). Conversely, if the developer did not write a single if (SupportsX()) statement in his code, he would be guaranteed to get identical results across ARM and x86, with performance that depends on which variant of SIMD functions his app was written in and what the current execution platform is.

@sunfishcode: One thing that would help would be testcases, preferably code we can run, but pseudo-code or just a description of an algorithm can also be useful. The stronger connection to a real-world use case the better.

I wrote an automated benchmark of the current SSE1 API implementation over the weekend. It is available here: https://github.com/juj/emscripten/commits/sse1 . The way to run it is to check out the code and then run python tests/benchmark_sse1.py; the test will run automatically and generate a results_sse1.html page in the current directory. Here are the results of running the benchmark on my system: http://clb.demon.fi/dump/results_sse1_20140929.html . For anyone looking through that link, please don't draw any conclusions from the current numbers yet, since it is not yet an asm.js-validated run.

What the test does is stress each individual function of the SSE1 API and time it. It is synthetic, I know, but I think it is a superset of all the real-world SSE codebases, and therefore has a stronger connection to real-world use cases than any single real-world codebase by itself has. I do think that if it is to be rejected as an invalid test case, the reason should be something other than just the label "synthetic" it comes with. I wrote it specifically as a tool to give data for @sunfishcode and @huningxin to use to work on https://github.com/kripken/emscripten/issues/2793, so I'm hoping you'll be willing to approach it with an open investigative mind. If the test is bad, please point the bad parts out, and how we could fix the test up. I believe that excelling in synthetic tests like this one will be a prerequisite for excelling in real world codebases. It is of course not a replacement for real world codebases, but if I had the capacity to optimize only one test case, it would be this synthetic test. Let me know if I can help you run the benchmark on your own systems. Note also that the https://github.com/juj/emscripten/commits/sse1 branch comes with more implemented SSE1 functions than the current upstream xmmintrin.h file has. We should try to merge that in soon.

I'll work on investigating C/C++ codebases that we could build for actual real world benchmarks. Video and Audio codecs and FFT come to mind at first, so I'll probably go for some of that field.

juj avatar Sep 29 '14 22:09 juj

Thanks Dan! :)

The major SIMD algorithm from IMVU is already represented in the skinning benchmark at https://github.com/chadaustin/Web-Benchmarks/tree/master/skinning and @huningxin is already looking at that.

I just uploaded another minor one here: https://gist.github.com/chadaustin/0ad326c7e06cda799cf7

There's another one I can't paste publicly, but it's basically a Blinn-Phong lighting calculation with color vectors being accumulated (ambient, diffuse, specular terms) and then saturated to [0,1] with minps and maxps.

Here's a simple triangle-to-depth-buffer rasterizer linked from Fabian's excellent series on optimizing the Intel Software Occlusion Culling demo: https://github.com/rygorous/intel_occlusion_cull/blob/97eae9a8/SoftwareOcclusionCulling/DepthBufferRasterizerSSEMT.cpp#L219

http://fgiesen.wordpress.com/2013/02/10/optimizing-the-basic-rasterizer/

That's all I've got handy at the moment...

I think the problem with saying that minps and maxps are rare is that, while true, any kind of saturating arithmetic inside an inner loop is going to use one of them. Any clamped arithmetic would use both. All just so the spec can precisely define NaN semantics, which I think is a bad idea in the first place. :) Then again, I think JavaScript would benefit from a healthy dose of undefined behavior in general. ;)

chadaustin avatar Sep 29 '14 23:09 chadaustin

Thanks @chadaustin, I haven't had time to look at everything in detail, but it looks really helpful!

sunfishcode avatar Sep 30 '14 18:09 sunfishcode

@juj: The purpose of my request for benchmarks and testcases was to allow me to evaluate how good our current design is. If we get data and it exposes minor things that could easily be fixed, we're going to just fix those things. If it exposes a manageable number of bigger things which could be added to the current design, possibly in the manner that @andhow has outlined above, we're likely to just do that. Your proposal above would have much higher costs for us, so while I'm open to it, the data we get here will need to show that we have major problems likely to hit us in important real-world scenarios that we can't fix in simpler ways before I can adopt it.

sunfishcode avatar Oct 03 '14 00:10 sunfishcode

@juj: I should also mention that the SSE1 tests you have here look like really great tools, and I'm definitely looking forward to using them. Being synthetic benchmarks, they'll give us lots of data, and with context and interpretation, such data can be very powerful.

sunfishcode avatar Oct 03 '14 00:10 sunfishcode

I've now worked on the quest to produce real world benchmarks to the extent that I think is useful at this point.

First off, here are some places that were looked but rejected:

  • Ogg Theora: This was my first go-to for a SIMD example, but unfortunately I came to realize that Theora uses handwritten MMX assembly, which makes it unsuitable for a benchmark.
  • Ogg Vorbis: Surprisingly, it does not contain SIMD; the authors commented on IRC that they want to keep it as a clean "reference implementation" and advised looking for SIMD-optimized versions elsewhere.
  • Opus Codec (http://www.opus-codec.org): has some SIMD, but the authors stated that their SSE2 optimizations are currently under way and not yet complete. This is probably a good benchmark once that work is done, and worth a revisit.

I've assembled the following two codebases for building with JS-SIMD:

Bullet physics (ammo.js):

  • Worked to patch Emscripten and Bullet in ammo.js repository so that it builds with SSE2 enabled.
  • The Emscripten repository is here: https://github.com/juj/emscripten/tree/sse2
  • the ammo.js repository here: https://github.com/juj/ammo.js/tree/sse .
  • Will likely be a nice benchmark, however there is still some work needed on the Emscripten side to get it running (in particular issues https://github.com/kripken/emscripten/issues/3009, https://github.com/kripken/emscripten/issues/2848 and https://github.com/kripken/emscripten/issues/2855 )
  • Build instructions for your own STR at https://github.com/kripken/emscripten/issues/3009

My own MathGeoLib math library: http://clb.demon.fi/MathGeoLib/nightly/

  • Has unit test and benchmarks suite.
  • Online benchmarks suite graphs results like http://clb.demon.fi/dump/MathGeoLib_testresults/index.html?revision=f5646ed848d79f66ea65c22c070967c8e7f1357a
  • This library has strict build modes for <=SSE1, <=SSE2, <=SSE3 etc. support. Currently trying SSE1 build mode only with Emscripten.
  • Builds with Emscripten but does not run due to issue at https://github.com/kripken/emscripten/issues/3010 .

This is where my effort got blocked. We are not yet in a state where we could start building actual benchmark projects, since our support for SSE1 and SSE2 is not yet complete enough. To be able to build benchmarks, we need to resolve the following:

  • merge the SSE1 suite at https://github.com/kripken/emscripten/pull/2792
  • resolve https://github.com/kripken/emscripten/issues/2855 "SIMD code does not validate as asm.js", because the polyfill impacts correctness as NaN canonicalization is breaking loading and storing bitmasks in SSE registers, see https://github.com/kripken/emscripten/issues/2840 .
  • resolve the other blocking SIMD limitations at https://github.com/kripken/emscripten/issues?q=is%3Aopen+is%3Aissue+label%3ASIMD+
  • add SSE2 support and unit test suite similar to https://github.com/kripken/emscripten/pull/2792 does for SSE1.

I also did an audit of the Unreal Engine 4 and Unity 3D codebases for their SSE uses, but the conclusion is that it does not make sense to try to leap to those quite yet, because we cannot build smaller examples at the moment. This will be retried later once our support progresses some more.

I still hold that the synthetic benchmark suite that I wrote at http://clb.demon.fi/dump/results_sse1_20140929.html is the best benchmark we can look at from the spec and porting perspective at the moment, because it explicitly visualizes the relative performance of native vs JavaScript and scalar vs SIMD for each SSE1 instruction. This is the only honest way we have of measuring the performance right now, because it covers the full API.

Going forward, what I would like the working group to decide for the JS-SIMD from the SSE1 intrinsics perspective are the following:

  • which SSE1 instructions will have direct 1:1 instruction level support in JS-SIMD? (addps, ... ?) What is the language syntax to access those instructions?
  • which SSE1 instructions will not be accessible directly but sit behind an emulated but reasonably fast path (whatever that means - perhaps "some speedup", or "faster, or at least as fast as scalar")? (maxps, ... ?) What is the expected emulation cost for these? (in terms of instructions, clock cycles, or something else that is at least somewhat quantifiable)
  • which SSE1 instructions will be put in an expected slow path? (the operation is available in some JS-SIMD expressible way, but there is no performance expected)
  • which SSE1 instructions will not be accessible in JS-SIMD at all? What is the rationale for leaving those out from the spec? (useless/superseded? not important? too hard/too much to add at this point?) What should a developer who is migrating existing SSE1 code over to JS-SIMD do with such constructs?

SSE1 is of course not special in any way, except that it's the first SSE instruction set; SSE2, SSE3 and later are sure to follow as we build up the support in Emscripten, but I'm limiting the scope of the discussion here since the instruction sets are large.

The reason I am asking these kinds of questions is not to push anyone into undesirable extra exercises, but simply because I know I will be the contact person for dozens of developers who will port their native codebases over to JS-SIMD with Emscripten, and I can anticipate that these are exactly the types of questions they will be posing. We will need to provide the necessary support material for such Emscripten developers or these ports will just not happen, so to me it makes sense to ask these questions before the spec is finished, so that we can say that the spec was designed to have proper answers for each. Is this something that the working group could do? I think this is similar in scope to the WebGL Specification section 6 "Differences Between WebGL and OpenGL ES 2.0" at https://www.khronos.org/registry/webgl/specs/latest/1.0/#6 .

Also, I'd like to see the designed status of each instruction (fast path vs slow path) reflected in the synthetic SSE1 benchmark, which it isn't yet. I would like to understand, if we are going for the "set intersection" API for JS-SIMD v1.0, what the optimized version of the synthetic SSE1 benchmark looks like. Currently the version I am able to run ranges between 50x and 1000x slower than native for some instructions, because we don't have a fast asm.js-validating path available yet. How fast can we get these in the current state of the spec?

juj avatar Nov 21 '14 01:11 juj

Side note: I tried the bullet3 SSE path (native) on Linux before; however, I didn't get a good speedup there. See https://github.com/bulletphysics/bullet3/issues/66 for details.

huningxin avatar Nov 21 '14 07:11 huningxin