SIMD.int8x16.sumOfAbsoluteDifferences on ARM

Open sunfishcode opened this issue 9 years ago • 25 comments

@simhos01 pointed out that the SIMD.int8x16.sumOfAbsoluteDifferences function is not efficiently implementable on NEON. We should determine whether an alternative interface would be more suitable.
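
For reference, here's a minimal scalar sketch of the semantics in question (plain JavaScript, lanes treated as unsigned, as in the NEON code below):

    // Scalar reference for the operation under discussion (not the SIMD API):
    // sum the absolute differences of 16 unsigned byte lanes.
    function sumOfAbsoluteDifferences(a, b) {
      let sum = 0;
      for (let i = 0; i < 16; i++) {
        sum += Math.abs(a[i] - b[i]);
      }
      return sum;
    }

    sumOfAbsoluteDifferences(new Uint8Array(16).fill(3), new Uint8Array(16)); // 48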

sunfishcode avatar Apr 15 '15 21:04 sunfishcode

The approach that I'm most familiar with for video is to sum whole macroblocks or subblocks (16 to 256 bytes) with vector intermediates before doing the final reduce. I'm not sure if this approach is appropriate for x86.

To implement a single unsigned int8x16.sumOfAbsoluteDifferences():

    VABDL.u8 q0, d4, d8    ; 8 differences per operation
    VABAL.u8 q0, d5, d9
    ; more sums could go here
    VPADDL.u16 q0, q0      ; 8->4
    VPADD.u32 d0, d0, d1   ; 4->2
    VPADD.u32 d0, d0, d0   ; 2->1
    VMOV.u32  r0, d0[0]

and for uint16x8 input:

    VABDL.u16 q0, d4, d8    ; 4 differences per operation
    VABAL.u16 q0, d5, d9
    ; etc.
    VPADD.u32 d0, d0, d1   ; 4->2
    VPADD.u32 d0, d0, d0   ; 2->1
    VMOV.u32  r0, d0[0]

The reductions are more than half the work, and latency-hiding rearrangements might be made if the sums covered more input.

Separating the absolute-difference and accumulate operation from the reduce operation at the js level might fit nicely, but the intermediate sums would work best in an opaque type. The code fragments above produce partial horizontal sums in an odd format even before the beginning of the reduce, and I don't expect that ARM and x86 would want to conform to a shared definition of that structure (x86 might just want a scalar in there).

Is an opaque object an appropriate solution, where we define only what happens when a producer and a consumer of the type are used together?

ghost avatar Apr 22 '15 22:04 ghost

I apologize; I've been busy and haven't reviewed this in detail yet.

Also, FWIW, an interesting analogous conversation is happening in LLVM right now: http://reviews.llvm.org/D9029

sunfishcode avatar Apr 28 '15 00:04 sunfishcode

Borrowing an example loop from there, my suggestion is basically just:

opaque_type r;                       // engine-defined accumulator representation
r.clear();
for (i = ...) {
    r.accumulate_sad( *a++, *b++ );  // absolute-difference-and-accumulate, one vector pair per call
}
int sad = r.reduce_sad();            // final horizontal reduction to a scalar

Where x86 implements opaque_type as an int, and other architectures implement it as a vector or whatever else is convenient.

ghost avatar Apr 28 '15 17:04 ghost

The opaque_type solution here looks nice, though I haven't yet determined whether this will be usable for our use cases. I do still plan on digging into this further.

It looks like the analogous LLVM conversation has also stalled somewhat.

sunfishcode avatar May 21 '15 05:05 sunfishcode

@sunfishcode At the call yesterday, you proposed a new solution. Could you describe it here? I am not sure if I understand the details.

littledan avatar Jun 24 '15 17:06 littledan

A few reasons why I'm hesitant to add sumOfAbsoluteDifferences in Phase 1:

  • This is a very specific instruction that could be implemented in terms of other instructions, so it seems like it would need strong performance justification.
  • It seems like a motivated compiler could do a peephole optimization to recognize the sequence of instructions and generate the right code. Is this considered too hard to implement, or too unreliable for JavaScript authors, or are there other disadvantages?
  • The architecture mismatch worries me. How much overhead will this have on x86 due to emulation? How much benefit does the emulated instruction give on various platforms versus implementing it in terms of simpler operations? How much faster does the instruction make video transcoding on x86 and ARM?
  • Do we have other use cases for this operation besides video transcoding, which might ultimately be subsumed by GPU processing?

I don't disagree that we want this eventually; the question in my mind is whether we should include it here or whether it'd be a better fit for the universe API or the long SIMD API (or both).

littledan avatar Jun 24 '15 18:06 littledan

My new proposal is as follows: we drop sumOfAbsoluteDifferences and add absoluteDifference. This is vabd on ARM, and on x86, it might expand to this sequence (AT&T operand ordering):

    movdqa     %xmm0, %xmm2
    psubusb    %xmm1, %xmm0
    psubusb    %xmm2, %xmm1
    por        %xmm1, %xmm0

With VEX encoding, the movdqa could be avoided.

An absoluteDifference function would support a coding idiom for SAD computation which is essentially optimal on ARM and still a worthwhile optimization on x86. The full PSADBW can be added along with many other goodies in the universe API in the future.

absoluteDifference could also be built out of other operations and pattern-matched by JITs (once we add unsigned saturating ops). However, if we're going to have fixed sequences that users have to know about and deliberately utilize in order to get good performance, it's preferable to just name those sequences.
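
For illustration, the composition a JIT would have to pattern-match might look like this. It mirrors the psubusb/psubusb/por sequence above, but note that the unsigned type and the subSaturate/or names are assumptions, since the unsigned saturating ops don't exist in the spec yet:

    // Hypothetical composition (uint8x16, subSaturate, and or are assumed names):
    // for unsigned lanes, |a - b| == saturate(a - b) | saturate(b - a),
    // because one of the two saturating subtractions is always zero.
    function absoluteDifference(a, b) {
      return SIMD.uint8x16.or(
          SIMD.uint8x16.subSaturate(a, b),
          SIMD.uint8x16.subSaturate(b, a));
    }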

This operation is very common in digital image processing, with numerous applications.

A long SIMD API would have most of the same operations as a short SIMD API. The difference is in the programming model, not the set of operations available. Therefore, there is no concern that this is duplicating functionality which would also be useful in the long SIMD API.

Similarly, most of SIMD.js functionality is redundant with GPU functionality. SIMD.js is nevertheless useful because GPU programming involves numerous programming tradeoffs. Therefore, there is no concern that this is duplicating functionality which would also be available from the GPU.

sunfishcode avatar Jun 25 '15 02:06 sunfishcode

Thanks @sunfishcode, this seems like a performant, portable compromise. It's great that Mozilla is ensuring video/image processing can be accelerated in SIMD.js; I completely sidestepped this problem for SIMD in Dart.

@littledan

JIT compilers must be as fast and simple as possible. The Dart VM's compiler is simple (in terms of not doing many optimization passes) and we are constantly getting feedback from developers that compilation times are too long. Put another way, they would rather run 10% slower than suffer longer compilation times.

One of my initial design goals for SIMD was to avoid requiring any optimization passes (pattern matching, auto-vectorization) and make things as explicit as possible. I've had many people with traditional (batch, offline) compiler experience argue for various optimization passes in place of explicit functionality exposed by the API. This goes against the philosophy of SIMD in Dart and, by extension, SIMD.js.

johnmccutchan avatar Jun 25 '15 14:06 johnmccutchan

OK, this doesn't seem like a bad option. I just wanted to be sure. I'll write up the spec language once the polyfill and test change is in.

However, @johnmccutchan, I do think there's a role for compilers. I see the argument that, if we want a guaranteed optimization, we should name it so we get it reliably and explicitly. But I don't think typical implementations of SIMD.js will be so primitive that a peephole optimization like this would be hard to implement. V8, for one, invokes the new TurboFan compiler, which is pretty advanced in terms of its IR representation and optimizations (and it is also used for WebAssembly; it's not just there to get good behavior on JS semantics).

Basically any optimizing compiler needs to be able to apply some algebraic identities for good performance, and as far as I can tell this one wouldn't be especially tough on top of that. The things that make a JIT compiler fast are different: a well-optimized parser, a well-designed IR, good code caching, and perhaps combined, low-iteration optimization passes, not the elimination of optimization. Applying algebraic identities is also different from autovectorization, which is much more unreliable and just not a natural way to do, for example, small matrix computations.

On the other hand, the API design argument is more convincing. So I'm not objecting to including this instruction; I just don't believe this can be motivated by the compiler not being capable of doing optimizations.

littledan avatar Jun 25 '15 17:06 littledan

@littledan I did not say that the philosophy behind SIMD in Dart and SIMD.js was motivated by a compiler being incapable of doing optimizations; rather, JIT compilers have to be fast because they are invoked every time a program is run (startup time), and repeatedly throughout the program's life cycle, pausing the program's execution. Tacking on more and more optimization passes (even simple things like peephole) slows down the compilation pipeline. Batch, offline compilers have the luxury of being able to do this without the constraint of sub-millisecond function compilations.

johnmccutchan avatar Jun 25 '15 18:06 johnmccutchan

@johnmccutchan I don't think this would be a separate pass.

littledan avatar Jun 25 '15 18:06 littledan

@littledan You're missing the point.

johnmccutchan avatar Jun 25 '15 18:06 johnmccutchan

I agree that we should design things to be efficiently compilable. That's why I like SIMD.js's design of using functions, rather than methods, for its operations, and why I pushed back on repeated requests to make SIMD vectors objects: that design is much more amenable to efficient implementation.

However, some things actually are easy for an optimizing compiler to do and won't cause performance regressions. And some things are not that easy, but we have to do them anyway, like register allocation. (SIMD.js could've been designed with an explicit register set, but I'm glad it isn't.) We should keep in mind the way real JavaScript implementations are done, rather than assuming that everything is hard. Maybe I'm wrong about this particular thing being easy, but settling that is a more detailed technical argument, not the high-level point that we don't want to add a peephole optimization pass.

littledan avatar Jun 25 '15 18:06 littledan

The problem with plain old VABD is that its results already contain full-range values after just one iteration. You can't safely add them to the result of a VABD from another row because you might get overflow.

Unless you mean to add them using saturation, of course. Would that be useful, though?

For long-running, non-saturating sums NEON can use VABAL, but the widened differences of a full 128-bit input would occupy 256 bits, so each instruction consumes only half the input and you get separate high- and low-half results.

Also, do we have a sum() reduction operator to get the final block score? I don't fancy 8 or 16 getter calls if the JIT can't be trusted to recognise it as a reduce (NEON can still do it in O(log n) for n lanes).
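
For concreteness, an O(log n) reduce can be written with swizzles rather than per-lane getters. Here's a rough sketch against the polyfill naming of the time (swizzle, extractLane); the exact signatures may differ:

    // Horizontal sum of 8 int16 lanes in log2(8) = 3 vector adds.
    function horizontalSum(v) {
      // 8 lanes -> 4: fold the high four lanes onto the low four.
      v = SIMD.int16x8.add(v, SIMD.int16x8.swizzle(v, 4, 5, 6, 7, 0, 1, 2, 3));
      // 4 -> 2.
      v = SIMD.int16x8.add(v, SIMD.int16x8.swizzle(v, 2, 3, 0, 1, 4, 5, 6, 7));
      // 2 -> 1: the full sum ends up in lane 0.
      v = SIMD.int16x8.add(v, SIMD.int16x8.swizzle(v, 1, 0, 2, 3, 4, 5, 6, 7));
      return SIMD.int16x8.extractLane(v, 0);
    }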

ghost avatar Jun 25 '15 21:06 ghost

One of my initial design goals for SIMD was to avoid requiring any optimization passes (pattern matching, auto-vectorization) and make things as explicit as possible. I've had many people with traditional (batch, offline) compiler experience argue for various optimization passes in place of explicit functionality exposed by the API. This goes against the philosophy of SIMD in Dart and, by extension, SIMD.js.

I don't have words to convey how much I second this. SIMD has only a single purpose, which is performance, and therefore the performance it provides should be an explicit contract rather than an implicit one. Requiring pattern matching or other optimization passes would mean extra work to recognize, implement, and maintain such implicit optimizations (which would certainly go undocumented, even if perhaps implied while writing up the spec); that adds to page load/JIT times, and it risks the performance landscape looking different on different browsers. That would compromise the only purpose SIMD has.

juj avatar Jun 27 '15 00:06 juj

A more complete version of my proposal is now available, in #209.

sunfishcode avatar Jun 27 '15 13:06 sunfishcode

Compatible with @simhos01's observation, my full proposal in #209 includes a widening form of absoluteDifference as well as a horizontal sum function. I realize that this adds yet more fairly specialized functions, but I believe it is still within the scope of widely useful functionality for image manipulation and decoding.

sunfishcode avatar Jul 02 '15 01:07 sunfishcode

How do you get the widened differences of the other half of the lanes? Shuffle and repeat? Or did I just overlook that function?

ARM's native approach to widening operations like this is one operation for the first half of the input vector and another operation for the second half (true for both AArch32 and AArch64; AArch64 follows my wording explicitly, while AArch32 does it with register aliases).

ghost avatar Jul 02 '15 01:07 ghost

Right now, yes, it's shuffle and repeat. I'm open to refining this, but I think this is a good start.
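
To make the combined idiom concrete, here's a rough sketch. The names absoluteDifferenceWidening and horizontalSum are placeholders, not necessarily the #209 spelling, and the load/swizzle signatures follow the polyfill of the time:

    // Placeholder names throughout; this sketches the idiom, not the exact #209 API.
    function blockSAD(a, b, n) {  // a, b: Uint8Arrays; n: a multiple of 16
      let acc = SIMD.int16x8.splat(0);
      for (let i = 0; i < n; i += 16) {
        let va = SIMD.int8x16.load(a, i);
        let vb = SIMD.int8x16.load(b, i);
        // Low 8 lanes: |va - vb| widened to 16-bit lanes, so per-lane sums
        // can't overflow for macroblock-sized n.
        acc = SIMD.int16x8.add(acc, SIMD.int8x16.absoluteDifferenceWidening(va, vb));
        // High 8 lanes: shuffle and repeat, as described above.
        let ha = SIMD.int8x16.swizzle(va, 8, 9, 10, 11, 12, 13, 14, 15,
                                          0, 1, 2, 3, 4, 5, 6, 7);
        let hb = SIMD.int8x16.swizzle(vb, 8, 9, 10, 11, 12, 13, 14, 15,
                                          0, 1, 2, 3, 4, 5, 6, 7);
        acc = SIMD.int16x8.add(acc, SIMD.int8x16.absoluteDifferenceWidening(ha, hb));
      }
      // Final reduce, e.g. the log-time fold sketched earlier in this thread.
      return SIMD.int16x8.horizontalSum(acc);
    }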

sunfishcode avatar Jul 02 '15 01:07 sunfishcode

I don't know if it's really right to standardize something that's just considered a good start. It's not that high-risk (worst case, we have these three orphaned instructions hanging around once a better solution emerges), but what if we standardize only the stuff that is simple and clean the first time around, that we know we want and know optimizes well on the platforms (including unsigned and saturating operations), and leave this more advanced operation for the second round? We can always add things, but we can't remove them very easily. It would be great to hear from a second potential user of this operation before standardizing it.

It sounds like this doesn't map perfectly to either ARM or x86. If we're going to have a universe API in the future, maybe this should just go there. Or, if it's always used over an array in an accumulating way, maybe there should just be a function exposed which takes a TypedArray and does this accumulation at a certain offset and length (I was calling this Long SIMD, but I guess that's not what's meant by Long SIMD); a sketch of that shape is below. In theory, either of these would be faster, but I don't know by how much, or how much of a performance penalty we'd pay for not including the instructions at all, or what users' performance requirements are for this feature. I don't think "getting all the performance of the hardware" is enough of an answer.
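
For illustration only, here is scalar reference semantics for that TypedArray-based shape; the name and signature are invented:

    // Invented name/signature for the "function over TypedArrays" alternative:
    // accumulate the SAD over `length` bytes at the given offsets. An engine
    // could implement this internally with PSADBW or VABD/VABAL loops.
    function sadOverTypedArrays(a, aOffset, b, bOffset, length) {
      let sum = 0;
      for (let i = 0; i < length; i++) {
        sum += Math.abs(a[aOffset + i] - b[bOffset + i]);
      }
      return sum;
    }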

littledan avatar Jul 02 '15 22:07 littledan

The next step in the TC39 process is to request Stage 3 approval. At Stage 3, the task is to "indicate that further refinement will require feedback from implementations", and that's exactly what we need here. The majority of the API is pretty stable at this point, and we really do need feedback from implementations on the handful of additional functions being discussed here.

To me, this isn't about "getting all the performance of the hardware". It's about enabling a very important family of use cases. It's difficult to estimate specific numbers for the API proposed here without further implementation experience, which is why I'd like to proceed and seek that experience.

sunfishcode avatar Jul 02 '15 23:07 sunfishcode

If it's really not set in stone then I agree with the principle of striving for adequacy. In my experience it can be very hard to break the cycle of lack of support because of lack of test cases and lack of test cases because of lack of support. Forward momentum is important.

But I do also have anxiety about things getting stuck halfway through and fossilizing.

ghost avatar Jul 03 '15 02:07 ghost

@sunfishcode

The majority of the API is pretty stable at this point, and we really do need feedback from implementations on the handful of additional functions being discussed here. It's difficult to estimate specific numbers for the API proposed here without further implementation experience, which is why I'd like to proceed and seek that experience.

Although obviously no committee permission is needed to get implementation experience, I guess it'll be easier to remove things from Stage 3 to Stage 4 than to add them. How would you feel about including these three functions, but also highlighting in our presentation to the committee that we are still seeking implementation feedback specifically about these three functions (as opposed to the others, which are less innovative)?

littledan avatar Jul 03 '15 18:07 littledan

Although obviously no committee permission is needed to get implementation experience, I guess it'll be easier to remove things from Stage 3 to Stage 4 than to add them. How would you feel about including these three functions, but also highlighting in our presentation to the committee that we are still seeking implementation feedback specifically about these three functions (as opposed to the others, which are less innovative)?

Yes, I'm happy to do that. Thanks!

sunfishcode avatar Jul 03 '15 19:07 sunfishcode

OK, the spec v0.7 has been updated for this, and I merged the pull request a few days ago. Let's keep the bug thread open to discuss implementation feedback, though.

littledan avatar Jul 06 '15 19:07 littledan