
wasm_simd128.h intrinsic naming

Open tlively opened this issue 5 years ago • 41 comments

Currently the intrinsic header uses a different naming convention than the underlying SIMD instructions. An intrinsic name can be derived from the instruction name by prepending wasm_, replacing the . with an _, and dropping the signedness suffix; unsigned interpretations are instead marked by a u in place of the i in the leading interpretation specifier.

For example i32x4.lt_u becomes wasm_u32x4_lt and i32x4.lt_s becomes wasm_i32x4_lt.

I don't think we ever discussed this naming convention as a group, so with the recent instruction renamings, now is a good time to make sure there are no better alternatives before everything stabilizes.

One possibility, for example, would be to get rid of the u prefixes in the lane interpretations and use the same _u and _s suffixes as the instructions. What other ideas do we have? Or do we like the current scheme?

cc @zeux.

tlively avatar Sep 10 '20 17:09 tlively

I like the current naming conventions, and have plenty of code written against the current version of wasm_simd128.h. So if you do change the names, please keep the old ones (at least for a while) for backward compatibility.

Maratyszcza avatar Sep 18 '20 22:09 Maratyszcza

I also like the current conventions just fine, but would also be happy with an explicit _s suffix. However, note that the intrinsics also include things like v8x16 prefixes that are being changed in the instruction naming. So it would be great to understand the full set of differences between the current names and the algorithm outlined in this issue once the renames take place, because I think it's now more complicated than just "remove the signedness suffix".

zeux avatar Sep 18 '20 22:09 zeux

Oh yeah, I guess I was implicitly assuming that we should update the v8x16 prefixes. I can keep the old names for a bit for those as well, as @Maratyszcza suggested.

tlively avatar Sep 18 '20 23:09 tlively

Is the "flexible" SIMD stuff about a vector-length-agnostic proposal (similar to SVE or the RISC-V V spec)?

If that's the case and there will never be explicit 256-bit (or anything else) vectors in WASM SIMD (which I think is the right decision), one possibility would be to drop the lane count. For example, i32x4.lt_u becomes wasm_u32_lt and i32x4.lt_s becomes wasm_i32_lt? I've been thinking about switching WAV to do that, but held off mainly because I was worried about future 256-bit vectors.

One possible wrinkle here is that everything is in the wasm_ namespace instead of something dedicated to WASM SIMD, which could confuse people. Personally, I'd like to see everything moved into a different namespace (maybe wasmv or wv)…

nemequ avatar Mar 17 '21 17:03 nemequ

Yes, so far there is no intention to provide 256-bit fixed-width operations.

This is a good question - really it depends on interop between the two and the requirements for defining user-facing intrinsics. Native intrinsics typically have different versions for different ISA extensions, even if they do the same thing; I am not sure what path Wasm would take here.

BTW, if you are curious how the flexible proposal would handle some of this, we should definitely talk about it in that repository.

penzn avatar Mar 17 '21 17:03 penzn

I hadn't really considered the possibility of using the same API for flexible vectors, but I think there are some obstacles which would make that difficult. For example, you probably want an additional parameter on most functions for the predicate (unless you want to get rid of the predicate in the API). I'd prefer to see a separate namespace for flexible vectors, but having (for example) wasm_i32x4_add for 128-bit and wasm_i32_add for flexible could work.

My concern was more about the possibility of future functions like v256_t wasm_i32x8_add(v256_t, v256_t). IMHO that would be a mistake; only x86 has fixed 256-bit vectors, everyone else seems to be moving directly from 128-bit to flexible. I think it would make sense for WASM to do the same unless it really wants what would effectively be an x86-only API.

If the flexible SIMD would use a different namespace and there isn't going to be a fixed 256-bit SIMD extension, I think dropping the lane counts from the API would make code using it a lot more readable (and easier to write, but I care less about that). Otherwise it could still be done by overloading the functions, which I would be good with but others may find distasteful…

BTW, if you are curious how the flexible proposal would handle some of this, we should definitely talk about it in that repository.

I am very interested in the C/C++ side, yes. I'll watch the repo.

nemequ avatar Mar 17 '21 18:03 nemequ

I'd say flexible SIMD is highly likely to use a different namespace - that is how things usually go with the toolchain.

My concern was more about the possibility of future functions like v256_t wasm_i32x8_add(v256_t, v256_t). IMHO that would be a mistake; only x86 has fixed 256-bit vectors, everyone else seems to be moving directly from 128-bit to flexible. I think it would make sense for WASM to do the same unless it really wants what would effectively be an x86-only API.

If the flexible SIMD would use a different namespace and there isn't going to be a fixed 256-bit SIMD extension, I think dropping the lane counts from the API would make code using it a lot more readable (and easier to write, but I care less about that).

You are right - adding them would be beneficial for x86 at the expense of everybody else, which would not make a lot of sense. To the best of my knowledge there are no proposals to introduce 256-bit ops as of today.

BTW, flexible vectors take an approach somewhat in the same direction as WAV - there are types for lanes, instead of a generic 'vector' type. Also, manipulating the lane count (which a few people would like to see gone) is done via a separate instruction, rather than through an extra parameter, which means that the core API would not need a lane count argument.

penzn avatar Mar 17 '21 22:03 penzn

I just added a lot of tests to the tests branch of WAV to verify that everything compiles down to exactly what we expect. Functions which don't generate the expected result have been annotated in wav.h (in the main branch) as requiring unimplemented-simd128 (grep for "_UNIMPLEMENTED"). I think this should serve as a pretty good list for you, though it doesn't incorporate your change from yesterday.

The exception is the const functions, which generate splat + replace_lane.

One other thing I noticed is that there are 4 different builtins for any_true, but the spec only has a single variant (which makes sense).

Also, if you mask the count on the shift functions in C (i.e., vec << (count & 15)) you'll see the mask operation in the output even though the shifts (theoretically) do this internally.

nemequ avatar Mar 19 '21 17:03 nemequ

@nemequ, I landed a patch in LLVM yesterday to remove the unimplemented-simd128 target feature and the implicit definitions of the corresponding feature macro, but of course that shouldn't stop you from using the same macro for your own purposes. That same patch should enable v128.const by default.

I expect to remove the current any_true_* intrinsics and replace them with a single any_true_v128 intrinsic to mirror the change we made to the instructions.

I submitted https://bugs.llvm.org/show_bug.cgi?id=49655 to track the suboptimal shift issue.

tlively avatar Mar 19 '21 18:03 tlively

I just uploaded https://reviews.llvm.org/D101112 to add the missing intrinsics and update the existing intrinsics to match the new instruction names. The old names are kept as deprecated functions to ease transitioning to the new names. Reviews welcome!

tlively avatar Apr 22 '21 21:04 tlively

@tlively The updated intrinsics look good. I'd like to have const_splat versions though.

Maratyszcza avatar Apr 23 '21 15:04 Maratyszcza

@Maratyszcza, if you do e.g. wasm_i8x16_splat(42), it will actually give you a v128.const i8x16 42, 42, 42, .... Is that sufficient or would having a more explicit intrinsic for achieving that be worth it?

tlively avatar Apr 23 '21 22:04 tlively

It is enough for me, but is probably a bad idea overall: polymorphic macros/functions are not portable to older C codebases.

Maratyszcza avatar Apr 24 '21 01:04 Maratyszcza

This isn't polymorphism in the clang front-end; it's the backend instruction selection detecting that a splat of a constant is better represented with a v128.const than with a splat instruction.

tlively avatar Apr 26 '21 17:04 tlively

Then it is not enough: other compilers may not have the same optimization. E.g. last time I checked, the Microsoft compiler happily generated a multi-instruction sequence for _mm_set1_epi32(const) on x86.

Maratyszcza avatar Apr 28 '21 10:04 Maratyszcza

The first parameter to __builtin_wasm_load*_lane should be const, and wasm_v128_load*_lane shouldn't cast away the const.

__builtin_wasm_abs_i64x2 doesn't generate i64x2.abs https://godbolt.org/z/jY1hf3Pch

Other than that, my biggest concern is that __builtin_wasm_load*_lane are basically unusable for me. If the lane parameter isn't an integer constant expression the compiler will emit an error, so the API has to use a macro. In C I can use a generic selection to choose the right function, but that's not an option in C++. Handling the obvious pattern would fix the issue.

nemequ avatar Apr 30 '21 01:04 nemequ

@tlively let me know if you want me to look into any of this.

penzn avatar May 01 '21 00:05 penzn

Thanks for the feedback @Maratyszcza and @nemequ, and thanks for the offer of help @penzn! I'm currently working on adding a test upstream in clang to directly test the codegen for all of the intrinsics to catch the i64x2.abs issue @nemequ identified and other similar issues.

@penzn, if you want to look into replacing the macro intrinsics with functions using the pattern @nemequ shared, that would be really helpful and would not clash with my current work. If you're interested, I would also appreciate help fixing codegen issues after my test lands.

tlively avatar May 01 '21 01:05 tlively

Another one: wasm_load32_zero and wasm_load64_zero don't generate the right instructions. __builtin_wasm_load32_zero and __builtin_wasm_load64_zero work, which is why I didn't catch this earlier, but wasm_simd128.h doesn't use them. Of course I'd rather LLVM recognized the existing implementations instead of requiring the intrinsics…

Also, the input to __builtin_wasm_load32_zero and __builtin_wasm_load64_zero should be const.

FWIW, I should have a bunch of examples where codegen can be improved soon. My plan is to update the WASM SIMD128 implementation in SIMDe to match the current wasm_simd128.h over the weekend, then port the (verified correct) implementations to WAV so I can use the tests I have there to easily find implementations for which LLVM doesn't generate optimal code (i.e., a single instruction). I'll probably file them in the LLVM bugzilla next week. Sorry, it's going to be a lot of work on the LLVM side, but I think it should be pretty helpful for improving autovectorization.

nemequ avatar May 01 '21 03:05 nemequ

Excellent, I look forward to the reports. Thanks for your work on this!

tlively avatar May 01 '21 05:05 tlively

I have the codegen tests up for review here: https://reviews.llvm.org/D101684. @nemequ, could you take a look through the FIXMEs there and make sure they line up with the issues you've seen?

tlively avatar May 01 '21 05:05 tlively

LGTM. They include all the issues I've seen; your list has a few more because WAV uses the builtins directly (unless I can avoid them altogether and still get the right instructions) instead of going through wasm_simd128.h, and most of these seem to be bugs in the header.

I did just notice that the pointer parameters for __builtin_wasm_load*_lane aren't const but should be. Since wasm_v128_load*_lane are macros the diagnostic is emitted when using wasm_simd128.h: https://godbolt.org/z/e5c84vb17. I'm working around it in WAV by disabling the warning, which wasm_simd128.h could do, but fixing the builtin seems like a better solution.

nemequ avatar May 01 '21 16:05 nemequ

Argh, one last thing, sorry: the inputs to __builtin_wasm_narrow_u_i8x16_i16x8 and __builtin_wasm_narrow_u_i16x8_i32x4 should be signed.

nemequ avatar May 04 '21 03:05 nemequ

@tlively One more thing: the wasm_v128_load*_lane/wasm_v128_store*_lane APIs should use const void*/void* rather than strongly typed pointers.

Maratyszcza avatar May 05 '21 01:05 Maratyszcza

Looks like the tests need to be re-done in terms of C->IR->Wasm. @tlively, since this is more work, let me know if you need help with that as well; as a baseline, it should be possible to take the tests you have written and turn them into two files for the two stages.

penzn avatar May 05 '21 01:05 penzn

Thanks, @penzn! I've already landed the C->IR test here: https://reviews.llvm.org/rGf3b769e82ff3340e96c2e50ac2bd6361fbc3d797. I don't think we'll want a specific IR->Wasm test file that matches the intrinsics, though. I still have my end-to-end test in a separate branch to help me identify specific backend issues.

For example, https://reviews.llvm.org/rG14ca2e5e22e7806c77ca3e5b126e888c9b1c4041 should fix the bug with i64x2.abs not being generated.

tlively avatar May 05 '21 02:05 tlively

How would folks feel about changing the return type of the extract_lane functions and parameters to replace_lane functions to be the specific element types, e.g. int8_t rather than just int? The original motivation for returning int is that the underlying instructions consume and produce i32 values, but that's not consistent with our use of more specific types in other places like the arguments to the splat functions.

tlively avatar May 05 '21 04:05 tlively

How would folks feel about changing the return type of the extract_lane functions and parameters to replace_lane functions to be the specific element types, e.g. int8_t rather than just int?

I'm a little worried about the C++ compiler inserting redundant extend instructions in both cases, since the underlying instructions consume and return int. I expect this is mostly not a problem, though it is possibly ABI-dependent.

(Consistency is good, but could it be that it's the arguments to the splat functions that should be changed to int instead?)

lars-t-hansen avatar May 05 '21 06:05 lars-t-hansen

Yes, it would also be possible to change the arguments to splats to be ints, but IMO the more specific types give a better developer experience. I can add tests that check that no extra extend instructions are emitted, if that sounds good.

tlively avatar May 05 '21 06:05 tlively

Yes, it would also be possible to change the arguments to splats to be ints, but IMO the more specific types give a better developer experience. I can add tests that check that no extra extend instructions are emitted, if that sounds good.

Tests are good... Still wondering if this might be an issue in corner cases, but I'm having a hard time constructing concrete examples for it; it depends on compiler internals. In principle, though, something as simple as (i32.add (i8x16.extract_lane n ...) ...) (transliterated to obvious C++) should tell us something, and I'd settle for some obvious tests like that.

lars-t-hansen avatar May 05 '21 07:05 lars-t-hansen