
Proposal/RFC: full precision math builtins

Open aarongreig opened this issue 3 years ago • 8 comments

Hello :wave:

I've been working on allowing clspv to generate SPIR-V that meets the full math builtin precision requirements of the OpenCL C spec. My approach has been to include more of libclc's builtin implementations in the clspv build files, generate a new clspv--.bc builtins module, and use it to replace the existing one. I'm opening this issue to solicit feedback on this approach before I start trying to upstream anything.

It doesn't come without tradeoffs: adding all these new builtin implementations to the library means we no longer generate the corresponding GLSL instructions for those builtins. I haven't done any testing, but I would assume this means the generated SPIR-V is slower on most devices. One way to account for this would be to make sure that if -cl-fast-relaxed-math is passed we always generate the GLSL instruction rather than linking the more precise implementation (although CL 3.0 does tighten the relaxed-math precision requirements, and I'm not sure whether there are any cases where the Vulkan spec doesn't meet those new requirements). Another way would be to keep the existing behaviour and gate the more precise implementations behind a new flag (--high-precision-math or something); this would also require changes to how we link the builtins module, I think, but it should be doable.

Since non-opaque pointers are soon to be deprecated, and the builtins module change will require an LLVM contribution, this is all gated on full opaque pointer support. I have also been looking (on a very casual basis) at adding that, but I don't have anything ready to contribute there yet.

aarongreig avatar Jul 15 '22 15:07 aarongreig

We've discussed some of this internally so I'll add the same feedback here.

When I tested libclc, it was insufficient to reach full conformance across a wide variety of devices. @rjodinchr has re-tested more recently and can comment on what works and what doesn't. I'm not sure whether #891 will fix the remaining issues or not. For this project to work, it needs to support all the major vendors (Arm, Qualcomm, Nvidia, AMD, and Intel).

From the performance POV, the NativeMathPass would definitely need to be extended to allow the builtin functions to be used with fast math (or -cl-native-math, which is clspv-specific). As it stands, I've tried to put a disclaimer around the use of fma in the support docs because it is egregiously slow. I don't know offhand just how slow the rest of libclc would be for the remaining functions.

I'm tracking the transition to opaque pointers in #816. One of the last tasks will be generating a version of the library with opaque pointers. Both versions will probably need to exist until opaque pointers in clspv have been more thoroughly tested (e.g. against common workloads and CTS). There is a separate issue with libclc about needing overloads for generic address space pointers. This is not a hard requirement for clspv, but would be nice to support eventually. There are some comments related to this in lib/Compiler.cpp.

alan-baker avatar Jul 18 '22 19:07 alan-baker

From my point of view there are two important parts to this issue:

  • having a compliant implementation that works across all the major vendors
  • being able to configure which implementation is the default for each architecture. For example, on Intel and Nvidia the GLSL fma implementation seems to be compliant, so there is no need to use libclc for those architectures.

And yes, I have tested it recently. At first I thought I had it working on Intel, AMD, and Arm, but it seems my setup was buggy. In the end, I still have a few functions that are not working on every architecture:

  • cos / half_cos
  • sin / half_sin
  • tan / half_tan
  • sincos
  • rsqrt
  • pown

The issue with the trigonometric ones seems to be around the argument reduction (which aims at reducing the input into a smaller range like [-pi, pi]). I managed to have them working on Intel by forcing fma to use the GLSL implementation, but this is not a good solution for every platform (which comes back to the idea of having a different default implementation for each function on each architecture). rsqrt is pretty weird, as I have divide and sqrt working on all architectures (with #891). I have not looked at pown.
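To illustrate why argument reduction is the sensitive step, here is a hypothetical Python sketch (not clspv or libclc code) that emulates single-precision arithmetic with the struct module. A naive reduction against a single-precision 2*pi constant multiplies that constant's representation error by the quotient, so large inputs land noticeably far from the correctly reduced argument:

```python
import math
import struct

def f32(x):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Single-precision approximation of 2*pi; it is off from the real value by
# roughly 1.7e-7, and that error gets scaled by the reduction quotient.
TWO_PI_F = f32(2.0 * math.pi)

def reduce_naive(x):
    """Naively fold a positive x into roughly [-pi, pi] using TWO_PI_F."""
    r = f32(math.fmod(x, TWO_PI_F))
    if r > math.pi:
        r = f32(r - TWO_PI_F)
    return r

x = 1.0e6
# Reference reduction done in double precision with the accurate 2*pi.
exact = math.fmod(x, 2.0 * math.pi)
if exact > math.pi:
    exact -= 2.0 * math.pi

naive = reduce_naive(x)
print(naive, exact, abs(naive - exact))  # the naive result is off by ~0.03 radians
```

The quotient for x = 1e6 is around 159,000, so the ~1.7e-7 error in TWO_PI_F is amplified into an absolute error of a few hundredths of a radian in the reduced argument, far beyond any ULP budget for sin/cos. Real implementations avoid this by representing 2/pi (or pi/2) in several limbs.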

rjodinchr avatar Jul 19 '22 11:07 rjodinchr

About rsqrt:

Before the compliant division, here is what the produced SPIR-V code looked like:

%40 = OpExtInst %float %39 Sqrt %35
%41 = OpFDiv %float %float_1 %40

From the Vulkan specification's section on the precision of operations, it seems that most implementations might implement Sqrt as 1 / inverseSqrt.

I suspect the Vulkan driver will optimize the code generated by clspv to end up with only a call to inverseSqrt and no division.

With the compliant division, there is a bit more logic before the division itself. This should prevent the Vulkan driver from optimizing it, thus ending up with two divisions: rsqrt = 1 / sqrt = 1 / (1 / inverseSqrt). This leads to an accumulation of errors that makes the rsqrt test fail.
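The accumulation is easy to reproduce outside a driver. This hypothetical Python sketch rounds every intermediate to single precision (via struct) and compares a direct inverseSqrt against the chained rsqrt = 1 / (1 / inverseSqrt) form; each extra division adds its own rounding step, so the chained result drifts away by ULPs:

```python
import math
import struct

def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def ulp_diff(a, b):
    """Distance in single-precision ULPs between two positive floats."""
    ia = struct.unpack('I', struct.pack('f', a))[0]
    ib = struct.unpack('I', struct.pack('f', b))[0]
    return abs(ia - ib)

worst = 0
x = 1.0
for _ in range(10000):
    x = f32(x * 1.0007)
    # A (near) correctly rounded single-precision inverseSqrt:
    inv_sqrt = f32(1.0 / math.sqrt(x))
    # sqrt emulated as 1 / inverseSqrt, then rsqrt built on top of it,
    # each step rounded to single precision like a real division would be:
    sqrt_via = f32(1.0 / inv_sqrt)
    rsqrt_via = f32(1.0 / sqrt_via)  # rsqrt = 1 / sqrt = 1 / (1 / inverseSqrt)
    worst = max(worst, ulp_diff(rsqrt_via, inv_sqrt))

print(worst)  # > 0: the chained form does not round-trip back to inverseSqrt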

I see two solutions for this test:

  • implement rsqrt as native_divide(1, sqrt), which allows us to go back to the previous working implementation
  • implement rsqrt as inverseSqrt, which should be fine on every platform since the Vulkan precision requirement for it is the same as OpenCL's.

rjodinchr avatar Jul 28 '22 09:07 rjodinchr

Thanks for the feedback!

For example on Intel and Nvidia, the GLSL fma implementation seems to be compliant.

I manage to have them working on Intel by forcing fma to use the GLSL implementation, but this is not a good solution for every platform

I've seen this as well. It doesn't get to the root of the problem with the trig functions, but it's very tempting to suggest adding some kind of -use-native-fma flag to avoid using the libclc implementation. Accuracy issues notwithstanding, it would still be nice to be able to avoid the slowness of libclc's fma if you want, since it impacts other builtins that use it in their implementations as well. I'm happy to work on adding that if there are no major objections from a design standpoint.

The biggest issue I can think of is that "add more flags" isn't a very scalable way to solve this kind of problem, and it certainly isn't a user-friendly one either. The counterpoint that comes to mind is that I'd rather see this kind of tweak gated on an explicit flag than have it be automatic (and therefore opaque) when I pass --platform=nvidia or similar. I'm also coming at this from the perspective of using the tool within a layered OpenCL implementation, though, so configuring a bunch of flags to get the SPIR-V generation just right is something we're already dealing with in that particular use case.

aarongreig avatar Aug 09 '22 11:08 aarongreig

I have discussed that part with @kpet. The idea was that clspv could stay low level and have a way to specify which math builtins should take their implementation from libclc and which from the native version. As that could apply to all math builtins, having a flag per builtin does not seem appropriate. On top of that, we could have some kind of profile in clvk which would describe, for each platform, which flags need to be used.

Also one "issue" with the libclc is that at the moment, if you have an implementation of the fma, it will inline it everywhere. Which does not allow to change the implementation at compile time (I mean kernel compile time).

rjodinchr avatar Aug 09 '22 12:08 rjodinchr

Ah, I didn't realize fma was getting inlined everywhere. I'm looking to see if there's some hint or attribute I could patch in to prevent that - I don't see much benefit to doing that kind of optimization on a builtins library before it gets linked.

Just so I understand, what you're proposing is something along the lines of --native_builtins="fma,divide,foo,bar"? Not that exact interface necessarily, but that general idea? If so, I think that makes sense; I'll start thinking about how the mechanics of it might work.
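Mechanically, this could be as simple as parsing the comma-separated list into a set that the builtin-replacement step consults at link time. A hypothetical Python sketch of the idea (the option name and both helpers are illustrative, not clspv's actual API):

```python
# Hypothetical sketch of how a --native_builtins="fma,divide" style option
# could drive the choice between libclc and native/GLSL implementations.

def parse_native_builtins(option_value):
    """Split a comma-separated option value into a set of builtin names."""
    return {name.strip() for name in option_value.split(',') if name.strip()}

def use_native(builtin, native_set):
    """True if this builtin should keep its native/GLSL lowering instead of
    being replaced by the libclc implementation when linking the module."""
    return builtin in native_set

native = parse_native_builtins("fma, divide")
print(use_native("fma", native))   # native GLSL instruction kept
print(use_native("sin", native))   # libclc implementation linked in
```

A per-platform profile in clvk, as suggested above, would then just be a canned value for this one option rather than a growing family of per-builtin flags.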

aarongreig avatar Aug 09 '22 13:08 aarongreig

Yes something like that

rjodinchr avatar Aug 09 '22 14:08 rjodinchr

This should go some way towards preventing the unwanted inlining: https://reviews.llvm.org/D132362

aarongreig avatar Aug 22 '22 10:08 aarongreig

I've opened a revision to add generic implementations for builtins that failed full bruteforce on the devices I'm testing with: https://reviews.llvm.org/D134887

aarongreig avatar Sep 29 '22 15:09 aarongreig

It turns out we need soft implementations for a lot of the conversion rounding modes to be conformant there, and adding them is pretty easy: https://reviews.llvm.org/D136772. Note this does make linking a bit slower; I could maybe figure out a way to be more selective about it.
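For reference, the explicit rounding modes themselves are straightforward to express in software. A hypothetical Python sketch in the spirit of OpenCL C's convert_int_rtz / convert_int_rtn / convert_int_rtp / convert_int_rte (the function and mode names here are illustrative, not the libclc code):

```python
import math

def convert_int(x, mode):
    """Float-to-int conversion with an explicit rounding mode."""
    if mode == 'rtz':   # toward zero (the OpenCL C default for int conversions)
        return math.trunc(x)
    if mode == 'rtn':   # toward negative infinity
        return math.floor(x)
    if mode == 'rtp':   # toward positive infinity
        return math.ceil(x)
    if mode == 'rte':   # to nearest, ties to even
        f = math.floor(x)
        d = x - f
        if d > 0.5 or (d == 0.5 and f % 2 != 0):
            return f + 1
        return f
    raise ValueError(mode)

print(convert_int(2.5, 'rte'))   # ties go to the even neighbour
print(convert_int(-1.7, 'rtz'))  # truncation toward zero
```

The real soft implementations also have to handle saturation and out-of-range inputs per the OpenCL C conversion rules, which this sketch omits.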

aarongreig avatar Oct 26 '22 16:10 aarongreig

When I last looked at the libclc conversions, there were some obvious bugs in some of the rounding modes, but I got pushback on trying to fix them. Let's make sure we can pass the CTS before these get added to clspv. I wonder if it would be worth investigating a TableGen approach like the headers use, instead of having this separate library/project. I haven't thought too much about it, but I already have difficulties with libclc as a separate subproject in LLVM.

alan-baker avatar Oct 26 '22 17:10 alan-baker

Those conversion modes will also be needed for vstore_half_(rte|rtz|...) and vload_half_(rte|rtz|...).

rjodinchr avatar Oct 27 '22 09:10 rjodinchr

We do see CTS failures with a couple of the conversions, but we're still working to figure out exactly where in the stack the issues are; the SPIR-V looks fine at least. I'll ping for reviews when I've got 100% on the CTS with the devices I'm testing.

aarongreig avatar Oct 27 '22 13:10 aarongreig

I asked Kévin on the clvk Discord whether he'd be willing to land some of the libclc revisions I've opened, namely https://reviews.llvm.org/D134887 and https://reviews.llvm.org/D134040. He said he would, but suggested I ping you about them beforehand, @alan-baker.

aarongreig avatar Oct 27 '22 15:10 aarongreig

Keen to get https://reviews.llvm.org/D134887, https://reviews.llvm.org/D134040, and https://reviews.llvm.org/D132362 merged soon; they're relatively small revisions that improve the math situation substantially.

aarongreig avatar Dec 01 '22 15:12 aarongreig

We've been doing more testing with the libclc conversions, and we've pretty definitively determined that the two remaining issues are compiler-backend related; the SPIR-V generated is correct. Besides the 34 tests affected by those issues, we're passing the full conversions CTS test with the patch https://reviews.llvm.org/D136772.

aarongreig avatar Dec 08 '22 11:12 aarongreig

Those revisions all look ok to me. I'm glad to hear there aren't more issues.

alan-baker avatar Dec 08 '22 14:12 alan-baker

I think we can now close this issue.

We still have some failures in google3, but the issue seems to be on the infra side; running the tests manually passes.

rjodinchr avatar Apr 18 '23 12:04 rjodinchr

@rjodinchr one question: given this change went in, is this still accurate (https://github.com/google/clspv/blob/main/docs/OpenCLCOnVulkan.md#built-in-functions), i.e. "This means kernels will operate as if compiled with --cl-fast-relaxed-math", and/or should we update clvk to no longer pass this option to clspv?

thoughts?

lpavank avatar Apr 18 '23 18:04 lpavank

This part of the documentation is not up-to-date.

At the moment, the default behaviour is to use OpenCL-compliant implementations, which come from libclc. For some functions, though, another implementation is available (through GLSL builtins). Those can be used instead of libclc and will provide better performance; sometimes they are even OpenCL-compliant. To use them, clspv has a --use-native-builtins option. clvk uses this option to select the builtins whose GLSL implementation is compliant and should therefore be used by default.

rjodinchr avatar Apr 18 '23 21:04 rjodinchr