Adding FP16 math builtins
This is a libclc issue, but I think the discussion is also worth having here. I've been working on a couple of ways we could add fp16 implementations of the OpenCL math builtins to the library:
- upcast implementations based on the existing float builtin implementations
- ported implementations from AMD's ROCm libraries
The upcast implementations can largely be generated with a simple macro (something like the sketch below), with some additional logic needed for handling edge cases. I've made this work for a bunch of builtins, but the resulting implementations are noticeably slow.
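For illustration, here is a minimal sketch of the upcast idea with a hypothetical macro name (the real code would also need to cover vector gentypes and the edge cases mentioned above): promote the half argument to float, reuse the existing float builtin, and convert the result back.

```c
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

/* Hypothetical generator macro: define half NAME(half) on top of the
 * existing float overload of NAME. */
#define UPCAST_UNARY_BUILTIN(NAME)               \
  _CLC_OVERLOAD _CLC_DEF half NAME(half x) {     \
    return convert_half(NAME(convert_float(x))); \
  }

UPCAST_UNARY_BUILTIN(sin) /* defines half sin(half) via float sin */
```

The slowdown presumably comes from doing the math at fp32 plus the convert_float/convert_half round trip on every call.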
The ROCm builtins also need edge-case tweaks to pass the new fp16 tests in the CTS (not merged yet, branch here), but they do seem quicker than the upcast approach (more testing to make sure this isn't just confirmation bias is something I'm still working on). The downside is that they are someone else's code. To my untrained eye the licensing situation seems fine: ROCm is currently licensed under the same University of Illinois license that LLVM used to be licensed under, and libclc already has a bunch of AMD copyright lines in it, so including them in the new builtins shouldn't be an issue. That said, more due diligence and coordination will be needed to make sure a porting effort is completely fine and legal.
I'm currently leaning towards the ROCm approach, trading more work for better performance (that may change if my assumptions about how much faster it is are proven wrong). So I'm soliciting objections to moving forward with that, with the reassurance that there is another way to go about it if there are any major objections. I'd also be interested in any other possible solutions I could look into, like other places permissively licensed implementations might be found.
Note that for float/half conversions, we are sometimes using spirv.pack.v2f16 and spirv.unpack.v2f16:
https://github.com/google/clspv/blob/main/lib/Constants.cpp#L62
This is mostly for vload_halfn at the moment, I think.
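For reference, a minimal example of the call in question (plain OpenCL C; the lowering to the pack/unpack intrinsics happens inside clspv):

```c
/* vload_halfn reads packed fp16 from memory and widens to float,
 * which is where spirv.unpack.v2f16 can come into play. */
__kernel void widen(__global const half *src, __global float2 *dst) {
  size_t i = get_global_id(0);
  dst[i] = vload_half2(i, src); /* two packed halves -> one float2 */
}
```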
Is the accuracy story at fp16 roughly the same as at fp32 for Vulkan devices? Vulkan calls out some differences, but I haven't looked too hard at OpenCL's requirements. Do devices exhibit the same profiles of which functions require a polyfill?
Yeah, it's similar in that many of the requirements are defined differently (usually Vulkan defines the accuracy in terms of other functions, and occasionally it uses absolute error within certain ranges), and of course some functions are just missing from Vulkan entirely. I've got a couple of devices I can test on now, my desktop's Intel CPU and a Galaxy S22; the subset of functions I've added implementations for locally is just the ones that fail the new tests on either of them (more often than not both).
Opened a revision to add fp16 implementations: https://reviews.llvm.org/D135268
Incidentally, does anyone know who to contact about getting reviewed libclc revisions committed?
Ha, it turns out the issue I was thinking of was fixed previously, and I was the one who reviewed it. https://reviews.llvm.org/D81999 for reference.