Adding FP16 math builtins
This is a libclc issue, but I think the discussion is also worth having here. I've been working on a couple of ways we could add fp16 implementations of the OpenCL math builtins to the library:
- upcast implementations based on the existing float builtin implementations
- ported implementations from AMD's ROCm libraries
The upcast implementations can largely be generated with a simple macro (something like the sketch below), with some additional logic needed for handling edge cases. I've made this work for a bunch of builtins, but the resulting implementations are noticeably slow.
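For illustration, here is a minimal sketch of the upcast idea with a hypothetical macro name (the real code would also need to cover vector gentypes and the edge cases mentioned above): promote the half argument to float, reuse the existing float builtin, and convert the result back.

```c
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

/* Hypothetical generator macro: define half NAME(half) on top of the
 * existing float overload of NAME. */
#define UPCAST_UNARY_BUILTIN(NAME)               \
  _CLC_OVERLOAD _CLC_DEF half NAME(half x) {     \
    return convert_half(NAME(convert_float(x))); \
  }

UPCAST_UNARY_BUILTIN(sin) /* defines half sin(half) via float sin */
```

The slowdown presumably comes from doing the math at fp32 plus the convert_float/convert_half round trip on every call.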
The ROCm builtins also need edge-case tweaks to pass the new fp16 tests in the CTS (not merged yet, branch here), but they do seem quicker than the upcast approach (more testing to make sure this isn't just confirmation bias is something I'm still working on). The downside is that they are someone else's code. To my untrained eye the licensing situation seems fine: ROCm is currently licensed under the same University of Illinois license that LLVM used to be licensed under, and libclc already has a bunch of AMD copyright lines in it, so including them in the new builtins shouldn't be an issue. That said, more due diligence and coordination will be needed to make sure a porting effort is completely fine and legal.
I'm currently leaning towards the ROCm approach, trading more work for better performance (that may change if my assumptions about how much faster it is are proven wrong). So I'm soliciting objections to moving forward with that, with the reassurance that there is another way to go about it if there are any major objections. I'd also be interested in any other possible solutions I could look into, like other places permissively licensed implementations might be found.
Note that for float/half conversions, we are sometimes using spirv.pack.v2f16 and spirv.unpack.v2f16:
https://github.com/google/clspv/blob/main/lib/Constants.cpp#L62
This is mostly for vload_halfn at the moment, I think.
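For reference, a minimal example of the call in question (plain OpenCL C; the lowering to the pack/unpack intrinsics happens inside clspv):

```c
/* vload_halfn reads packed fp16 from memory and widens to float,
 * which is where spirv.unpack.v2f16 can come into play. */
__kernel void widen(__global const half *src, __global float2 *dst) {
  size_t i = get_global_id(0);
  dst[i] = vload_half2(i, src); /* two packed halves -> one float2 */
}
```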
Is the accuracy story at fp16 roughly the same as at fp32 for Vulkan devices? Vulkan calls out some differences, but I haven't looked too hard at OpenCL's requirements. Do devices exhibit the same profiles of which functions require a polyfill?
Yeah, it's similar in that many of the requirements are defined differently (usually Vulkan defines the accuracy in terms of other functions, and occasionally it uses absolute error within certain ranges), and of course some functions are just missing from Vulkan entirely. I've got a couple of devices I can test on now, my desktop's Intel CPU and a Galaxy S22; the subset of functions I've added implementations for locally is just the ones that fail the new tests on either of them (more often than not both).
Opened a revision to add fp16 implementations: https://reviews.llvm.org/D135268
Incidentally, does anyone know who to contact about getting reviewed libclc revisions committed?
Ha, it turns out the issue I was thinking of was fixed previously, and I was the one who reviewed it. https://reviews.llvm.org/D81999 for reference.