utilities for mixed-precision tests/benchmarks
This allows us to compile a single executable that can serve as a test/benchmark for the f32, f16, and bf16 versions of the kernels. So far, I've updated only those test files which already defined a BF16 macro.
Caveat: This currently tries to compile the float, half, and bfloat16 versions into a single executable, so compilation fails if any of these types isn't available. This is something we need to improve at some point, once we have a general strategy in place for how to handle older hardware.
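To make the idea concrete, here is a minimal host-only sketch of the pattern, not the PR's actual code. The names `run_kernel_test`, `dispatch_by_precision`, `half_standin`, and `bf16_standin` are hypothetical; the two structs only stand in for CUDA's `half` and `__nv_bfloat16` so the sketch builds without the CUDA headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-ins so the sketch builds without CUDA. In a real test these would be
// `half` (cuda_fp16.h) and `__nv_bfloat16` (cuda_bf16.h).
struct half_standin { std::uint16_t bits; };
struct bf16_standin { std::uint16_t bits; };

// Hypothetical per-precision test body. A real one would allocate floatX
// buffers, launch the kernel, and compare against a CPU reference. Here it
// only reports how many bytes one buffer of floatX would occupy.
template <typename floatX>
std::size_t run_kernel_test(std::size_t num_elements) {
    return num_elements * sizeof(floatX);
}

// Hypothetical runtime dispatcher: one executable, precision chosen by name.
// All three instantiations are compiled unconditionally, which is why the
// build fails if any one of the types is unavailable.
// Returns 0 for an unknown precision name.
inline std::size_t dispatch_by_precision(const std::string& precision,
                                         std::size_t num_elements) {
    if (precision == "f32")  return run_kernel_test<float>(num_elements);
    if (precision == "f16")  return run_kernel_test<half_standin>(num_elements);
    if (precision == "bf16") return run_kernel_test<bf16_standin>(num_elements);
    return 0;
}
```

A test binary built this way could take the precision as a command-line argument and pass it to `dispatch_by_precision`, so switching precision would not require a rebuild.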
This complicates dev/cuda quite a bit, with templates and macros, both a bit scary. What is the problem that it is trying to solve? Isn't it the case that our CI could just compile all the kernels separately for all precisions we care about and test them one by one?
It's less about automatic testing and more about human testing and profiling, where I find it quite convenient not to have to recompile the tests for each precision. It's also about reducing duplication between the different kernel test files, so that they don't get out of sync.
Personally, I find the template solution much cleaner than moving the ifdefs into common.h and having floatX magically appear from there, but that would also be a solution to the problem.
If you don't like the PR in its entirety, there are still some individual things that should be merged; e.g., all the napkin math needs to be updated to actually reflect the floatX type's size.
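As an illustration of the napkin-math point, the following is a sketch under assumptions, not code from the repository. `bytes_moved` and `bandwidth_gb_per_s` are hypothetical helper names; the idea is only that the byte count is derived from `sizeof(floatX)` rather than a hard-coded `sizeof(float)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes moved by a kernel that reads `reads_per_elem`
// and writes `writes_per_elem` values of type floatX per element.
template <typename floatX>
constexpr std::size_t bytes_moved(std::size_t num_elements,
                                  std::size_t reads_per_elem,
                                  std::size_t writes_per_elem) {
    return num_elements * (reads_per_elem + writes_per_elem) * sizeof(floatX);
}

// Hypothetical helper: effective bandwidth in GB/s, given the bytes moved
// and the elapsed time in milliseconds.
inline double bandwidth_gb_per_s(std::size_t bytes, double elapsed_ms) {
    return static_cast<double>(bytes) / (elapsed_ms * 1e-3) / 1e9;
}
```

With a 2-byte type the same kernel moves half the bytes it would with `float`, so a bandwidth figure computed with a fixed 4-byte size would be overstated by a factor of two.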
Will avoid this for now; closing.