utilities for mixed-precision tests/benchmarks

Open ngc92 opened this issue 1 year ago • 2 comments

This allows us to compile a single executable that can serve as test/benchmark for f32, f16, and bf16 versions of the kernels. So far, I've updated only those test files which already defined a BF16 macro.

Caveat: This tries to compile the float, half, and bfloat16 versions into a single executable, so for now compilation fails if any of these types isn't available. We need to improve this at some point, once we have a general strategy for handling older hardware.
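For readers skimming the thread, here is a minimal sketch of the general idea: a test body templated on the element type, plus a runtime switch that picks the instantiation. It is not the code from this PR. `FakeHalf`, `FakeBF16`, `PrecisionName`, `run_test`, and `dispatch` are hypothetical names, and the two structs are stand-ins for the CUDA 16-bit types so that the sketch builds with a plain C++ compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical stand-ins for the CUDA 16-bit types (half, bfloat16).
// They only model the storage size, not the arithmetic.
struct FakeHalf { std::uint16_t bits; };
struct FakeBF16 { std::uint16_t bits; };

// Maps each element type to a printable precision name.
template <typename T> struct PrecisionName;
template <> struct PrecisionName<float>    { static const char* get() { return "f32"; } };
template <> struct PrecisionName<FakeHalf> { static const char* get() { return "f16"; } };
template <> struct PrecisionName<FakeBF16> { static const char* get() { return "bf16"; } };

// Hypothetical test body, written once and instantiated per precision.
// A real test would allocate buffers and launch the kernel; this one only
// reports the buffer size and returns it in bytes.
template <typename floatX>
std::size_t run_test(std::size_t n) {
    assert(n > 0);
    std::size_t bytes = n * sizeof(floatX);
    std::printf("testing %s: %zu elements, %zu bytes\n",
                PrecisionName<floatX>::get(), n, bytes);
    return bytes;
}

// Runtime dispatch: all three instantiations live in one executable, which is
// also why the build fails if any one of the types is unavailable.
std::size_t dispatch(const std::string& precision, std::size_t n) {
    if (precision == "f32")  return run_test<float>(n);
    if (precision == "f16")  return run_test<FakeHalf>(n);
    if (precision == "bf16") return run_test<FakeBF16>(n);
    std::fprintf(stderr, "unknown precision: %s\n", precision.c_str());
    return 0;  // 0 signals an unrecognized precision in this sketch
}
```

Under these assumptions, a call such as `dispatch("bf16", 1024)` would select the 16-bit instantiation at runtime without recompiling.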

ngc92 avatar May 04 '24 15:05 ngc92

This complicates dev/cuda quite a bit, with templates and macros, both a bit scary. What is the problem that it is trying to solve? Isn't it the case that our CI could just compile all the kernels separately for all precisions we care about and test them one by one?

karpathy avatar May 05 '24 22:05 karpathy

It's less about automated testing and more about manual testing and profiling, where I find it quite convenient not to have to recompile the tests for each precision. It's also about reducing duplication between the different kernel test files, so that they don't get out of sync.

Personally, I find the template solution much cleaner than moving the ifdefs into common.h and having floatX magically appear from there, but that would also be a solution to the problem.

If you don't like the PR in its entirety, there are still some individual things that should be merged; e.g., all the napkin math needs to be updated to actually reflect the floatX type's size.
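To illustrate the napkin-math point, here is a rough sketch that derives the byte count from `sizeof(floatX)` instead of a hard-coded 4 bytes per element. The names `bytes_moved` and `bandwidth_gb_per_s` are hypothetical, the kernel is assumed to read one input and write one output per element, and `FakeBF16` is a stand-in for the CUDA bfloat16 type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 2-byte stand-in for the CUDA bfloat16 type.
struct FakeBF16 { std::uint16_t bits; };

// Bytes moved by an assumed elementwise kernel: one read and one write per
// element, each of size sizeof(floatX).
template <typename floatX>
std::size_t bytes_moved(std::size_t n_elements) {
    return 2 * n_elements * sizeof(floatX);
}

// Effective memory bandwidth in GB/s for a measured kernel time in milliseconds.
template <typename floatX>
double bandwidth_gb_per_s(std::size_t n_elements, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    double seconds = elapsed_ms * 1e-3;
    return static_cast<double>(bytes_moved<floatX>(n_elements)) / seconds / 1e9;
}
```

With a hard-coded 4 bytes per element, the reported bandwidth of a 16-bit run would come out twice as high as what the kernel actually achieved.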

ngc92 avatar May 06 '24 18:05 ngc92

will avoid for now, closing.

karpathy avatar May 10 '24 17:05 karpathy