llm.c
float4 with better vectorization for adamw.cu
On a 3070:
Kernel 2: GPU time 0.0799 ms, CPU time 0.0168 ms
Kernel 3: GPU time 0.0780 ms, CPU time 0.0166 ms
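
For readers without the diff open, here is a minimal sketch of what a float4-vectorized AdamW step might look like. It is not the code from this PR: the names `adamw_update`, `adamw_float4_sketch`, `adamw_float4_launch` and `adamw_float4_selfcheck` are hypothetical, the buffers are assumed to be 16-byte aligned (which `cudaMalloc` guarantees for its base pointers), and CUDA error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <vector>

// One AdamW update for a single element (hypothetical helper, shared by GPU and CPU).
__host__ __device__ inline void adamw_update(float& p, float g, float& m, float& v,
        float lr, float beta1, float beta2, float b1c, float b2c, float eps, float wd) {
    m = beta1 * m + (1.0f - beta1) * g;       // first moment
    v = beta2 * v + (1.0f - beta2) * g * g;   // second moment
    float m_hat = m / b1c;                    // bias correction
    float v_hat = v / b2c;
    p -= lr * (m_hat / (sqrtf(v_hat) + eps) + wd * p);
}

// Each thread loads and stores four consecutive floats as one float4 (128-bit access).
// Assumes all four pointers are 16-byte aligned.
__global__ void adamw_float4_sketch(float* params, const float* grads, float* m, float* v,
        long n, float lr, float beta1, float beta2, float b1c, float b2c, float eps, float wd) {
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    long n4 = n / 4;
    if (i < n4) {
        float4 p4 = reinterpret_cast<float4*>(params)[i];
        float4 g4 = reinterpret_cast<const float4*>(grads)[i];
        float4 m4 = reinterpret_cast<float4*>(m)[i];
        float4 v4 = reinterpret_cast<float4*>(v)[i];
        adamw_update(p4.x, g4.x, m4.x, v4.x, lr, beta1, beta2, b1c, b2c, eps, wd);
        adamw_update(p4.y, g4.y, m4.y, v4.y, lr, beta1, beta2, b1c, b2c, eps, wd);
        adamw_update(p4.z, g4.z, m4.z, v4.z, lr, beta1, beta2, b1c, b2c, eps, wd);
        adamw_update(p4.w, g4.w, m4.w, v4.w, lr, beta1, beta2, b1c, b2c, eps, wd);
        reinterpret_cast<float4*>(params)[i] = p4;
        reinterpret_cast<float4*>(m)[i] = m4;
        reinterpret_cast<float4*>(v)[i] = v4;
    }
    // Scalar tail: the first (n % 4) threads handle the leftover elements.
    long t = n4 * 4 + i;
    if (i < n - n4 * 4) {
        adamw_update(params[t], grads[t], m[t], v[t], lr, beta1, beta2, b1c, b2c, eps, wd);
    }
}

// Host-side launcher for optimizer step t (1-based).
void adamw_float4_launch(float* d_p, const float* d_g, float* d_m, float* d_v, long n,
        float lr, float beta1, float beta2, int t, float eps, float wd) {
    const int block = 256;
    int grid = (int)((n / 4 + block - 1) / block);
    if (grid < 1) grid = 1;  // still need one block for the scalar tail
    float b1c = 1.0f - powf(beta1, (float)t);
    float b2c = 1.0f - powf(beta2, (float)t);
    adamw_float4_sketch<<<grid, block>>>(d_p, d_g, d_m, d_v, n, lr, beta1, beta2,
                                         b1c, b2c, eps, wd);
}

// Runs one step on the GPU and on the CPU; returns the max absolute parameter difference.
float adamw_float4_selfcheck(long n) {
    std::vector<float> p(n), g(n), m(n, 0.0f), v(n, 0.0f), out(n);
    for (long i = 0; i < n; i++) { p[i] = 0.01f * (i % 100); g[i] = 0.001f * ((i % 7) - 3); }
    const float lr = 1e-3f, beta1 = 0.9f, beta2 = 0.999f, eps = 1e-8f, wd = 0.01f;
    size_t bytes = n * sizeof(float);
    float *d_p, *d_g, *d_m, *d_v;
    cudaMalloc(&d_p, bytes); cudaMalloc(&d_g, bytes);
    cudaMalloc(&d_m, bytes); cudaMalloc(&d_v, bytes);
    cudaMemcpy(d_p, p.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_g, g.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_m, m.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_v, v.data(), bytes, cudaMemcpyHostToDevice);
    adamw_float4_launch(d_p, d_g, d_m, d_v, n, lr, beta1, beta2, 1, eps, wd);
    cudaMemcpy(out.data(), d_p, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_p); cudaFree(d_g); cudaFree(d_m); cudaFree(d_v);
    // CPU reference using the same scalar update.
    float b1c = 1.0f - powf(beta1, 1.0f), b2c = 1.0f - powf(beta2, 1.0f), max_diff = 0.0f;
    for (long i = 0; i < n; i++) {
        adamw_update(p[i], g[i], m[i], v[i], lr, beta1, beta2, b1c, b2c, eps, wd);
        max_diff = fmaxf(max_diff, fabsf(p[i] - out[i]));
    }
    return max_diff;
}
```

The update is element-wise and memory-bound, so the sketch only changes how data moves: each thread issues one 128-bit load and store per array instead of four 32-bit ones. The scalar tail keeps it correct when the parameter count is not a multiple of 4.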