llm.c
Slightly faster GELU for smaller block-size contexts
Alternative GELU implementation.
It processes the input in chunks and uses hardware intrinsics for slightly faster performance.
Results (GPU execution times in ms):
(original implementation) cuda_gelu: [13.78, 13.26, 13.95, 13.0, 12.93, 13.46]
(second implementation) cuda_gelu2: [13.14, 13.0, 13.49, 13.62, 13.22, 12.84]

Mean execution time:
cuda_gelu: 13.4 ms
cuda_gelu2: 13.22 ms
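The reported means can be recomputed from the listed timings with a small helper. This is a sketch only; `mean_ms` and `relative_gain` are hypothetical names and are not part of the benchmark notebook.

```c
#include <stddef.h>

/* Arithmetic mean of n timing samples (in ms). */
double mean_ms(const double *t, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += t[i];
    }
    return sum / (double)n;
}

/* Fraction by which `candidate` is faster than `baseline`
 * (positive means the candidate is faster). */
double relative_gain(double baseline, double candidate) {
    return (baseline - candidate) / baseline;
}
```

Applied to the two lists above, the difference between the means is about 0.18 ms, or roughly 1.3%. That is smaller than the spread between individual runs of either kernel.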
Benchmarked here: https://colab.research.google.com/drive/1Ci2E_A2KMeUxQg05JZoQJIf9b4TXfK0v?usp=sharing