llm.c slightly faster gelu on smaller blocksize contexts

Open AndreSlavescu opened this issue 10 months ago • 0 comments

Adds a second GELU implementation (cuda_gelu2). It processes the input in chunks and uses hardware intrinsics, which gives slightly faster performance.
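The PR text does not include the kernel itself, so the following is only a rough sketch of what "chunks + intrinsics" could look like, not the author's code. It assumes the tanh approximation of GELU, chunks of four floats loaded as a `float4`, and the `__expf` / `__fdividef` fast-math intrinsics. The names `gelu_elem`, `gelu_chunked_kernel`, and `gelu_max_abs_error`, the block size of 128, and the clamp bound are all hypothetical choices made for illustration.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <stdlib.h>

#define GELU_SCALING_FACTOR 0.7978845608f  // sqrt(2/pi)

// Tanh-approximation GELU for one element (hypothetical helper).
// tanh(u) is rewritten as 1 - 2/(exp(2u) + 1) so the fast intrinsics apply.
__device__ __forceinline__ float gelu_elem(float x) {
    float u = GELU_SCALING_FACTOR * (x + 0.044715f * x * x * x);
    // Clamp so __expf stays finite; tanh(+-10) is already +-1 in float.
    u = fminf(fmaxf(u, -10.0f), 10.0f);
    float t = 1.0f - __fdividef(2.0f, __expf(2.0f * u) + 1.0f);
    return 0.5f * x * (1.0f + t);
}

// Each thread handles one chunk of 4 floats with a single 16-byte load/store.
// Assumes inp and out are 16-byte aligned (true for cudaMalloc pointers).
__global__ void gelu_chunked_kernel(float* out, const float* inp, int N) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (i + 3 < N) {
        float4 v = *reinterpret_cast<const float4*>(inp + i);
        float4 r;
        r.x = gelu_elem(v.x);
        r.y = gelu_elem(v.y);
        r.z = gelu_elem(v.z);
        r.w = gelu_elem(v.w);
        *reinterpret_cast<float4*>(out + i) = r;
    } else {
        // Scalar tail for the last partial chunk (no-op when i >= N).
        for (int j = i; j < N; ++j) out[j] = gelu_elem(inp[j]);
    }
}

// Host helper: runs the kernel on N values in [-4, 4) and returns the largest
// absolute difference from a CPU tanhf reference, or INFINITY on a CUDA error.
float gelu_max_abs_error(int N) {
    size_t bytes = (size_t)N * sizeof(float);
    float* h_in = (float*)malloc(bytes);
    float* h_out = (float*)malloc(bytes);
    for (int i = 0; i < N; ++i) h_in[i] = -4.0f + 8.0f * (float)i / (float)N;

    float *d_in = NULL, *d_out = NULL;
    float max_err = INFINITY;
    if (cudaMalloc(&d_in, bytes) == cudaSuccess &&
        cudaMalloc(&d_out, bytes) == cudaSuccess) {
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
        const int block_size = 128;           // a "smaller" block size
        int chunks = (N + 3) / 4;             // one thread per 4-float chunk
        int grid_size = (chunks + block_size - 1) / block_size;
        gelu_chunked_kernel<<<grid_size, block_size>>>(d_out, d_in, N);
        if (cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            max_err = 0.0f;
            for (int i = 0; i < N; ++i) {
                float x = h_in[i];
                float ref = 0.5f * x * (1.0f + tanhf(GELU_SCALING_FACTOR *
                                                     (x + 0.044715f * x * x * x)));
                max_err = fmaxf(max_err, fabsf(h_out[i] - ref));
            }
        }
    }
    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return max_err;
}
```

Calling `gelu_max_abs_error` with a length that is not a multiple of 4 (for example 1003) exercises both the vectorized path and the scalar tail. The fast intrinsics trade a little accuracy for speed, so a check like this against a `tanhf` reference is worth keeping next to any timing comparison.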

Results (execution times in ms):

(original implementation) GPU cuda_gelu execution times: [13.78, 13.26, 13.95, 13.0, 12.93, 13.46]
(second implementation) GPU cuda_gelu2 execution times: [13.14, 13.0, 13.49, 13.62, 13.22, 12.84]

GPU cuda_gelu mean execution time: 13.4 ms
GPU cuda_gelu2 mean execution time: 13.22 ms

Benchmarked here: https://colab.research.google.com/drive/1Ci2E_A2KMeUxQg05JZoQJIf9b4TXfK0v?usp=sharing

Apr 11 '24 16:04 AndreSlavescu
