llm.c
Add `repkv_backward_kernel2` and `repkv_kernel2` (llama3 branch)
Changes
- Add `repkv_backward_kernel2`, which improves on `repkv_backward_kernel1` by reducing the number of threads used, per @karpathy's suggestion.
- Add `repkv_kernel2`, which is similar to `repkv_backward_kernel2`.
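The kernel source is not reproduced in this description, so the following is only a CPU reference sketch in C of one way to read "reducing threads": index the work by packed input element instead of by replicated output element. The function names (`repkv_map`, `repkv_forward_ref`, `repkv_backward_ref`) and the tensor layouts in the comments are assumptions made for illustration, not code taken from the PR.

```c
#include <assert.h>
#include <stddef.h>

// Assumed layouts (not confirmed by the PR text):
//   packed:     (B, T, NH + 2*NKV, HD)  NH query heads, then NKV key heads, then NKV value heads
//   replicated: (B, T, 3, NH, HD)       Q, K and V each expanded to NH heads

// Hypothetical helper: map a packed head index to its Q/K/V slot,
// the first replicated head it feeds, and how many copies it has.
static void repkv_map(int h, int NH, int NKV, int *qkv, int *first, int *copies) {
    int rep = NH / NKV;  // replication factor for K and V heads
    if (h < NH)            { *qkv = 0; *first = h;                   *copies = 1;   }
    else if (h < NH + NKV) { *qkv = 1; *first = (h - NH) * rep;       *copies = rep; }
    else                   { *qkv = 2; *first = (h - NH - NKV) * rep; *copies = rep; }
}

// Forward: each iteration of the flat loop stands in for one GPU thread.
// One "thread" per packed input element writes all of its replicas.
void repkv_forward_ref(float *out, const float *inp,
                       int B, int T, int NH, int NKV, int HD) {
    assert(NKV > 0 && NH % NKV == 0);
    int C = NH + 2 * NKV;
    size_t n = (size_t)B * T * C * HD;
    for (size_t idx = 0; idx < n; idx++) {
        int d = (int)(idx % (size_t)HD);
        int h = (int)((idx / (size_t)HD) % (size_t)C);
        size_t bt = idx / ((size_t)HD * (size_t)C);
        int qkv, first, copies;
        repkv_map(h, NH, NKV, &qkv, &first, &copies);
        for (int r = 0; r < copies; r++) {
            size_t o = ((bt * 3 + (size_t)qkv) * (size_t)NH + (size_t)(first + r)) * (size_t)HD + (size_t)d;
            out[o] = inp[idx];
        }
    }
}

// Backward: the same indexing, but each "thread" sums the gradients of its replicas.
void repkv_backward_ref(float *dinp, const float *dout,
                        int B, int T, int NH, int NKV, int HD) {
    assert(NKV > 0 && NH % NKV == 0);
    int C = NH + 2 * NKV;
    size_t n = (size_t)B * T * C * HD;
    for (size_t idx = 0; idx < n; idx++) {
        int d = (int)(idx % (size_t)HD);
        int h = (int)((idx / (size_t)HD) % (size_t)C);
        size_t bt = idx / ((size_t)HD * (size_t)C);
        int qkv, first, copies;
        repkv_map(h, NH, NKV, &qkv, &first, &copies);
        float acc = 0.0f;
        for (int r = 0; r < copies; r++) {
            size_t o = ((bt * 3 + (size_t)qkv) * (size_t)NH + (size_t)(first + r)) * (size_t)HD + (size_t)d;
            acc += dout[o];
        }
        dinp[idx] = acc;
    }
}
```

Under these assumptions the loop covers `B * T * (NH + 2*NKV) * HD` work items, whereas indexing by output element would cover `B * T * 3 * NH * HD`. In the backward pass, each packed element also owns its own sum, so no atomic adds would be needed.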
Here is the test output for repkv_backward_kernel2
# ./repkv_backward 2
Using kernel 2
Checking block size 32.
0.680375 0.680375
-0.211234 -0.211234
0.566198 0.566198
0.596880 0.596880
0.823295 0.823295
Checking block size 64.
0.680375 0.680375
-0.211234 -0.211234
0.566198 0.566198
0.596880 0.596880
0.823295 0.823295
Checking block size 128.
0.680375 0.680375
-0.211234 -0.211234
0.566198 0.566198
0.596880 0.596880
0.823295 0.823295
Checking block size 256.
0.680375 0.680375
-0.211234 -0.211234
0.566198 0.566198
0.596880 0.596880
0.823295 0.823295
Checking block size 512.
0.680375 0.680375
-0.211234 -0.211234
0.566198 0.566198
0.596880 0.596880
0.823295 0.823295
Checking block size 1024.
0.680375 0.680375
-0.211234 -0.211234
0.566198 0.566198
0.596880 0.596880
0.823295 0.823295
All results match. Starting benchmarks.

block_size 32 time 1.8824 ms
block_size 64 time 0.9740 ms
block_size 128 time 0.9716 ms
block_size 256 time 0.9740 ms
block_size 512 time 1.0151 ms
block_size 1024 time 1.0725 ms
Execution time is improved compared to kernel1: roughly 1.7x to 1.9x faster across the tested block sizes.
For comparison, the kernel1 timings below are from the previous PR (https://github.com/karpathy/llm.c/pull/764):
All results match. Starting benchmarks.
block_size 32 time 3.2461 ms
block_size 64 time 1.7509 ms
block_size 128 time 1.7374 ms
block_size 256 time 1.7441 ms
block_size 512 time 1.8092 ms
block_size 1024 time 2.0443 ms
Here is the test output for repkv_kernel2
# ./repkv 2
Using kernel 2
Checking block size 32.
0.680375 0.680375
-0.211234 -0.211234
0.566198 0.566198
0.596880 0.596880
0.823295 0.823295
Checking block size 64.
0.680375 0.680375
-0.211234 -0.211234
0.566198 0.566198
0.596880 0.596880
0.823295 0.823295
Checking block size 128.
0.680375 0.680375
-0.211234 -0.211234
0.566198 0.566198
0.596880 0.596880
0.823295 0.823295
Checking block size 256.
0.680375 0.680375
-0.211234 -0.211234
0.566198 0.566198
0.596880 0.596880
0.823295 0.823295
Checking block size 512.
0.680375 0.680375
-0.211234 -0.211234
0.566198 0.566198
0.596880 0.596880
0.823295 0.823295
Checking block size 1024.
0.680375 0.680375
-0.211234 -0.211234
0.566198 0.566198
0.596880 0.596880
0.823295 0.823295
All results match. Starting benchmarks.

block_size 32 time 1.7765 ms
block_size 64 time 0.9856 ms
block_size 128 time 0.9781 ms
block_size 256 time 0.9887 ms
block_size 512 time 1.0429 ms
block_size 1024 time 1.1434 ms
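Execution time is improved compared to kernel1: about 2x faster at block size 32 and roughly 1.6x faster at block sizes 64 to 1024. The kernel1 timings for comparison: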
block_size 32 time 3.6582 ms
block_size 64 time 1.5909 ms
block_size 128 time 1.5868 ms
block_size 256 time 1.5798 ms
block_size 512 time 1.6164 ms
block_size 1024 time 1.8981 ms