KAN-benchmarking
Benchmark results
I benchmarked my implementation of ChebyKAN written in CUDA. The code initially ran slower than the pure PyTorch version and used more VRAM, probably because the CUDA code was not optimized (see the update below).
- I added learnable parameters $k$ and $b$, replacing $\tanh(x)$ with $\tanh(kx + b)$ (a rough sketch is given below).
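As a rough illustration of that transform, the kernel below applies a learnable per-feature slope and shift before the tanh. Whether cuChebyKAN stores $k$ and $b$ per feature or as per-layer scalars is an assumption, and the kernel and array names are made up for the example.

```cuda
// Hypothetical sketch of tanh(k*x + b) as an elementwise CUDA kernel.
// Assumes one learnable (k, b) pair per input feature; the real cuChebyKAN
// code may store them differently.
__global__ void scaled_tanh_kernel(const float* __restrict__ x,  // [batch * in_dim] inputs
                                   const float* __restrict__ k,  // [in_dim] learnable slopes
                                   const float* __restrict__ b,  // [in_dim] learnable shifts
                                   float* __restrict__ y,        // [batch * in_dim] outputs
                                   int batch, int in_dim) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (idx < batch * in_dim) {
        int feat = idx % in_dim;                      // which feature's (k, b) to use
        y[idx] = tanhf(k[feat] * x[idx] + b[feat]);
    }
}
```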
Update:
Coalescing memory accesses (by altering the dimension layout) made a major improvement. In the table, df stands for degree-first, by analogy with batch-first.
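A minimal sketch of what degree-first buys, assuming the Chebyshev basis values are materialized in a tensor and reduced against per-degree coefficients (the kernels and names below are illustrative, not the actual cuChebyKAN kernels): with a batch-first layout, neighbouring threads read addresses `degree` floats apart, while a degree-first layout lets a warp read consecutive floats at every degree step.

```cuda
// Illustrative contrast of memory access patterns for the two layouts.

// Batch-first basis[b][i][d]: neighbouring threads are `degree` floats apart
// at each step of the loop, so global loads are poorly coalesced.
__global__ void reduce_batch_first(const float* __restrict__ basis,  // [batch * in_dim][degree]
                                   const float* __restrict__ coeff,  // [degree]
                                   float* __restrict__ out,          // [batch * in_dim]
                                   int batch, int in_dim, int degree) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one (sample, feature) per thread
    if (idx >= batch * in_dim) return;
    float acc = 0.0f;
    for (int d = 0; d < degree; ++d)
        acc += coeff[d] * basis[idx * degree + d];    // strided access across the warp
    out[idx] = acc;
}

// Degree-first basis[d][b][i]: at every degree step the warp reads a
// contiguous run of floats, so the loads coalesce into few transactions.
__global__ void reduce_degree_first(const float* __restrict__ basis, // [degree][batch * in_dim]
                                    const float* __restrict__ coeff, // [degree]
                                    float* __restrict__ out,         // [batch * in_dim]
                                    int batch, int in_dim, int degree) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= batch * in_dim) return;
    float acc = 0.0f;
    for (int d = 0; d < degree; ++d)
        acc += coeff[d] * basis[d * batch * in_dim + idx];  // unit stride across the warp
    out[idx] = acc;
}
```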
Results:
|              | forward (time) | backward (time) | forward (VRAM) | backward (VRAM) | num params | num trainable params |
|--------------|----------------|-----------------|----------------|-----------------|------------|----------------------|
| chebykan-gpu | 53.51 ms       | 54.39 ms        | 1.94 GB        | 2.06 GB         | 167792640  | 167792640            |
| cucheby-gpu  | 53.51 ms       | 51.20 ms        | 2.11 GB        | 2.24 GB         | 167792650  | 167792650            |
| dfCu-gpu     | 20.72 ms       | 51.57 ms        | 1.38 GB        | 1.54 GB         | 167792650  | 167792650            |
The first row comes from the implementation here, while the last two are my implementations.
Parameters (a minimal timing sketch follows the list):
- batch size = 256
- network: 5 layers, widths [2048, 2048, 2048, 2048, 2048, 1]
- num repetitions = 100
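For reference, here is a minimal sketch of how such timings can be collected with CUDA events, assuming a hypothetical `run_forward()` stand-in for one forward pass of the network above. The actual benchmark also measures backward time and peak VRAM, which are omitted here.

```cuda
#include <cuda_runtime.h>

// Average the runtime of `run_forward` over `reps` repetitions using CUDA events.
// `run_forward` is a hypothetical stand-in for one forward pass of the network.
float time_forward_ms(void (*run_forward)(), int reps) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 10; ++i) run_forward();    // warm-up, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i) run_forward();  // timed repetitions
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / reps;                        // mean time per forward pass
}
```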
My implementation: cuChebyKAN. I am willing to learn CUDA optimization tricks and happy to receive suggestions on my implementation 😄