KAN-benchmarking
Benchmark results
I benchmarked my implementation of ChebyKAN written in CUDA. The code initially ran slower than the pure PyTorch version and used more VRAM, probably because the CUDA code was not optimized (see the update below).
- I added learnable parameters $k$ and $b$, replacing $\tanh(x)$ with $\tanh(kx + b)$ (a rough sketch is given below).
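As a rough illustration of that transform, the kernel below applies a learnable per-feature slope and shift before the tanh. Whether cuChebyKAN stores $k$ and $b$ per feature or as per-layer scalars is an assumption, and the kernel and array names are made up for the example.

```cuda
// Hypothetical sketch of tanh(k*x + b) as an elementwise CUDA kernel.
// Assumes one learnable (k, b) pair per input feature; the real cuChebyKAN
// code may store them differently.
__global__ void scaled_tanh_kernel(const float* __restrict__ x,  // [batch * in_dim] inputs
                                   const float* __restrict__ k,  // [in_dim] learnable slopes
                                   const float* __restrict__ b,  // [in_dim] learnable shifts
                                   float* __restrict__ y,        // [batch * in_dim] outputs
                                   int batch, int in_dim) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (idx < batch * in_dim) {
        int feat = idx % in_dim;                      // which feature's (k, b) to use
        y[idx] = tanhf(k[feat] * x[idx] + b[feat]);
    }
}
```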
Update:
Coalescing memory accesses (by altering the dimension layout) made a major improvement. In the table, df stands for degree-first, by analogy with batch-first.
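A minimal sketch of what degree-first buys, assuming the Chebyshev basis values are materialized in a tensor and reduced against per-degree coefficients (the kernels and names below are illustrative, not the actual cuChebyKAN kernels): with a batch-first layout, neighbouring threads read addresses `degree` floats apart, while a degree-first layout lets a warp read consecutive floats at every degree step.

```cuda
// Illustrative contrast of memory access patterns for the two layouts.

// Batch-first basis[b][i][d]: neighbouring threads are `degree` floats apart
// at each step of the loop, so global loads are poorly coalesced.
__global__ void reduce_batch_first(const float* __restrict__ basis,  // [batch * in_dim][degree]
                                   const float* __restrict__ coeff,  // [degree]
                                   float* __restrict__ out,          // [batch * in_dim]
                                   int batch, int in_dim, int degree) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one (sample, feature) per thread
    if (idx >= batch * in_dim) return;
    float acc = 0.0f;
    for (int d = 0; d < degree; ++d)
        acc += coeff[d] * basis[idx * degree + d];    // strided access across the warp
    out[idx] = acc;
}

// Degree-first basis[d][b][i]: at every degree step the warp reads a
// contiguous run of floats, so the loads coalesce into few transactions.
__global__ void reduce_degree_first(const float* __restrict__ basis, // [degree][batch * in_dim]
                                    const float* __restrict__ coeff, // [degree]
                                    float* __restrict__ out,         // [batch * in_dim]
                                    int batch, int in_dim, int degree) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= batch * in_dim) return;
    float acc = 0.0f;
    for (int d = 0; d < degree; ++d)
        acc += coeff[d] * basis[d * batch * in_dim + idx];  // unit stride across the warp
    out[idx] = acc;
}
```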
Results:
|              | forward (time) | backward (time) | forward (VRAM) | backward (VRAM) | num params | num trainable params |
|--------------|----------------|-----------------|----------------|-----------------|------------|----------------------|
| chebykan-gpu | 53.51 ms       | 54.39 ms        | 1.94 GB        | 2.06 GB         | 167792640  | 167792640            |
| cucheby-gpu  | 53.51 ms       | 51.20 ms        | 2.11 GB        | 2.24 GB         | 167792650  | 167792650            |
| dfCu-gpu     | 20.72 ms       | 51.57 ms        | 1.38 GB        | 1.54 GB         | 167792650  | 167792650            |
The first row comes from the implementation here, while the last two are my implementations.
Parameters (a minimal timing sketch follows the list):
- batch size = 256
- network: 5 layers, widths [2048, 2048, 2048, 2048, 2048, 1]
- num repetitions = 100
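For reference, here is a minimal sketch of how such timings can be collected with CUDA events, assuming a hypothetical `run_forward()` stand-in for one forward pass of the network above. The actual benchmark also measures backward time and peak VRAM, which are omitted here.

```cuda
#include <cuda_runtime.h>

// Average the runtime of `run_forward` over `reps` repetitions using CUDA events.
// `run_forward` is a hypothetical stand-in for one forward pass of the network.
float time_forward_ms(void (*run_forward)(), int reps) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 10; ++i) run_forward();    // warm-up, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i) run_forward();  // timed repetitions
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / reps;                        // mean time per forward pass
}
```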
My implementation: cuChebyKAN. I am willing to learn CUDA optimization tricks and happy to receive suggestions on my implementation 😄