The KAT model training time problem
Hi, I read in the paper that the GPU you are using is a single A5000 to train KAT, and I am using a single A6000. When I train a KAT model such as kat_base with the batch size increased to 512, one epoch takes up to a day. I then tried a smaller model, kat_tiny, with the batch size set to 1024, and one epoch still takes up to 10 hours, which is very time-consuming. Is this normal, or am I doing something wrong?
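A quick way to sanity-check numbers like these is to time individual training steps and extrapolate to an epoch instead of waiting a full epoch. The sketch below is a generic PyTorch timing loop, not code from the KAT repository; `model`, `loader`, `optimizer`, and `criterion` are placeholders for the actual KAT model and ImageNet training pipeline.

```python
# Generic per-step timing sketch (not from the KAT repo): measures seconds per
# training step so the epoch time can be estimated as steps_per_epoch * sec_per_step.
import time
import torch

def time_steps(model, loader, optimizer, criterion, device="cuda", n_steps=50):
    model.train()
    torch.cuda.synchronize()
    start = time.time()
    for i, (images, labels) in enumerate(loader):
        if i == n_steps:
            break
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    sec_per_step = (time.time() - start) / n_steps
    # e.g. ImageNet-1k (~1.28M images) at batch size 512 is roughly 2,500 steps/epoch
    print(f"{sec_per_step:.2f} s/step")
    return sec_per_step
```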
Can you check your GPU utilization? That is still very slow. I use 8x A5000 and training takes around 1-2 days for 300 epochs.
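One way to watch utilization from the command line is `watch -n 1 nvidia-smi`. Alternatively, a small polling script along these lines can log it while training runs; it assumes the `nvidia-ml-py` (pynvml) bindings, which are not part of the KAT repo, and a single GPU at index 0.

```python
# Hypothetical monitoring helper: polls GPU utilization and memory usage via pynvml.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU util: {util.gpu}%  memory used: {mem.used / 2**30:.1f} GiB")
    time.sleep(2)

pynvml.nvmlShutdown()
```

Consistently low utilization would point to a data-loading or host-side bottleneck rather than the model itself.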
My GPU utilization is shown above.
Thank you for introducing the Kolmogorov-Arnold Transformer (KAT)! We have also observed similar slow training times with the KAT. In our paper, FlashKAT: Understanding and Addressing the Performance Bottlenecks in the Kolmogorov-Arnold Transformer (https://arxiv.org/abs/2505.13813), we trace the root cause of this slowdown to the group-wise rational backward pass kernel and introduce an effective solution that achieves an 86.5× speedup when training on a single H200 GPU. The corresponding code is available at https://github.com/OSU-STARLAB/FlashKAT.
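For anyone who wants to verify where the time goes on their own setup, a rough `torch.profiler` sketch like the one below can be run on a single training step. The stand-in model and synthetic batch are placeholders to keep the snippet runnable; substitute your actual KAT model. This is not code from the KAT or FlashKAT repositories.

```python
# Rough profiling sketch: profiles one forward/backward step and prints the
# kernels sorted by CUDA time, to see whether the backward pass dominates.
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda"
# Stand-in model; replace with your KAT model for a meaningful trace.
model = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 1000)
).to(device)
criterion = torch.nn.CrossEntropyLoss()

images = torch.randn(64, 3, 224, 224, device=device)
labels = torch.randint(0, 1000, (64,), device=device)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = criterion(model(images), labels)
    loss.backward()
    torch.cuda.synchronize()

# With a real KAT model, a dominant group-wise rational backward kernel here
# would match the bottleneck described in the FlashKAT paper.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```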