Faster Cauchy Kernels?
Breathtaking work, absolutely amazing application of linear algebra. Beautiful.
A few questions.
The appendix of the S4 article mentions "... implementation of S4 uses the naive O(NL) algorithm ..." and the README.md mentions custom kernels.
Question 0. Did you benchmark the naive O(NL) against the custom kernel?
Question 1. Is the 60x speedup in Table 8 with naive O(NL) or custom kernel? Or is the custom kernel only used during training?
Question 2. What fraction of compute is spent on S4 compared to the MLP/LayerNorm/other components in generation mode?
Apologies for any misunderstandings
- The naive algorithm and the custom kernel primarily differ in memory usage. The naive algorithm materializes the Cauchy matrix, which requires O(NL) ops and O(NL) space, while the custom kernel reduces the space to O(N+L). We did benchmark these to verify the space savings (see the sketch after this list).
- Yes, the kernel is only used during training. The Cauchy kernel is only used to compute the convolution kernel K bar (Equation 5). In settings where we don't use the convolution mode, such as Table 8, which is about autoregressive generation in recurrent mode, the kernel is not used. Table 8 shows the speedup achieved by using recurrence.
- In recurrent mode, I think most of the ops are roughly equally expensive; the MLP and other parts might even be dominant. The S4 part is just a simple matmul, like an RNN (see the recurrent-step sketch below).
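
To illustrate the memory point, here is a minimal sketch (assumed shapes and function names, not the repo's actual code) of the naive Cauchy contraction, which materializes the full N x L matrix, next to a chunked variant whose peak memory stays closer to O(N+L). The fused CUDA kernel performs the same reduction without materializing any intermediate buffer at all.

```python
import torch

def cauchy_naive(v, z, lam):
    """Naive Cauchy contraction: out_j = sum_i v_i / (z_j - lam_i).
    v: (N,) residues, z: (L,) evaluation nodes, lam: (N,) poles."""
    # Materializes the full (N, L) Cauchy matrix -> O(NL) memory
    cauchy_matrix = v.unsqueeze(-1) / (z.unsqueeze(0) - lam.unsqueeze(-1))
    return cauchy_matrix.sum(dim=0)  # (L,)

def cauchy_chunked(v, z, lam, chunk=1024):
    """Same result, but processes z in chunks so peak memory stays
    close to O(N + chunk) instead of O(NL)."""
    out = torch.empty_like(z)
    for start in range(0, z.shape[0], chunk):
        zc = z[start:start + chunk]
        out[start:start + chunk] = (v.unsqueeze(-1) / (zc - lam.unsqueeze(-1))).sum(dim=0)
    return out

# Both variants agree; only the peak memory differs.
N, L = 64, 4096
v = torch.randn(N, dtype=torch.cfloat)
lam = torch.randn(N, dtype=torch.cfloat)
z = torch.randn(L, dtype=torch.cfloat)
assert torch.allclose(cauchy_naive(v, z, lam), cauchy_chunked(v, z, lam))
```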
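
And a minimal sketch (again with assumed shapes, not the repo's implementation, which exploits more structure and handles batches/channels) of what one autoregressive generation step looks like in recurrent mode: a state update and a readout, essentially an RNN cell.

```python
import torch

def ssm_recurrent_step(x, u, A_bar, B_bar, C):
    """One generation step of a discretized state space model.
    x: (N,) hidden state, u: scalar input at this time step,
    A_bar: (N, N), B_bar: (N,), C: (N,) discretized parameters."""
    x_next = A_bar @ x + B_bar * u   # state update: a single matmul, like an RNN
    y = C @ x_next                   # readout of the new state
    return x_next, y
```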