Tri Dao
Thanks, we'll fix that in the README. Mamba-130m has 24 layers, which matches a 12-layer Transformer: each Mamba block is roughly half the size of a Transformer block (attention + MLP), so the layer count is doubled to keep parameters comparable.
> Thank you for the detailed reply! I found that the smallest mamba-130m model uses 24 layers instead of 12, according to the config. Is this a case for the...
We compared attention time (softmax(QK^T)V) vs scan time, without the linear projection. The dimensions are different, e.g. in a 1.3B model Transformers would typically have Q, K, V of hidden...
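A rough sketch of that kind of measurement (the shapes, dtypes, and timing harness below are illustrative assumptions, not the exact benchmark script):

```python
# Rough timing sketch: core attention op (softmax(QK^T)V) vs the selective scan
# alone, with no linear projections. All shapes/dtypes are illustrative
# assumptions for a ~1.3B-scale model.
import torch
import torch.nn.functional as F
from mamba_ssm.ops.selective_scan_interface import selective_scan_fn

dev, bf16 = "cuda", torch.bfloat16
batch, seqlen = 2, 2048
nheads, headdim = 16, 128      # assumed attention shapes (hidden dim 2048)
dim, dstate = 4096, 16         # assumed SSM inner dim (expand=2) and state size

q = torch.randn(batch, nheads, seqlen, headdim, device=dev, dtype=bf16)
k, v = torch.randn_like(q), torch.randn_like(q)

u = torch.randn(batch, dim, seqlen, device=dev, dtype=bf16)
delta = torch.rand_like(u)
A = -torch.rand(dim, dstate, device=dev, dtype=torch.float32)
B = torch.randn(batch, dstate, seqlen, device=dev, dtype=bf16)
C = torch.randn_like(B)
D = torch.randn(dim, device=dev, dtype=torch.float32)

def time_ms(fn, iters=100):
    # warmup, then time with CUDA events
    for _ in range(10):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

attn = time_ms(lambda: F.scaled_dot_product_attention(q, k, v, is_causal=True))
scan = time_ms(lambda: selective_scan_fn(u, delta, A, B, C, D))
print(f"attention core: {attn:.3f} ms, selective scan: {scan:.3f} ms")
```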
Q, K, V are bf16 for attention. u, delta, B, C, z are bf16, A and D are fp32 for scan.
Try `selective_scan_fn(u, delta, A, B, C, D)` (no z, delta_bias, delta_softplus) to see if that makes a difference?
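A minimal sketch of that suggestion, assuming the `mamba_ssm` import path and illustrative shapes; dtypes follow the earlier comment (bf16 inputs, fp32 A and D):

```python
# Call selective_scan_fn with and without the optional z / delta_bias /
# delta_softplus arguments to see whether they account for the difference.
import torch
from mamba_ssm.ops.selective_scan_interface import selective_scan_fn

batch, dim, dstate, seqlen = 2, 4096, 16, 2048
dev, bf16 = "cuda", torch.bfloat16

u = torch.randn(batch, dim, seqlen, device=dev, dtype=bf16)
delta = torch.rand(batch, dim, seqlen, device=dev, dtype=bf16)
A = -torch.rand(dim, dstate, device=dev, dtype=torch.float32)
B = torch.randn(batch, dstate, seqlen, device=dev, dtype=bf16)
C = torch.randn(batch, dstate, seqlen, device=dev, dtype=bf16)
D = torch.randn(dim, device=dev, dtype=torch.float32)
z = torch.randn(batch, dim, seqlen, device=dev, dtype=bf16)
delta_bias = torch.zeros(dim, device=dev, dtype=torch.float32)

# Full call, roughly as used inside the Mamba block.
out_full = selective_scan_fn(u, delta, A, B, C, D, z=z,
                             delta_bias=delta_bias, delta_softplus=True)
# Stripped-down call suggested above (no z, delta_bias, delta_softplus).
out_bare = selective_scan_fn(u, delta, A, B, C, D)
print(out_full.shape, out_bare.shape)  # both (batch, dim, seqlen)
```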
We do not have experience with ROCm, but ofc we'd welcome community contributions on this.
There's CUDA code in causal_conv1d, but that's optional; we can fall back to torch's conv1d (see the sketch below). There's also CUDA code in this repo for the selective_scan operation (`csrc`), and maybe it can work w...
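For reference, a minimal sketch of that torch-only fallback; the function name and shapes here are illustrative, not the repo's exact code:

```python
# Depthwise causal conv1d in pure torch (no causal_conv1d CUDA kernel):
# left-pad via padding=width-1, then truncate back to seqlen to keep causality.
import torch
import torch.nn.functional as F

def causal_conv1d_torch(x, weight, bias=None):
    """x: (batch, dim, seqlen); weight: (dim, 1, width) depthwise filter."""
    dim, _, width = weight.shape
    seqlen = x.shape[-1]
    # groups=dim makes the conv depthwise (one filter per channel)
    out = F.conv1d(x, weight, bias, padding=width - 1, groups=dim)
    return out[..., :seqlen]

x = torch.randn(2, 4096, 2048)       # assumed (batch, d_inner, seqlen)
weight = torch.randn(4096, 1, 4)     # assumed kernel width 4
bias = torch.randn(4096)
y = causal_conv1d_torch(x, weight, bias)
print(y.shape)  # (2, 4096, 2048)
```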
We've never tried Windows and idk much about compilation on Windows. Lmk if you figure it out.
Soon :D
We have not tried quantization; it's an open question. Would be very interesting to understand how sensitive the model is to the SSM params. E.g. I could imagine quantizing the...