Tri Dao

Results: 250 comments of Tri Dao

Thanks, we'll fix that in the README. Mamba-130m has 24 layers, which matches a 12-layer Transformer: each Mamba block has roughly half the parameters of a Transformer layer (attention + MLP), so we use twice as many blocks for the same parameter count.

> Thank you for the detailed reply! I found that the smallest mamba-130m model uses 24 layers instead of 12, according to the config. Is this a case for the...

We compared attention time (softmax(QK^T)V) vs scan time, without the linear projection. The dimensions are different, e.g. in a 1.3B model Transformers would typically have Q, K, V of hidden...
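To make that concrete, here is a rough sketch of how one could time just the softmax(QK^T)V core in isolation (no QKV/output projections). The shapes below are illustrative assumptions, not the exact benchmark configuration; the scan side can be timed the same way by swapping in `selective_scan_fn`.

```python
import torch

# Illustrative shapes only -- not the exact benchmark configuration.
batch, heads, seqlen, head_dim = 1, 32, 2048, 64
device, dtype = "cuda", torch.bfloat16

q = torch.randn(batch, heads, seqlen, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

def attn_core(q, k, v):
    # softmax(QK^T)V only, without the linear projections
    scores = torch.softmax((q @ k.transpose(-1, -2)) * head_dim ** -0.5, dim=-1)
    return scores @ v

# Warmup, then time with CUDA events
for _ in range(5):
    attn_core(q, k, v)
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(20):
    attn_core(q, k, v)
end.record()
torch.cuda.synchronize()
print(f"attention core: {start.elapsed_time(end) / 20:.3f} ms")
```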

Q, K, V are bf16 for attention. u, delta, B, C, z are bf16, A and D are fp32 for scan.

Try `selective_scan_fn(u, delta, A, B, C, D)` (no z, delta_bias, delta_softplus) to see if that makes a difference?
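A minimal sketch of that call, using the dtype convention from the comment above (bf16 inputs, fp32 A and D). The shapes here are purely illustrative.

```python
import torch
from mamba_ssm.ops.selective_scan_interface import selective_scan_fn

batch, dim, dstate, seqlen = 2, 1536, 16, 1024  # illustrative shapes

u = torch.randn(batch, dim, seqlen, device="cuda", dtype=torch.bfloat16)
delta = torch.rand(batch, dim, seqlen, device="cuda", dtype=torch.bfloat16)
A = -torch.rand(dim, dstate, device="cuda", dtype=torch.float32)  # fp32, negative for stability
B = torch.randn(batch, dstate, seqlen, device="cuda", dtype=torch.bfloat16)
C = torch.randn(batch, dstate, seqlen, device="cuda", dtype=torch.bfloat16)
D = torch.randn(dim, device="cuda", dtype=torch.float32)  # fp32

# Minimal call: no z, delta_bias, or delta_softplus
y = selective_scan_fn(u, delta, A, B, C, D)
print(y.shape)  # (batch, dim, seqlen)
```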

We do not have experience with ROCm, but ofc we'd welcome community contributions on this

There's CUDA code in causal_conv1d but that's optional, we can use torch's conv1d. There's CUDA code in this repo for the selective_scan operation (`csrc`) and maybe it can work w...
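As a sketch of that fallback, the causal depthwise conv can be done with plain `F.conv1d` by padding with `d_conv - 1` and trimming the output back to the original sequence length (shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

batch, dim, seqlen, d_conv = 2, 1536, 1024, 4  # illustrative shapes

x = torch.randn(batch, dim, seqlen)           # (batch, dim, seqlen)
weight = torch.randn(dim, 1, d_conv)          # depthwise kernel: one filter per channel
bias = torch.randn(dim)

# Causal depthwise conv with plain torch: pad by d_conv - 1, then drop the extra
# positions on the right so the output keeps length seqlen.
out = F.conv1d(x, weight, bias, padding=d_conv - 1, groups=dim)[..., :seqlen]
out = F.silu(out)  # the Mamba block applies SiLU after the conv
print(out.shape)   # (batch, dim, seqlen)
```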

We've never tried Windows and idk much about compilation on Windows. Lmk if you figure it out.

We have not tried quantization; it's an open question. It would be very interesting to understand how sensitive the model is to the SSM params. E.g. I could imagine quantizing the...
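One rough way to probe that sensitivity, as a sketch only: fake-quantize one of the SSM params (A here) and compare scan outputs. The `fake_quantize` helper and the random tensors are assumptions for illustration; a real probe would use a trained model's parameters and activations.

```python
import torch
from mamba_ssm.ops.selective_scan_interface import selective_scan_fn

def fake_quantize(x, num_bits=8):
    # Simple symmetric per-tensor fake quantization (an assumption for this probe,
    # not something the repo provides).
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max() / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

batch, dim, dstate, seqlen = 2, 1536, 16, 1024  # illustrative shapes
u = torch.randn(batch, dim, seqlen, device="cuda", dtype=torch.bfloat16)
delta = torch.rand(batch, dim, seqlen, device="cuda", dtype=torch.bfloat16)
A = -torch.rand(dim, dstate, device="cuda", dtype=torch.float32)
B = torch.randn(batch, dstate, seqlen, device="cuda", dtype=torch.bfloat16)
C = torch.randn(batch, dstate, seqlen, device="cuda", dtype=torch.bfloat16)
D = torch.randn(dim, device="cuda", dtype=torch.float32)

y_ref = selective_scan_fn(u, delta, A, B, C, D)
y_q = selective_scan_fn(u, delta, fake_quantize(A), B, C, D)
rel_err = (y_q - y_ref).float().norm() / y_ref.float().norm()
print(f"relative error from fake-quantizing A: {rel_err:.3e}")
```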