Tri Dao
Thanks, we'll fix that in the README. Mamba-130m has 24 layers, which matches a 12-layer Transformer: each Mamba block is roughly half the size of a Transformer block (attention + MLP), so the layer count is doubled to keep parameters comparable.
> Thank you for the detailed reply! I found that the smallest mamba-130m model uses 24 layers instead of 12, according to the config. Is this a case for the...
We compared attention time (softmax(QK^T)V) vs scan time, without the linear projection. The dimensions are different, e.g. in a 1.3B model Transformers would typically have Q, K, V of hidden...
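A rough sketch of that kind of measurement (the shapes, dtypes, and timing harness below are illustrative assumptions, not the exact benchmark script):

```python
# Rough timing sketch: core attention op (softmax(QK^T)V) vs the selective scan
# alone, with no linear projections. All shapes/dtypes are illustrative
# assumptions for a ~1.3B-scale model.
import torch
import torch.nn.functional as F
from mamba_ssm.ops.selective_scan_interface import selective_scan_fn

dev, bf16 = "cuda", torch.bfloat16
batch, seqlen = 2, 2048
nheads, headdim = 16, 128      # assumed attention shapes (hidden dim 2048)
dim, dstate = 4096, 16         # assumed SSM inner dim (expand=2) and state size

q = torch.randn(batch, nheads, seqlen, headdim, device=dev, dtype=bf16)
k, v = torch.randn_like(q), torch.randn_like(q)

u = torch.randn(batch, dim, seqlen, device=dev, dtype=bf16)
delta = torch.rand_like(u)
A = -torch.rand(dim, dstate, device=dev, dtype=torch.float32)
B = torch.randn(batch, dstate, seqlen, device=dev, dtype=bf16)
C = torch.randn_like(B)
D = torch.randn(dim, device=dev, dtype=torch.float32)

def time_ms(fn, iters=100):
    # warmup, then time with CUDA events
    for _ in range(10):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

attn = time_ms(lambda: F.scaled_dot_product_attention(q, k, v, is_causal=True))
scan = time_ms(lambda: selective_scan_fn(u, delta, A, B, C, D))
print(f"attention core: {attn:.3f} ms, selective scan: {scan:.3f} ms")
```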
Q, K, V are bf16 for attention. u, delta, B, C, z are bf16, A and D are fp32 for scan.
Try `selective_scan_fn(u, delta, A, B, C, D)` (no z, delta_bias, delta_softplus) to see if that makes a difference?
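A minimal sketch of that suggestion, assuming the `mamba_ssm` import path and illustrative shapes; dtypes follow the earlier comment (bf16 inputs, fp32 A and D):

```python
# Call selective_scan_fn with and without the optional z / delta_bias /
# delta_softplus arguments to see whether they account for the difference.
import torch
from mamba_ssm.ops.selective_scan_interface import selective_scan_fn

batch, dim, dstate, seqlen = 2, 4096, 16, 2048
dev, bf16 = "cuda", torch.bfloat16

u = torch.randn(batch, dim, seqlen, device=dev, dtype=bf16)
delta = torch.rand(batch, dim, seqlen, device=dev, dtype=bf16)
A = -torch.rand(dim, dstate, device=dev, dtype=torch.float32)
B = torch.randn(batch, dstate, seqlen, device=dev, dtype=bf16)
C = torch.randn(batch, dstate, seqlen, device=dev, dtype=bf16)
D = torch.randn(dim, device=dev, dtype=torch.float32)
z = torch.randn(batch, dim, seqlen, device=dev, dtype=bf16)
delta_bias = torch.zeros(dim, device=dev, dtype=torch.float32)

# Full call, roughly as used inside the Mamba block.
out_full = selective_scan_fn(u, delta, A, B, C, D, z=z,
                             delta_bias=delta_bias, delta_softplus=True)
# Stripped-down call suggested above (no z, delta_bias, delta_softplus).
out_bare = selective_scan_fn(u, delta, A, B, C, D)
print(out_full.shape, out_bare.shape)  # both (batch, dim, seqlen)
```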
We do not have experience with ROCm, but ofc we'd welcome community contributions on this.
There's CUDA code in causal_conv1d, but that's optional; we can fall back to torch's conv1d (see the sketch below). There's also CUDA code in this repo for the selective_scan operation (`csrc`), and maybe it can work w...
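For reference, a minimal sketch of that torch-only fallback; the function name and shapes here are illustrative, not the repo's exact code:

```python
# Depthwise causal conv1d in pure torch (no causal_conv1d CUDA kernel):
# left-pad via padding=width-1, then truncate back to seqlen to keep causality.
import torch
import torch.nn.functional as F

def causal_conv1d_torch(x, weight, bias=None):
    """x: (batch, dim, seqlen); weight: (dim, 1, width) depthwise filter."""
    dim, _, width = weight.shape
    seqlen = x.shape[-1]
    # groups=dim makes the conv depthwise (one filter per channel)
    out = F.conv1d(x, weight, bias, padding=width - 1, groups=dim)
    return out[..., :seqlen]

x = torch.randn(2, 4096, 2048)       # assumed (batch, d_inner, seqlen)
weight = torch.randn(4096, 1, 4)     # assumed kernel width 4
bias = torch.randn(4096)
y = causal_conv1d_torch(x, weight, bias)
print(y.shape)  # (2, 4096, 2048)
```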
We've never tried Windows and idk much about compilation on Windows. Lmk if you figure it out.
Soon :D
We have not tried quantization; it's an open question. Would be very interesting to understand how sensitive the model is to the SSM params. E.g. I could imagine quantizing the...