Tri Dao
Mamba is a sequence-to-sequence layer, just like attention. If your data doesn't have a length dimension (i.e. it's not a sequence), then Mamba is likely not a good choice.
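To make the shape requirement concrete, here's the usage pattern from the repo's README: the block maps `(batch, length, dim)` to `(batch, length, dim)`, so your input needs that length dimension.

```python
import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")
model = Mamba(
    d_model=dim,  # model dimension
    d_state=16,   # SSM state expansion factor
    d_conv=4,     # local convolution width
    expand=2,     # block expansion factor
).to("cuda")
y = model(x)
assert y.shape == x.shape  # (batch, length, dim) in, (batch, length, dim) out
```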
You can change the CUDA code in `csrc`.
The zero-shot evals only require evaluating likelihood (to pick among multiple choices), not generation. I don't think the current generation code supports batched generation with different sequence lengths.
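A minimal sketch of what the likelihood-based eval does (the model and tokenizer are placeholders, and harnesses like lm-evaluation-harness also handle details such as length normalization; this assumes a causal LM whose forward returns `.logits`):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pick_choice(model, prompt_ids, choice_ids_list):
    """Score each answer choice by the log-likelihood of its tokens given the
    prompt, and return the index of the best-scoring choice.

    prompt_ids:      1-D LongTensor of prompt token ids
    choice_ids_list: list of 1-D LongTensors, one per answer choice
    """
    scores = []
    for choice_ids in choice_ids_list:
        input_ids = torch.cat([prompt_ids, choice_ids]).unsqueeze(0)  # (1, P + C)
        logits = model(input_ids).logits                              # (1, P + C, vocab)
        # Position i predicts token i + 1, so drop the last position.
        logprobs = F.log_softmax(logits[0, :-1].float(), dim=-1)
        P = prompt_ids.numel()
        targets = input_ids[0, P:]                                    # the choice tokens
        score = logprobs[P - 1 : input_ids.numel() - 1].gather(
            -1, targets.unsqueeze(-1)
        ).sum()
        scores.append(score)
    return int(torch.stack(scores).argmax())
```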
Sorry, I have no experience with TorchScript.
@void-main Are you synchronizing (torch.cuda.synchronize) when you measure the time? The measurement for FlashAttention CUDA barely changes when you increase the sequence length, which seems wrong.
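CUDA kernels launch asynchronously, so without synchronization you mostly measure the (roughly constant) launch overhead rather than the kernel itself, which would explain timings that don't grow with sequence length. A minimal timing sketch (the `fn` being timed is a placeholder):

```python
import torch

def time_fn(fn, *args, n_warmup=10, n_iters=100):
    # Warm up so compilation / autotuning doesn't pollute the measurement.
    for _ in range(n_warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_iters):
        fn(*args)
    end.record()
    # Wait for all queued kernels to finish before reading the clock.
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iters  # milliseconds per call
```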
Agreed, I'm observing the same thing: bf16 matmul is slower than fp16 matmul.
You can use whichever training script / library you'd like, e.g. Megatron, DeepSpeed, Lightning, HF Accelerate, etc. You just have to replace the model definition. For example, Lightning has lit-gpt: https://github.com/Lightning-AI/lit-gpt FlashAttention...
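As a rough sketch of what "replace the model definition" looks like in a plain PyTorch loop (import paths follow the `mamba_ssm` package, but double-check the exact class names in the repo; `dataloader` is a placeholder yielding `(batch, seqlen)` token ids):

```python
import torch
from mamba_ssm.models.config_mamba import MambaConfig
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

# Build the model the same way you'd build a Transformer LM, then hand it
# to whatever trainer / training loop you already use.
config = MambaConfig(d_model=768, n_layer=24, vocab_size=50277)
model = MambaLMHeadModel(config, device="cuda", dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for input_ids in dataloader:
    input_ids = input_ids.to("cuda")
    logits = model(input_ids).logits              # (batch, seqlen, vocab)
    # Standard next-token prediction loss.
    loss = torch.nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```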
A is technically a batch of d_inner diagonal matrices, each of size d_state x d_state. Since each one is diagonal, we don't need to store all the d_state x d_state entries; we...
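A simplified sketch of why the diagonal structure helps (this follows the paper's notation, not the repo's fused scan kernel): A is stored as a `(d_inner, d_state)` tensor holding just the diagonal entries, so the discretization and state update become elementwise operations instead of matrix exponentials and matrix multiplies.

```python
import torch

def selective_ssm_step(h, x_t, delta_t, A_diag, B_t, C_t):
    """One simplified recurrence step with a diagonal A.

    h:       (batch, d_inner, d_state)  hidden state
    x_t:     (batch, d_inner)           input at time t
    delta_t: (batch, d_inner)           step size at time t
    A_diag:  (d_inner, d_state)         diagonal entries of A
    B_t:     (batch, d_state)           input projection at time t
    C_t:     (batch, d_state)           output projection at time t
    """
    # Because A is diagonal, exp(delta * A) is just an elementwise exponential.
    dA = torch.exp(delta_t.unsqueeze(-1) * A_diag)        # (b, d_inner, d_state)
    dB = delta_t.unsqueeze(-1) * B_t.unsqueeze(1)         # (b, d_inner, d_state)
    h = dA * h + dB * x_t.unsqueeze(-1)                   # elementwise state update
    y_t = (h * C_t.unsqueeze(1)).sum(dim=-1)              # (b, d_inner)
    return h, y_t
```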
It was trained with seqlen=2k for an apples-to-apples comparison with Pythia. It seems to extrapolate to around 3k context length, but after that the quality is much worse.
Yes, training on longer context (e.g. 4k or 8k) should help improve the max token length. I think this is a general property of most sequence models (e.g. Transformers should be...