Tri Dao
Mamba is a sequence-to-sequence layer, just like attention. If your data doesn't have a length dimension (i.e. it's not a sequence), then Mamba is likely not a good choice.
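To make the shape requirement concrete, here's the usage pattern from the repo's README: the block maps `(batch, length, dim)` to `(batch, length, dim)`, so your input needs that length dimension.

```python
import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")
model = Mamba(
    d_model=dim,  # model dimension
    d_state=16,   # SSM state expansion factor
    d_conv=4,     # local convolution width
    expand=2,     # block expansion factor
).to("cuda")
y = model(x)
assert y.shape == x.shape  # (batch, length, dim) in, (batch, length, dim) out
```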
You can change the CUDA code in `csrc`.
The zero-shot evals only require evaluating likelihood (to pick among multiple choices), not generation. I don't think the current generation code supports batched generation with different sequence lengths.
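A minimal sketch of what the likelihood-based eval does (the model and tokenizer are placeholders, and harnesses like lm-evaluation-harness also handle details such as length normalization; this assumes a causal LM whose forward returns `.logits`):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pick_choice(model, prompt_ids, choice_ids_list):
    """Score each answer choice by the log-likelihood of its tokens given the
    prompt, and return the index of the best-scoring choice.

    prompt_ids:      1-D LongTensor of prompt token ids
    choice_ids_list: list of 1-D LongTensors, one per answer choice
    """
    scores = []
    for choice_ids in choice_ids_list:
        input_ids = torch.cat([prompt_ids, choice_ids]).unsqueeze(0)  # (1, P + C)
        logits = model(input_ids).logits                              # (1, P + C, vocab)
        # Position i predicts token i + 1, so drop the last position.
        logprobs = F.log_softmax(logits[0, :-1].float(), dim=-1)
        P = prompt_ids.numel()
        targets = input_ids[0, P:]                                    # the choice tokens
        score = logprobs[P - 1 : input_ids.numel() - 1].gather(
            -1, targets.unsqueeze(-1)
        ).sum()
        scores.append(score)
    return int(torch.stack(scores).argmax())
```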
Sorry, I have no experience with TorchScript.
@void-main Are you synchronizing (torch.cuda.synchronize) when you measure the time? The measurement for FlashAttention CUDA barely changes when you increase the sequence length, which seems wrong.
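CUDA kernels launch asynchronously, so without synchronization you mostly measure the (roughly constant) launch overhead rather than the kernel itself, which would explain timings that don't grow with sequence length. A minimal timing sketch (the `fn` being timed is a placeholder):

```python
import torch

def time_fn(fn, *args, n_warmup=10, n_iters=100):
    # Warm up so compilation / autotuning doesn't pollute the measurement.
    for _ in range(n_warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_iters):
        fn(*args)
    end.record()
    # Wait for all queued kernels to finish before reading the clock.
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iters  # milliseconds per call
```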
Agreed, I'm observing the same thing: bf16 matmul is slower than fp16 matmul.
You can use whichever training script / library you'd like, e.g. Megatron, DeepSpeed, Lightning, HF Accelerate, etc. You just have to replace the model definition. For example, Lightning has lit-gpt: https://github.com/Lightning-AI/lit-gpt FlashAttention...
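As a rough sketch of what "replace the model definition" looks like in a plain PyTorch loop (import paths follow the `mamba_ssm` package, but double-check the exact class names in the repo; `dataloader` is a placeholder yielding `(batch, seqlen)` token ids):

```python
import torch
from mamba_ssm.models.config_mamba import MambaConfig
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

# Build the model the same way you'd build a Transformer LM, then hand it
# to whatever trainer / training loop you already use.
config = MambaConfig(d_model=768, n_layer=24, vocab_size=50277)
model = MambaLMHeadModel(config, device="cuda", dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for input_ids in dataloader:
    input_ids = input_ids.to("cuda")
    logits = model(input_ids).logits              # (batch, seqlen, vocab)
    # Standard next-token prediction loss.
    loss = torch.nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```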
A is technically a batch of d_inner diagonal matrices, each of size d_state x d_state. Since each one is diagonal, we don't need to store all the d_state x d_state entries; we...
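A simplified sketch of why the diagonal structure helps (this follows the paper's notation, not the repo's fused scan kernel): A is stored as a `(d_inner, d_state)` tensor holding just the diagonal entries, so the discretization and state update become elementwise operations instead of matrix exponentials and matrix multiplies.

```python
import torch

def selective_ssm_step(h, x_t, delta_t, A_diag, B_t, C_t):
    """One simplified recurrence step with a diagonal A.

    h:       (batch, d_inner, d_state)  hidden state
    x_t:     (batch, d_inner)           input at time t
    delta_t: (batch, d_inner)           step size at time t
    A_diag:  (d_inner, d_state)         diagonal entries of A
    B_t:     (batch, d_state)           input projection at time t
    C_t:     (batch, d_state)           output projection at time t
    """
    # Because A is diagonal, exp(delta * A) is just an elementwise exponential.
    dA = torch.exp(delta_t.unsqueeze(-1) * A_diag)        # (b, d_inner, d_state)
    dB = delta_t.unsqueeze(-1) * B_t.unsqueeze(1)         # (b, d_inner, d_state)
    h = dA * h + dB * x_t.unsqueeze(-1)                   # elementwise state update
    y_t = (h * C_t.unsqueeze(1)).sum(dim=-1)              # (b, d_inner)
    return h, y_t
```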
It was trained with seqlen=2k for an apples-to-apples comparison with Pythia. It seems to extrapolate to around 3k context length, but after that the quality is much worse.
Yes, training on longer context (e.g. 4k or 8k) should help improve the max token length. I think this is a general property of most sequence models (e.g. Transformers should be...