Tri Dao comments

Results: 280 comments by Tri Dao

I guess mamba.step could be deleted if selective_scan_fn can accept ssm_state as an input param.

Yep, in some sense `step` is the specialized version of `forward` that accepts ssm_state and only moves by 1 step.
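To make the relationship concrete, here is a minimal sketch using a toy scalar recurrence. It is not Mamba's actual code: the functions `forward` and `step` below and the parameters `a`, `b`, `c` are hypothetical stand-ins for a scan that optionally takes an initial state and for its one-token specialization.

```python
def forward(xs, a, b, c, state=0.0):
    """Toy scan: run h = a*h + b*x, y = c*h over a whole sequence.

    `state` is an optional initial hidden state (assumption: this is the
    kind of extra input the question above has in mind).
    """
    ys = []
    for x in xs:
        state = a * state + b * x
        ys.append(c * state)
    return ys, state


def step(x, state, a, b, c):
    """Same recurrence, advanced by exactly one input."""
    state = a * state + b * x
    return c * state, state


xs = [1.0, 2.0, 3.0]
ys_full, final_state = forward(xs, a=0.5, b=1.0, c=2.0)

# Calling step once per token reproduces the full scan.
state, ys_stepwise = 0.0, []
for x in xs:
    y, state = step(x, state, a=0.5, b=1.0, c=2.0)
    ys_stepwise.append(y)

# A scan that accepts a state can also resume where an earlier call stopped.
ys_head, mid_state = forward(xs[:2], a=0.5, b=1.0, c=2.0)
ys_tail, resumed_state = forward(xs[2:], a=0.5, b=1.0, c=2.0, state=mid_state)
```

In this toy setting `step(x, state, ...)` returns the same thing as `forward([x], ..., state=state)`, which is the sense in which one function could stand in for the other.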

why not try 7B or more?

Thanks for the suggestion, we'll try it :D

How do I use mamba on windows?

I don't have access to any Windows machine but maybe someone who does can comment on how to compile for Windows.

CrossAttention To CrossMamba

The source could have a different length to the target I think? So using the source to compute the projections won't give you the correct dimension?

selective_scan_cuda error

You can put the import in a try/except, but I wouldn't call the `selective_scan_ref` function in `selective_scan_fn` if `selective_scan_cuda` is not found. Instead it should error. We don't want...
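A minimal sketch of that pattern, in plain Python: guard the import, but raise when the compiled module is missing rather than switching to a reference path. The helpers `load_optional` and `require_kernel` are hypothetical names for illustration and are not part of the mamba codebase.

```python
import importlib


def load_optional(module_name):
    """Try to import a module; return None instead of raising if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def require_kernel(module, module_name):
    """Fail loudly at call time when the compiled module was not found."""
    if module is None:
        raise RuntimeError(
            f"{module_name} is not available; "
            "refusing to fall back to the reference implementation"
        )
    return module


# The import itself is tolerated to fail...
kernel = load_optional("selective_scan_cuda")

# ...but any function that needs the kernel would start with
#     require_kernel(kernel, "selective_scan_cuda")
# so a missing build surfaces as an error.
```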

Figure 4 -- Mamba vs. Transformer 1.3B and 6.9B Mamba

The appendix contains the details. We follow GPT3 specs (e.g. for 7B, hidden dim = 4096).

Figure 4 -- Mamba vs. Transformer 1.3B and 6.9B Mamba

For 7B we also follow GPT3: hidden dim = 4096, layer = 64 (2 mamba layers have the same number of params as 1 block of attn + MLP). The...
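The 2-to-1 ratio can be checked with rough arithmetic. The sketch below counts only the large linear projections and makes several assumptions not stated in the comment: an MLP with 4x expansion, a Mamba expansion factor of 2, and no biases, norms, convolution, or SSM-specific parameters. The function names are hypothetical.

```python
def attn_mlp_block_params(d, mlp_ratio=4):
    """Approximate weights in one attention + MLP block of width d."""
    attn = 4 * d * d               # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d * d    # up- and down-projection
    return attn + mlp              # 12 * d^2 with mlp_ratio = 4


def mamba_layer_params(d, expand=2):
    """Approximate weights in one Mamba layer of width d."""
    in_proj = 2 * expand * d * d   # projects d -> 2 * (expand * d)
    out_proj = expand * d * d      # projects expand * d -> d
    return in_proj + out_proj      # 6 * d^2 with expand = 2


d = 4096
per_block = attn_mlp_block_params(d)
per_two_mamba = 2 * mamba_layer_params(d)
```

Under these assumptions both counts come to 12 * d^2, so if "layer = 64" counts Mamba layers, that matches 32 attention + MLP blocks at the same hidden dim.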

Mamba-block without convolutional layer

Sure, you can just remove it in the pytorch code.

Variable input sequence length

Variable length is not currently implemented but will be in the future. For now you can pad your sequences.

Variable input sequence length

Yes, padding tokens should be on the right.
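For reference, right-padding a batch of token-id lists can be sketched in plain Python as below. The helper `pad_right` and the choice of `pad_id=0` are assumptions for illustration; use whatever pad token your tokenizer defines.

```python
def pad_right(sequences, pad_id=0):
    """Pad every sequence on the right to the length of the longest one.

    Returns the padded batch and the original lengths, which are handy
    for masking out padded positions in the loss.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


batch, lengths = pad_right([[5, 6, 7], [8], [9, 10]], pad_id=0)
```

Keeping the real tokens at the start of each row means the positions before the padding are processed exactly as they would be in an unpadded sequence, since the model reads left to right.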