Tri Dao

Results 280 comments of Tri Dao

Yep, in some sense `step` is a specialized version of `forward` that accepts `ssm_state` and only moves by 1 step.
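
To make the relationship concrete, here is a minimal sketch using a toy scalar recurrence. It is not the actual Mamba code: the parameters `A`, `B`, `C` and the function bodies are assumptions chosen only to show that calling `step` once per token, carrying `ssm_state` along, gives the same outputs as one `forward` call over the whole sequence.

```python
# Toy scalar state-space recurrence (illustrative only, not Mamba's kernels):
#   h_t = A * h_{t-1} + B * x_t,   y_t = C * h_t
A, B, C = 0.9, 0.5, 2.0  # hypothetical fixed parameters


def step(x_t, ssm_state):
    """Advance the recurrence by one token; return (output, new state)."""
    new_state = A * ssm_state + B * x_t
    return C * new_state, new_state


def forward(xs, ssm_state=0.0):
    """Process a whole sequence; return (outputs, final state)."""
    ys = []
    for x_t in xs:
        ssm_state = A * ssm_state + B * x_t
        ys.append(C * ssm_state)
    return ys, ssm_state


xs = [1.0, -2.0, 0.5, 3.0]
ys_full, final_state = forward(xs)

# Decoding token by token with `step` reproduces the full-sequence outputs.
state = 0.0
ys_stepwise = []
for x_t in xs:
    y_t, state = step(x_t, state)
    ys_stepwise.append(y_t)
```

In this sketch `forward` owns the loop over time, while `step` leaves the loop (and the state) to the caller, which is what autoregressive decoding needs.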

Thanks for the suggestion, we'll try it :D

I don't have access to a Windows machine, but maybe someone who does can comment on how to compile for Windows.

The source could have a different length from the target, I think? So using the source to compute the projections wouldn't give you the correct dimension?

You can put the import in a `try`/`except`, but I wouldn't call the `selective_scan_ref` function in `selective_scan_fn` if `selective_scan_cuda` is not found. Instead it should raise an error. We don't want...
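
A minimal sketch of that pattern, assuming the compiled extension is importable as `selective_scan_cuda`. The wrapper body and the error message are hypothetical, and the real kernel dispatch is elided:

```python
try:
    import selective_scan_cuda  # compiled CUDA extension; may be missing
except ImportError:
    selective_scan_cuda = None  # remember that the import failed


def selective_scan_fn(*args, **kwargs):
    """Hypothetical wrapper: fail loudly rather than silently
    falling back to the slow reference implementation."""
    if selective_scan_cuda is None:
        raise ImportError(
            "selective_scan_cuda is not available; build the CUDA extension "
            "or call selective_scan_ref explicitly."
        )
    # The real dispatch to the CUDA kernel would go here (elided in this sketch).
    raise NotImplementedError("kernel dispatch omitted from this sketch")
```

The module can still be imported on a machine without the extension, but any attempt to run the fast path errors immediately instead of quietly using the reference code.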

The appendix contains the details. We follow the GPT-3 specs (e.g. for 7B, hidden dim = 4096).

For 7B we also follow GPT-3: hidden dim = 4096, layers = 64 (2 Mamba layers have the same number of params as 1 block of attn + MLP). The...
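
As a rough sanity check of the "2 Mamba layers = 1 attn + MLP block" statement, the sketch below counts only the large linear projections. It assumes an expansion factor of 2 for the Mamba layer and a 4x MLP for the Transformer block, and it ignores embeddings, norms, biases, the convolution, and the SSM-specific parameters. These are assumptions for illustration, not figures stated in the comment.

```python
d_model = 4096       # hidden dim from the comment
n_mamba_layers = 64  # layer count from the comment
expand = 2           # assumed expansion factor


def mamba_layer_params(d, e=expand):
    # Input projection (d -> 2*e*d) plus output projection (e*d -> d).
    return 2 * e * d * d + e * d * d


def attn_mlp_block_params(d):
    # Attention: Q, K, V, O projections (4*d^2); MLP with 4x width (8*d^2).
    return 4 * d * d + 8 * d * d


mamba_total = n_mamba_layers * mamba_layer_params(d_model)
attn_total = (n_mamba_layers // 2) * attn_mlp_block_params(d_model)

print(f"Mamba, {n_mamba_layers} layers: {mamba_total / 1e9:.2f}B")
print(f"Attn + MLP, {n_mamba_layers // 2} blocks: {attn_total / 1e9:.2f}B")
```

Under these assumptions both counts come out equal, at a little over 6B parameters before embeddings.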

Sure, you can just remove it in the PyTorch code.

Variable length is not currently implemented but will be in the future. For now you can pad your sequences.

Yes, padding tokens should be on the right.
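
A minimal sketch of right-padding a batch, assuming token ids are plain integers and that `0` is the pad id. Both the `pad_right` helper and the pad id are hypothetical. Because the model processes tokens left to right, padding placed after the real tokens cannot influence their outputs.

```python
def pad_right(sequences, pad_id=0):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]


sequences = [[5, 6, 7], [8, 9], [10]]
batch = pad_right(sequences)

# Keep the true lengths so outputs at padded positions can be ignored later.
lengths = [len(seq) for seq in sequences]
```

The stored `lengths` let you pick out the last real position of each sequence, for example when reading the final-token output.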