Tri Dao

Results 438 comments of Tri Dao

There's a reference implementation in PyTorch, but it would probably be quite a bit slower.

It implements the same operation, just more memory-efficiently (as the name suggests).

It's moved to [mamba_ssm/modules/block.py](https://github.com/state-spaces/mamba/blob/c0a00bd1808881831ddf43206c69362d4df90cf7/mamba_ssm/modules/block.py#L10)

Probably yes. How would you do it with Transformers?

Yes, this is a good idea. The conv1d implementation actually already supports taking in initial states and returning final states. We just haven't had time to wire everything together.
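To illustrate the idea (this is a NumPy sketch of the general technique, not the mamba_ssm conv1d API; the function name and shapes here are hypothetical), a causal depthwise conv1d can carry state across chunks by left-padding each chunk with the last `kernel_size - 1` inputs of the previous one:

```python
import numpy as np

def causal_conv1d_with_states(x, weight, initial_state=None):
    """Hypothetical sketch: chunked causal depthwise conv1d with state passing.

    x: (dim, seqlen) input chunk; weight: (dim, k) per-channel filter, k >= 2.
    initial_state: (dim, k - 1) trailing inputs from the previous chunk
    (zeros if None, i.e. the start of the sequence).
    Returns (out, final_state) where final_state feeds the next chunk.
    """
    dim, seqlen = x.shape
    k = weight.shape[-1]
    if initial_state is None:
        initial_state = np.zeros((dim, k - 1))
    # Left-pad with the prior context so the output is causal and seamless.
    x_padded = np.concatenate([initial_state, x], axis=-1)
    # Cross-correlation per channel (flip the kernel for np.convolve).
    out = np.stack([
        np.convolve(x_padded[d], weight[d][::-1], mode="valid")
        for d in range(dim)
    ])
    # The last k - 1 inputs are exactly the context the next chunk needs.
    final_state = x_padded[:, -(k - 1):]
    return out, final_state
```

With this state passing, running the conv chunk by chunk produces the same output as running it over the full sequence in one pass.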

Yes, chunk size should be a power of 2; that's what Triton supports. To deal with seqlen not divisible by chunk_size, we load with a mask. Anything outside the seqlen...
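In NumPy terms, a masked load works roughly like this (a hypothetical sketch of what Triton's `tl.load(..., mask=..., other=0)` does, not actual kernel code): positions past the end of the sequence are filled with zeros so every chunk has the same fixed size.

```python
import numpy as np

def load_chunk_masked(x, chunk_idx, chunk_size):
    """Emulate a Triton masked load: read a fixed-size chunk from a 1-D array,
    zero-filling the positions that fall past the end of the sequence."""
    seqlen = x.shape[0]
    # Offsets this chunk would read; the last chunk may run past seqlen.
    offs = chunk_idx * chunk_size + np.arange(chunk_size)
    mask = offs < seqlen  # valid positions only
    out = np.zeros(chunk_size, dtype=x.dtype)  # "other=0" fill value
    out[mask] = x[offs[mask]]
    return out
```

Because the out-of-range lanes are zero-filled, the chunked computation can treat every chunk as full-size without reading out of bounds.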

You can see the reference implementation: https://github.com/state-spaces/mamba/blob/8ffd905c91d207f5c0cc84fc2a2fb748655094f0/mamba_ssm/ops/triton/ssd_chunk_state.py#L960

No, the model architectures are different

Can you post a script that helps us reproduce the error? E.g. save the tensors that produce the NaN?
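One way to capture such a repro (a hypothetical NumPy sketch; the helper name and file name are made up, and in a real report you would save the actual tensors with `torch.save`): dump the exact inputs the moment a NaN appears in the output.

```python
import numpy as np

def check_and_dump(name, *arrays, fn):
    """Run fn on the given arrays; if the output contains NaN, save the
    inputs to an .npz file so the failing case can be reproduced."""
    out = fn(*arrays)
    if np.isnan(out).any():
        # Saves inputs as arr_0, arr_1, ... for a self-contained repro.
        np.savez(f"{name}_nan_repro.npz", *arrays)
        raise ValueError(f"NaN detected in {name}; inputs saved")
    return out
```

Attaching the saved file alongside the script makes the failure deterministic for whoever debugs it.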