Tri Dao
There's a reference implementation in PyTorch, but it would probably be quite a bit slower.
It implements the same operation, just more memory-efficient (as the name suggests).
It should compute the same answer.
It's been moved to [mamba_ssm/modules/block.py](https://github.com/state-spaces/mamba/blob/c0a00bd1808881831ddf43206c69362d4df90cf7/mamba_ssm/modules/block.py#L10)
Probably yes. How would you do it with Transformers?
Yes, this is a good idea. The conv1d implementation actually already supports taking in initial states and returning final states. We just haven't had time to wire everything together.
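To illustrate the idea, here's a minimal pure-Python sketch of a causal conv1d that accepts an initial state and returns a final state, so a long sequence can be processed chunk by chunk. This is an assumption-laden illustration, not the repo's fused kernel; the function name and interface are hypothetical.

```python
# Sketch only: a causal 1D convolution that carries state across chunks.
# Not the mamba_ssm kernel; plain Python for illustration.

def causal_conv1d(x, weights, init_state=None):
    """Causal 1D convolution over a sequence x (list of floats).

    init_state: the last (len(weights) - 1) inputs from the previous
    chunk (zeros if this is the first chunk).
    Returns (outputs, final_state); final_state can be passed as
    init_state to the call for the next chunk.
    """
    w = len(weights)
    state = list(init_state) if init_state is not None else [0.0] * (w - 1)
    buf = state + list(x)                      # prepend carried context
    out = [sum(buf[i + j] * weights[j] for j in range(w))
           for i in range(len(x))]             # one output per input
    final_state = buf[len(buf) - (w - 1):]     # last w-1 inputs
    return out, final_state

# Processing in two chunks matches processing the full sequence:
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [0.5, 0.25, 0.25]
full, _ = causal_conv1d(x, w)
y1, s = causal_conv1d(x[:3], w)
y2, _ = causal_conv1d(x[3:], w, init_state=s)
assert full == y1 + y2
```

The point of the sketch is the contract: the final state is exactly the context the next chunk needs, so chunked and full-sequence results agree.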
Yes, chunk size should be a power of 2; that's what Triton supports. To deal with seqlen not divisible by chunk_size, we load with a mask. Anything outside the seqlen...
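The masking idea can be sketched in plain Python (an analogue of Triton's `tl.load(..., mask=...)`; the helper below is hypothetical, not from the repo): positions past the end of the sequence are masked to 0, so they contribute nothing to chunk-wise reductions.

```python
# Sketch only: fixed-size chunk loading with a mask for the tail chunk.

def load_chunk(x, chunk_idx, chunk_size):
    """Return a chunk of exactly chunk_size elements;
    out-of-range positions are masked to 0.0."""
    start = chunk_idx * chunk_size
    offs = range(start, start + chunk_size)
    mask = [i < len(x) for i in offs]
    return [x[i] if m else 0.0 for i, m in zip(offs, mask)]

x = list(range(1, 11))                       # seqlen 10, chunk_size 4
chunks = [load_chunk(x, c, 4) for c in range(3)]
# last chunk is [9, 10, 0, 0]; summing over chunks still equals sum(x)
assert sum(sum(c) for c in chunks) == sum(x)
```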
You can see the reference implementation: https://github.com/state-spaces/mamba/blob/8ffd905c91d207f5c0cc84fc2a2fb748655094f0/mamba_ssm/ops/triton/ssd_chunk_state.py#L960
No, the model architectures are different
Can you post a script that helps us reproduce the error? E.g. save the tensors that produce the NaN?
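A minimal sketch of the kind of repro script being asked for: when a NaN appears, dump the exact inputs so the failure can be replayed. This uses `pickle` on plain lists for illustration; with actual tensors you would use `torch.save` instead. The function name, file name, and stand-in op are all hypothetical.

```python
# Sketch only: dump the inputs that produced a NaN so they can be replayed.
import math
import pickle

def checked_op(inputs):
    out = [v * 2.0 for v in inputs]          # stand-in for the real op
    if any(math.isnan(v) for v in out):
        # Save the offending inputs/outputs for a repro script.
        with open("nan_repro.pkl", "wb") as f:
            pickle.dump({"inputs": inputs, "outputs": out}, f)
        raise ValueError("NaN produced; inputs saved to nan_repro.pkl")
    return out
```

A repro script would then `pickle.load` (or `torch.load`) the dump and call the failing op on exactly those values.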