Tri Dao
`step` is only implemented for generation, so there is no backward pass for it.
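For context, a minimal decoding sketch: it assumes the `step` and `allocate_inference_cache` methods on the `Mamba` module from `mamba_ssm` (check `mamba_simple.py` for the exact signatures) and runs under `torch.inference_mode()`, since `step` has no backward.

```python
import torch
from mamba_ssm import Mamba

layer = Mamba(d_model=256).cuda().eval()
# Per-layer recurrent state for decoding (assumed helper; see mamba_simple.py).
conv_state, ssm_state = layer.allocate_inference_cache(batch_size=1, max_seqlen=1)

x = torch.randn(1, 1, 256, device="cuda")    # one new token: (batch, 1, d_model)
with torch.inference_mode():                 # decode-only: step has no backward
    y, conv_state, ssm_state = layer.step(x, conv_state, ssm_state)
```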
Sorry, we're not familiar with how Hugging Face's `transformers` implements Mamba.
We train with bf16; we haven't tried fp16.
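As a rough illustration (not the exact training setup), a bf16 forward/backward via `torch.autocast`; fp16 is untried here.

```python
import torch
from mamba_ssm import Mamba

layer = Mamba(d_model=256).cuda()
x = torch.randn(2, 64, 256, device="cuda")    # (batch, seqlen, d_model)

# bf16 autocast for the forward pass; parameters and grads stay in fp32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = layer(x)
y.float().sum().backward()
```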
We have no experience with ONNX. Do you have ideas on how to generate ONNX for custom operations? If so, would you like to contribute?
No, we don't have an analogue of cross-attention.
That's an open research question.
`const int n_chunks = (seqlen + 2048 - 1) / 2048`. We do the scan on chunks of length at most 2048 (i.e., process one chunk at a time before...
This takes less space than the input for most values of dstate.
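To make the arithmetic concrete, here is a small sketch of the ceiling division above and of why the per-chunk checkpoints stay small; only the chunk count mirrors the C line, and the saved shapes in the final comment are an assumption for illustration.

```python
CHUNK = 2048  # maximum chunk length used by the scan

def n_chunks(seqlen: int) -> int:
    # Ceiling division, same as (seqlen + 2048 - 1) / 2048 in the kernel.
    return (seqlen + CHUNK - 1) // CHUNK

assert n_chunks(2048) == 1
assert n_chunks(4097) == 3

# Rough size comparison (shapes assumed for illustration): the input holds
# batch * dim * seqlen values, while saving O(dstate) values per (batch, dim)
# per chunk costs about batch * dim * n_chunks * dstate, i.e. roughly a
# dstate / 2048 fraction of the input -- small for typical dstate like 16.
```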
There's no encoder-decoder variant as far as we know; you can try concatenation.
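If it helps, a hedged sketch of the concatenation idea: treat the "source" and "target" as one sequence along the time axis and feed it to the decoder-only model. The model call is hypothetical; adapt it to whichever Mamba LM wrapper you use.

```python
import torch

src_ids = torch.randint(0, 50277, (1, 128))    # (batch, src_len), dummy tokens
tgt_ids = torch.randint(0, 50277, (1, 32))     # (batch, tgt_len), dummy tokens

# Decoder-only models see a single stream, so concatenate along the time axis.
input_ids = torch.cat([src_ids, tgt_ids], dim=1)   # (batch, src_len + tgt_len)
# logits = model(input_ids).logits                 # hypothetical LM call; compute
#                                                  # the loss only on target positions.
```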
There is some work on "bidirectional" Mamba; you can search for it.