Tri Dao
We use `self.A_log` as the parameter so that it is unconstrained. If we parameterized A as a parameter directly, it would be harder to constrain A to be positive (which is what...
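A minimal sketch of the log-parameterization idea, in plain Python rather than the actual module code. The names `positive_from_log`, `a_log`, and `magnitude` are hypothetical, and the final negation is an assumed sign convention for a decaying state, not something stated in the comment above. The point is only that any real value of the stored parameter maps to a strictly positive magnitude, so an optimizer step never needs clamping or projection.

```python
import math

def positive_from_log(a_log):
    """Map unconstrained real values to strictly positive ones via exp."""
    return [math.exp(v) for v in a_log]

# Hypothetical unconstrained parameter values (any real number is allowed).
a_log = [-3.0, 0.0, 2.5]
magnitude = positive_from_log(a_log)  # strictly positive for finite inputs
a = [-m for m in magnitude]           # assumed sign convention: negative decay rates

# An arbitrary update applied to a_log still yields a valid magnitude,
# with no clamping or projection step.
a_log_updated = [v - 10.0 for v in a_log]
magnitude_updated = positive_from_log(a_log_updated)
```

Storing A directly would instead require clamping or re-projecting after every update to keep the sign fixed.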
I just pushed a version of causal-conv1d, can you try again?
We don't have much experience with tabular data but you can try.
That's not supported in the CUDA code, but you can play around with `selective_scan_ref`, which is written in PyTorch (but much slower). Instead of multiplying A with previous hidden states pointwise...
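For orientation, here is a rough sketch of what a pointwise (diagonal) state update looks like when written as a plain sequential loop. It is not `selective_scan_ref` itself: the function name `pointwise_scan`, the single-channel shapes, and the discretization `exp(delta * A)` are assumptions made for illustration. The line marked "pointwise update" is the step an experiment with a different state transition would change.

```python
import math

def pointwise_scan(x, delta, A, B, C):
    """Sequential scan for one channel with a diagonal (pointwise) state update.

    x, delta: length-L lists of floats (input and step size per time step)
    A:        length-N list of floats (diagonal entries of the state matrix)
    B, C:     length-L lists of length-N lists (per-step input/output projections)
    """
    n = len(A)
    h = [0.0] * n  # hidden state, one scalar per state dimension
    ys = []
    for t in range(len(x)):
        for i in range(n):
            a_bar = math.exp(delta[t] * A[i])  # assumed discretization of A
            # pointwise update: each state entry only sees its own previous value
            h[i] = a_bar * h[i] + delta[t] * B[t][i] * x[t]
        ys.append(sum(C[t][i] * h[i] for i in range(n)))
    return ys
```

With `A = [0.0]`, unit step sizes, and unit projections, the state simply accumulates the inputs, which makes for a quick sanity check before trying anything more elaborate.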
I don't have experience with model merging, keeping this issue open in case there are others who can help.
Can you try again with the latest version of `mamba-ssm`? We've just updated it.
I'm not familiar with the GGUF format but perhaps others might be able to help.
The models were trained with 2k context, so it's cool that passkey retrieval works up to 3-4k tokens. Would be cool to train Mamba with longer context and see how it...
Did you follow the suggestion in the error message?
I'm not sure where the randomness is from. Can you comment out lines in the Mamba implementation to isolate?