compilade
> It sounds like having a simple fallback of expected filenames would be a reasonable thing to include here? I don't know that we want to maintain a ton of...
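As a rough sketch of what such a fallback could look like (illustrative only; the function name and the candidate filenames here are hypothetical, not the actual convert-script logic):

```python
from pathlib import Path

# Hypothetical sketch: try a short list of expected filenames before failing.
def find_model_file(model_dir: Path) -> Path:
    candidates = ["model.safetensors", "pytorch_model.bin", "consolidated.00.pth"]
    for name in candidates:
        path = model_dir / name
        if path.exists():
            return path
    raise FileNotFoundError(f"none of the expected model files found in {model_dir}")
```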
@Tangshengku Bi-Mamba seems amazing!

> The ppl is pretty bad with more than 3500+. So, have you ever tested the performance of your implementation before?

I did test it when...
> However, I first tried to use mamba2-2.7 model and computed the ppl on wiki dataset

@Tangshengku Which model exactly is causing you problems? I can't reproduce the problem with...
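For reference, perplexity is simply the exponent of the mean negative log-likelihood over the evaluated tokens, so a value above 3500 on a wiki dataset usually points to a broken conversion rather than a merely weak model. A minimal sketch of the formula:

```python
import numpy as np

def perplexity(token_logprobs: np.ndarray) -> float:
    # ppl = exp(mean negative log-likelihood), in nats per token
    return float(np.exp(-np.mean(token_logprobs)))

# e.g. a mean NLL of ~8.2 nats/token already gives ppl ≈ 3600
print(perplexity(np.full(1000, -8.2)))
```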
@EthanFS I don't think these small Mamba (1 and 2) models are instruction-tuned, so I wouldn't expect them to ever really "finish" their output (although there *are* cases where they...
> Instead of computing the w_scale and w_bias during tensor transformation, I compute the w_scale and w_bias during inference on the activation, which is equivalent to the operation on the...
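That equivalence holds because a per-output-row scale and bias commute with the matmul: `(s*Wb + b) @ x == s*(Wb @ x) + b*sum(x)`. A small NumPy sketch (the shapes and the exact binarization formula are assumed here, not taken from this PR):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
x = rng.standard_normal(8)

# Assumed row-wise binarization: W ≈ w_scale * sign(W) + w_bias
w_bin = np.sign(W)
w_scale = np.abs(W).mean(axis=1, keepdims=True)
w_bias = W.mean(axis=1, keepdims=True)

# Option A: fold scale/bias into the weight during tensor transformation
y_folded = (w_scale * w_bin + w_bias) @ x

# Option B: keep the binary weight and apply scale/bias to the
# activation at inference time
y_runtime = w_scale.squeeze() * (w_bin @ x) + w_bias.squeeze() * x.sum()

assert np.allclose(y_folded, y_runtime)
```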
> BTW, what do you mean by 'TQ1_0 and TQ2_0' not being good for this model? Do you mean the ppl will be bad, or the speed & memory will be bad?...
There is a problem with multi-user (and/or parallel-sequence) inference for recurrent models (it also happens on `master`, so this branch might have inherited it by merging the latest changes). I'll try to...
> but there's also something else which makes it seem like recurrent states of sequences are not properly isolated

I found the problem! It was introduced in #12181:

https://github.com/ggml-org/llama.cpp/blob/791998b42d6cd6edb31e4d5824e29c100cecd40b/src/llama-graph.cpp#L287-L291

The...
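For anyone following along, here's a toy illustration (not llama.cpp code) of the invariant that was broken: each sequence must resolve to its own recurrent state cell, so identical inputs give identical outputs regardless of what other sequences are doing:

```python
# Toy recurrence h = 0.5*h + x; the point is only that the state of one
# sequence must never leak into another.
class RecurrentStatePool:
    def __init__(self):
        self.cells: dict[int, float] = {}  # seq_id -> running state

    def step(self, seq_id: int, x: float) -> float:
        h = self.cells.get(seq_id, 0.0)
        h = 0.5 * h + x
        self.cells[seq_id] = h
        return h

pool = RecurrentStatePool()
a = [pool.step(0, 1.0) for _ in range(3)]  # sequence 0
b = [pool.step(1, 1.0) for _ in range(3)]  # sequence 1, same inputs
assert a == b  # holds only if the two states are properly isolated
```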
@gabe-l-hart I've been attempting to adapt the CUDA implementation of the `SSM_SCAN` operator to the changes made for Mamba-2 (some shape changes and an extra input tensor for the state...
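Roughly, the recurrence the operator implements looks like the following simplified NumPy sketch (ignoring heads, groups, and the exact ggml tensor layout; in Mamba-2 the decay `A` is a scalar per head, simplified here to one per channel):

```python
import numpy as np

def ssm_scan(x, dt, A, B, C, h0):
    # x, dt: (T, d_inner); A: (d_inner,); B, C: (T, d_state); h0: (d_inner, d_state)
    h = h0.copy()
    ys = []
    for t in range(x.shape[0]):
        dA = np.exp(dt[t][:, None] * A[:, None])       # per-channel decay
        dBx = (dt[t] * x[t])[:, None] * B[t][None, :]  # input contribution
        h = dA * h + dBx                               # state update
        ys.append(h @ C[t])                            # readout, (d_inner,)
    return np.stack(ys), h  # outputs (T, d_inner) and the final state
```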
@vineel96 You do not need to pull #5328, since it was merged into the `master` branch a while ago. This means you can use the latest version of `llama.cpp`,...