compilade
> Implementing RWKV in llama.cpp should be reconsidered, given the recent merge of Mamba SSM support.

If nobody else does it, I'll have time to work on RWKV in `llama.cpp` starting...
> I'm hitting some issues with the KV cache initialization

The KV cache for recurrent models is sized from the GGUF metadata keys `{model}.ssm.state_size`, `{model}.ssm.inner_size`, and `{model}.ssm.kernel_size`. These get read...
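For a rough picture of how those keys determine the cache size, here is a minimal sketch; the helper name and the example dimensions are mine, not the actual sizing code in `llama.cpp`:

```python
# Hedged sketch: estimate the per-sequence recurrent state size for a Mamba-style
# model from the ssm metadata keys mentioned above. The exact layout in llama.cpp
# may differ; this only illustrates how the three values bound the cache size.

def mamba_state_bytes_per_seq(d_conv: int, d_inner: int, d_state: int,
                              n_layer: int, bytes_per_elem: int = 4) -> int:
    """Rough size of one sequence's recurrent state, in bytes."""
    # Rolling convolution state: the last (d_conv - 1) inputs per inner channel.
    conv_state = (d_conv - 1) * d_inner
    # SSM hidden state: d_state values per inner channel.
    ssm_state = d_state * d_inner
    return n_layer * (conv_state + ssm_state) * bytes_per_elem

# Illustrative numbers only (roughly Mamba-130m-sized dimensions).
print(mamba_state_bytes_per_seq(d_conv=4, d_inner=1536, d_state=16, n_layer=24))
```

The point is that the state size is fixed per sequence and independent of context length, unlike a Transformer KV cache.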
> In my case, [PAD151645] [PAD151644] [PAD151643] show up in the output (Qwen-14B-Chat q4k)

Regarding the `[PAD{id}]` tokens, I recently fixed this in #5052, but it requires re-converting existing Qwen models.

> Solution:...
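For context on where those `[PAD{id}]` names come from, a hedged sketch of the idea (not the actual convert script from #5052): when the declared vocab size exceeds the number of tokens the tokenizer actually defines, the gaps can be filled with uniquely named placeholder tokens so the vocabulary is dense.

```python
# Hedged sketch of the [PAD{id}] placeholder idea; purely illustrative.

def fill_vocab_gaps(tokens: dict, vocab_size: int) -> list:
    """Return a dense token list of length vocab_size, padding missing ids."""
    return [tokens.get(i, f"[PAD{i}]") for i in range(vocab_size)]

# Hypothetical example: a tokenizer defining ids 0..2 while the model expects 6.
print(fill_vocab_gaps({0: "<s>", 1: "</s>", 2: "hello"}, 6))
# ['<s>', '</s>', 'hello', '[PAD3]', '[PAD4]', '[PAD5]']
```

Seeing those placeholders in generated text means the model sampled ids the tokenizer never defines, which is why re-converting the model matters.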
> I originally opened this issue, not so much because I had a big interest in getting it fixed for myself, but because I thought the community might want to be aware that...
@saharNooby

> Do PyTorch and ggml store values differently? I also noticed that in llama.cpp, when converting the model, dimensions are reversed, but the data is left untouched -- looks related....
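A small illustration of that convention difference (numpy only, not the conversion code itself; the indices are made up): ggml lists `ne[0]` as the innermost, contiguous dimension, while PyTorch/numpy list the outermost dimension first, so the reported shapes look reversed while the bytes stay the same.

```python
import numpy as np

# Hedged illustration of the "reversed dimensions" observation: for row-major
# data, reversing the reported shape while leaving the bytes untouched describes
# the same memory layout under ggml's ne[] convention.

w = np.arange(6, dtype=np.float32).reshape(2, 3)  # torch/numpy shape: (2, 3)
ne = w.shape[::-1]                                # ggml would report ne = (3, 2)

# Element addressed as (i0, i1) in ggml order is w[i1, i0] in numpy order,
# and both resolve to the same flat offset i1 * ne[0] + i0.
i0, i1 = 2, 1
offset = i1 * ne[0] + i0
assert w[i1, i0] == w.reshape(-1)[offset]
print("ne:", ne, "element:", w[i1, i0])
```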
I do not agree with this change (but I like the underlying intention of making `llama.cpp` less confusing). As I'm working on supporting Mamba in `llama.cpp` (see #5328), I'd like...
> @compilade Just out of curiosity, is any convolution operation performed? I see some tensors with the name `conv`, but I never see `ggml_conv_1d` or `ggml_conv_2d` being used at any...
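Since the reply above is cut off, only a general illustration: a depthwise causal convolution can be evaluated token-by-token with a shift plus an elementwise multiply-accumulate, which is one way to avoid a general conv operator. The names and shapes below are hypothetical, not the actual implementation.

```python
import numpy as np

# Hedged sketch of a depthwise causal 1-d convolution evaluated one token at a
# time, using only a shifted window and elementwise ops.

def conv_step(state: np.ndarray, x: np.ndarray,
              weight: np.ndarray, bias: np.ndarray):
    """state: (d_inner, d_conv-1) past inputs; x: (d_inner,) new input.
    weight: (d_inner, d_conv) depthwise kernel; bias: (d_inner,)."""
    window = np.concatenate([state, x[:, None]], axis=1)  # (d_inner, d_conv)
    y = (window * weight).sum(axis=1) + bias              # per-channel dot product
    new_state = window[:, 1:]                             # shift for the next token
    return y, new_state

d_inner, d_conv = 4, 3
state = np.zeros((d_inner, d_conv - 1), dtype=np.float32)
y, state = conv_step(state, np.ones(d_inner, np.float32),
                     np.full((d_inner, d_conv), 0.5, np.float32),
                     np.zeros(d_inner, np.float32))
print(y)
```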
> Regarding the KV questions:
> IIUC one slot is needed per sequence, so in that sense the KV cache size could be interpreted as the maximum number of distinct...
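To make the "one slot per sequence" point concrete, a toy sketch (hypothetical class, not the actual cache code): each distinct `seq_id` claims one recurrent-state slot, so the number of slots bounds how many distinct sequences can be tracked at once.

```python
# Hedged sketch of slot-per-sequence bookkeeping; purely illustrative.

class RecurrentSlots:
    def __init__(self, n_slots: int):
        self.free = list(range(n_slots))
        self.by_seq = {}  # seq_id -> slot index

    def slot_for(self, seq_id: int) -> int:
        """Reuse the sequence's slot, or claim a free one."""
        if seq_id not in self.by_seq:
            if not self.free:
                raise RuntimeError("no free state slot: too many distinct sequences")
            self.by_seq[seq_id] = self.free.pop()
        return self.by_seq[seq_id]

slots = RecurrentSlots(n_slots=2)
print(slots.slot_for(0), slots.slot_for(1))  # two sequences fit; a third would fail
```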
I've been thinking about what parts of the KV cache API can and cannot be supported for Mamba. In general, functions which operate on whole sequences or the whole KV...
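To illustrate that distinction (hypothetical methods, only loosely mirroring the KV cache API): operations on whole sequences map to copying or clearing a state blob, while operations on a token range within a sequence have nothing to act on, because a recurrent state is a single rolled-up summary rather than per-token entries.

```python
from typing import Optional

# Hedged sketch of which cache operations translate to a recurrent state.
# These are not the actual llama.cpp functions.

class RecurrentState:
    def __init__(self):
        self.states = {}  # seq_id -> rolled-up state blob for that sequence

    def seq_cp(self, src: int, dst: int):
        # Copying a whole sequence is just duplicating its state blob.
        self.states[dst] = self.states[src]

    def seq_rm(self, seq_id: int, p0: Optional[int] = None, p1: Optional[int] = None):
        # Removing a whole sequence is fine; removing only a token range is not,
        # because past tokens are not stored individually, only their summary.
        if p0 is None and p1 is None:
            self.states.pop(seq_id, None)
        else:
            raise NotImplementedError("cannot remove a token range from a recurrent state")
```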
Now that multiple sequences can be processed at once, I've been trying to make the `server` example work with Mamba.

>> I think that most of what is currently done with...