compilade
For the first time, state saving and reloading works for Jamba (both for the whole state and for single sequences). 🎉 This is implemented in

> I'm thinking that the changes for...
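For context, a rough sketch of the two flavors of state saving through the low-level `llama-cpp-python` bindings (assuming they mirror the `llama_state_*` API in `llama.h`; `ctx` and the token history are assumed to be already set up):

```python
import llama_cpp

# assumption: `ctx` is an initialized llama_context, and `tokens` holds the
# token history of sequence 0 as a ctypes array of llama_token
tokens = (llama_cpp.llama_token * 4)(1, 2, 3, 4)

# whole state: saves every sequence in the context at once
llama_cpp.llama_state_save_file(ctx, b"state.bin", tokens, len(tokens))

# single sequence: saves only seq_id 0, restorable into any sequence id later
llama_cpp.llama_state_seq_save_file(ctx, b"seq0.bin", 0, tokens, len(tokens))
```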
> I suspect your implementation of `llama_rs_cache` is a much better approach than the one I took of simply creating a duplicate `llama_kv_cache` and conditionally making the two caches have...
> I'm now able to run a lightweight mamba2 model (details below).

@gabe-l-hart Amazing! I've also merged from latest `master` (into ), and some parts differ, but most of it is similar...
Thanks for finding this and fixing it. There have been many refactors lately where the old `convert_llama_ggml_to_gguf.py` was not tested at all (mostly because I don't have old GGML models...
> Is the [failed CI check](https://github.com/ggerganov/llama.cpp/actions/runs/10333572638/job/28606150773?pr=8928) required for merging this PR? Do I need to do anything about it? It does not seem to be related to this PR....
@dlippold Note that the model referred to here is not `Mamba-Codestral-7B-v0.1`, but `Codestral-22B-v0.1`. Implementing support for `Mamba-Codestral-7B-v0.1` will not affect the performance of `Codestral-22B-v0.1`, because they use totally different architectures (Mamba-2...
@kaetemi Defragmenting when it fails should be good enough, and should be fast enough (I think). `llama_kv_cache_defrag` should do the right thing, but only at the next `llama_kv_cache_update` or `llama_decode`....
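A minimal sketch of that recover-and-retry pattern, assuming the low-level `llama-cpp-python` bindings (which mirror `llama.h`) and an already-prepared `ctx` and `batch`; a return value of 1 from `llama_decode` means no KV slot was found for the batch:

```python
import llama_cpp

# assumption: `ctx` is an initialized llama_context, `batch` a prepared llama_batch
ret = llama_cpp.llama_decode(ctx, batch)
if ret == 1:  # 1: could not find a KV slot for the batch
    llama_cpp.llama_kv_cache_defrag(ctx)  # only *schedules* the defrag...
    llama_cpp.llama_kv_cache_update(ctx)  # ...which is applied here (or at the next llama_decode)
    ret = llama_cpp.llama_decode(ctx, batch)  # retry once the cache is compacted
```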
For models like MiniCPM-V-2.5, should their `Model` subclass instead simply override `get_vocab_base_pre` to hardcode the desired pre-tokenizer? Otherwise the user needs to know the specific incantation required, and could...
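As a sketch of what that could look like in `convert_hf_to_gguf.py` (the architecture string, the `model_arch`, and the returned pre-tokenizer name are all illustrative assumptions):

```python
@Model.register("MiniCPMV")  # hypothetical architecture name
class MiniCPMVModel(Model):
    model_arch = gguf.MODEL_ARCH.MINICPM  # assumption: reuses the MiniCPM graph

    def get_vocab_base_pre(self, tokenizer) -> str:
        # hardcode the pre-tokenizer instead of relying on the chkhsh lookup,
        # so users don't need to know the right incantation
        return "llama-bpe"  # assumption: the desired pre-tokenizer for this model
```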
> @compilade Is it ok to merge this?

@Galunid I'm not sure, since this exposes a way to easily make invalid model files without any warning.

> I meant this...
> `llm_load_print_meta: n_vocab = 92550`
> `INFO:hf-to-gguf:output.weight, torch.bfloat16 --> F16, shape = {4096, 92544}`

@Sakura4036 The vocab size (92550) does not match the tensor size (92544). Try to modify the `vocab_size` field...
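Assuming the `vocab_size` in question lives in the HF model's `config.json`, a one-off fix before re-converting could look like this (the target value is taken from the tensor shape in the log above):

```python
import json

# inside the HF model directory
with open("config.json") as f:
    cfg = json.load(f)

cfg["vocab_size"] = 92544  # match the actual rows of output.weight from the log

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```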