fairydreaming

85 comments by fairydreaming

> Probably for another PR, but performance right now is not ideal due to tokenization speed. Specifically, recomputing `token_matcher` and `user_defined_token_matcher` on each call is quite costly. It might be...

Is there anything preventing this from being merged?

@cyanic-selkie I think it may work if you run the `llama-export-lora` command on the unquantized model and lora adapter files and then quantize the resulting merged model file.
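A rough sketch of that workflow (the file names and the Q8_0 quantization type below are placeholders, not taken from the thread):

```bash
# 1. merge the LoRA adapter into the unquantized (e.g. F16) base model
./bin/llama-export-lora \
    -m base-model-f16.gguf \
    --lora lora-adapter-f16.gguf \
    -o merged-f16.gguf

# 2. quantize the merged model afterwards
./bin/llama-quantize merged-f16.gguf merged-q8_0.gguf Q8_0
```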

From debugging, the exact cause of the failure is `GGML_OP_CPY` failing during `ggml_compute_forward_dup()`, where the `dst->src[0]` tensor is `GGML_TYPE_Q8_0` while the `dst` tensor is `GGML_TYPE_F32`. This `GGML_OP_CPY` is the result of `ggml_cast(ctx0,...

@ngxson or at least mention this in the `llama-export-lora` README.md, as it currently contains an example with a quantized lora adapter which, from what you say, is not supported: `./bin/llama-export-lora \ -m...`

I had some ideas about this, not sure if they are feasible, but...
- Would be nice if we could use caching at the level of individual tensors instead of...

> The `llama_context` now also implements a "graph building" interface `llama_graph_i`. The idea is that every model will utilize this interface to create its compute graphs. For example, where a...

@ggerganov I think there's still one thing missing. There should be an abstract kv cache interface, `llama_kv_cache_i` or something like this that caches would implement (and `llama_context::get_kv_self()` would return this...

> @ggerganov I think there's still one thing missing. There should be an abstract kv cache interface, `llama_kv_cache_i` or something like this that caches would implement (and `llama_context::get_kv_self()` would return...

@ggerganov I got DeepSeek R1 working with a custom MLA cache and context type (still have to test cache save/restore), a few thoughts that came to my mind while working on this:...
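As a side note on the cache save/restore part, one rough way to smoke-test it would be the stock `save-load-state` example; a minimal sketch, assuming a recent build where the binary is named `llama-save-load-state`, and with placeholder model path and flags:

```bash
# generate once, save the state, restore it, and check the continuation matches
./bin/llama-save-load-state \
    -m deepseek-r1-q4_k_m.gguf \
    -p "The quick brown fox" \
    -ngl 0
```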