Andrei
That's the approach I was initially trying but it caused [this assert](https://github.com/ggerganov/llama.cpp/blob/73bac2b11d7d3e20982fc9ee607625836387db8b/llama.cpp#L12293) to fail as the logits aren't reserved when `cparams.causal_attn` is false. However I think I was just missing...
@ggerganov I was able to come back to this and finally get it working. Changes:

- Added a `llama_token_inp_embd` function to the `llama.h` API which translates a set of input tokens...
@ggerganov no problem, I'll work with @ngxson and see if I can provide support on that PR.
Hey @ggerganov, I missed this earlier. Thank you, I just need some quick clarifications around the KV cache behaviour. The following is my understanding of the `kv_cache` implementation -...
Hi @agunapal, sorry to get to this so late. Are you setting the `n_gpu_layers` parameter? It's required to offload layers to the GPU and is off by default.
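For context, a sketch of how the layer offload is typically enabled (flag and parameter names as of the versions discussed here; the model path is a placeholder):

```shell
# llama.cpp's example binary: -ngl / --n-gpu-layers controls how many
# layers are offloaded to the GPU (0, the default, keeps everything on CPU).
./main -m ./models/7B/model.gguf -p "Hello" -ngl 35

# In llama-cpp-python the equivalent is the n_gpu_layers constructor
# argument, e.g. Llama(model_path="...", n_gpu_layers=35).
```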
@agunapal thanks for providing that. It looks like the issue might actually be with llama.cpp or your version of Metal, as it's only happening when the Metal kernel file is...
@agunapal yeah, that's very strange. Can you post the top part of the `./main` command where it's setting up? You can also build `llama.cpp` as a shared library with `cmake...
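One way to do the shared-library build, sketched with the standard CMake option (exact option names and output paths may differ across llama.cpp versions and platforms):

```shell
# BUILD_SHARED_LIBS is a standard CMake switch; it makes the build emit
# a shared library (libllama.so / .dylib / .dll) instead of a static one.
cmake -B build -DBUILD_SHARED_LIBS=ON
cmake --build build --config Release
```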
@agunapal try setting `n_gpu_layers` to 1 now.
Hey @BlackLotus, you're exactly right: the `choices` key comes from the [OpenAI API](https://platform.openai.com/docs/api-reference/completions/object) but it's unused in this library at the moment. I'm currently working on the multi-completion feature...
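For reference, the completion object in that API nests each generated completion under `choices`. A minimal sketch with made-up values (ids, model name, and text are placeholders):

```python
# Illustrative shape of an OpenAI-style completion response. The "choices"
# list holds one entry per generated completion; with a single completion
# (the common case today) it has exactly one element.
response = {
    "id": "cmpl-123",            # hypothetical id
    "object": "text_completion",
    "created": 1700000000,
    "model": "example-model",    # hypothetical model name
    "choices": [
        {
            "text": "Hello there!",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
}

# A client supporting multiple completions would iterate over "choices";
# with a single completion, choices[0] is the only entry.
first_text = response["choices"][0]["text"]
```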
@Smartappli I was looking at this a few months ago as well because uv is a pretty amazing tool. The issue I ran into, however, is that it doesn't log...