Martin Evans
> kv cache management

The `BatchedExecutor` exposes all of the kv cache operations, per "Conversation". So e.g. you can shift off tokens, or rewind state etc. That should be a...
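For illustration, a minimal sketch of what per-Conversation kv cache manipulation might look like. The `Rewind` and `ShiftLeft` calls here are assumptions about the `Conversation` API and may differ between versions, so check the current `BatchedExecutor` examples for the exact signatures:

```csharp
using LLama;
using LLama.Batched;
using LLama.Common;

var parameters = new ModelParams("model.gguf");              // hypothetical model path
using var model = LLamaWeights.LoadFromFile(parameters);
using var executor = new BatchedExecutor(model, parameters);

using var conversation = executor.Create();
conversation.Prompt("Hello");

// Assumed API: undo the last 8 tokens of this conversation's state,
// e.g. to regenerate from an earlier point.
conversation.Rewind(8);

// Assumed API: drop 16 tokens from the start of this conversation's
// kv cache to free space for further generation.
conversation.ShiftLeft(16);
```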
See #662; that update should include these things.
I don't know much about CUDA, but yes I think that would fix it (Onkitova tested it out in https://github.com/SciSharp/LLamaSharp/pull/371). Last time we discussed this ([ref](https://github.com/SciSharp/LLamaSharp/issues/350#issuecomment-1879916928)) I think we decided...
LLamaSharp intends to be thread-safe, but that's a bit tricky due to some thread-safety issues in llama.cpp itself. At the moment it's set up so there's a global lock...
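To illustrate the kind of guard being described (this is just a sketch of the global-lock pattern, not the actual LLamaSharp internals):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class GlobalInferenceLock
{
    // One lock shared by every context, because llama.cpp itself is not
    // safe to call concurrently from multiple threads.
    private static readonly SemaphoreSlim Lock = new(1, 1);

    public static async Task<T> RunAsync<T>(Func<T> nativeCall)
    {
        await Lock.WaitAsync();
        try
        {
            // Only one thread at a time reaches the native llama.cpp call.
            return nativeCall();
        }
        finally
        {
            Lock.Release();
        }
    }
}
```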
LLamaSharp has the `BatchedExecutor`, which is an entirely new executor I've been working on. You can spawn multiple "Conversations", which can all be prompted, and then inference runs for all...
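As a rough sketch (assuming an already-constructed `BatchedExecutor` named `executor`, and that `Infer()` is awaited), prompting several conversations and running them in a single inference call looks like:

```csharp
// Sketch: assumes `executor` is an existing BatchedExecutor.
using var conversationA = executor.Create();
using var conversationB = executor.Create();

conversationA.Prompt("Tell me a joke.");
conversationB.Prompt("Summarise this paragraph...");

// One call runs inference for every conversation with pending tokens,
// batched together into a single pass over the model.
await executor.Infer();
```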
The `BatchedExecutor` is actually already available in the previous release (although of course there will be improvements in the next release!).
I'd suggest cloning the master branch and working with that; `BatchedExecutor` is very new and I think the things you're asking about have been changed (and hopefully improved!). For example...
`BatchedExecutor` itself is not currently designed to be used in parallel (although it might be modified to allow that in the future). The parallelism is built into it - when...
Try the `BatchedExecutor` demos in LLamaSharp to get a feel for the speed. The `Fork` example starts with one conversation and keeps forking it again and again so it ends...
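A hedged sketch of the forking pattern (the `Fork` method name is taken from that example; check the demo source for exact usage):

```csharp
// Sketch: assumes `executor` is an existing BatchedExecutor.
using var root = executor.Create();
root.Prompt("Once upon a time");
await executor.Infer();

// Forking shares the already-evaluated kv cache, so both branches can
// continue from the same prefix without re-evaluating the prompt.
using var branchA = root.Fork();
using var branchB = root.Fork();
```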
The basic flow for the batched executor is:

1. Create one or more conversations:

```csharp
using var conversation = executor.Create();
conversation.Prompt("Hello AshD");
```

2. Call `Infer()` to run the model...
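Putting the steps together, a generation loop might look roughly like the following. The single-token `Prompt` overload and `SampleNextToken` are assumptions (the sampling API has changed between releases), so treat them as placeholders:

```csharp
// Sketch: assumes `executor` is an existing BatchedExecutor.
using var conversation = executor.Create();
conversation.Prompt("Hello AshD");

for (var i = 0; i < 64; i++)
{
    // Run the model for every conversation with pending tokens.
    await executor.Infer();

    // `SampleNextToken` is a hypothetical helper standing in for the real
    // sampling API, which differs between LLamaSharp versions.
    var token = SampleNextToken(conversation);

    // Feed the sampled token back in so the next Infer() continues generation.
    conversation.Prompt(token);
}
```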