llama-cpp-python
Add batch inference support (WIP)
Closes #771
This is going to be a big PR as it requires refactoring a good deal of the `Llama` class to make it thread safe and support multiple parallel sequences. The goal is to introduce no breaking changes as long as you don't use the new functionality. Some of the lower-level methods like `eval`, `sample`, and `generate` may have undefined behaviour when the KV cache holds multiple sequences, so asserts will need to be raised accordingly (a rough sketch of such a guard follows the task list).
- [ ] Refactor the `Llama._create_completion` spaghetti-ball (should be able to fix #914 as well)
- [ ] Add support for multiple completions per request
- [ ] Add support for parallel requests
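For illustration only, a minimal sketch of the kind of guard described above; the class and attribute names here are assumptions, not the PR's actual implementation:

```python
# Illustrative sketch: refuse single-sequence low-level ops when the KV cache
# holds more than one sequence. Names are hypothetical, not llama-cpp-python API.
class SequenceGuard:
    def __init__(self) -> None:
        self.active_seq_ids: set[int] = set()

    def assert_single_sequence(self, method_name: str) -> None:
        if len(self.active_seq_ids) > 1:
            raise AssertionError(
                f"{method_name}() has undefined behaviour while the KV cache "
                f"holds {len(self.active_seq_ids)} sequences"
            )
```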
@abetlen any progress on this? I am very interested in this feature
@abetlen I'm also curious to know if this is still a planned feature. Thank you
same here
Hey guys, yes it is, it's just taking longer than expected because I need to refactor a lot of the `Llama` class internals while avoiding breaking changes. At the same time I also don't want to hold up bug fixes and llama.cpp updates.
Next steps right now are:
- Refactoring `create_completion` so it can be used for parallel sequence completions
- Introduce a sampling context to manage parallel-sequence state like grammars, mirostat params, etc.
- Add multi-completion (i.e. the `n` parameter in the OpenAI API) support
- Add parallel completions support through a slots API similar to llama.cpp
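Once the `n` parameter is supported, usage through the OpenAI-compatible server would presumably look something like this sketch (the base URL, model name, and prompt are placeholders; nothing here is final API):

```python
# Sketch: requesting several completions in one call against the
# llama-cpp-python OpenAI-compatible server, assuming `n` is supported.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-required")

response = client.completions.create(
    model="llama-2-70b",                      # placeholder model name
    prompt="Write a haiku about batching.",   # placeholder prompt
    n=3,                                      # three completions for one request
    max_tokens=64,
)

for i, choice in enumerate(response.choices):
    print(f"--- completion {i} ---")
    print(choice.text)
```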
+1 on this, would really love to see this feature. Right now I can't use llama-cpp-python in production because of it :(
> Add support for multiple completions per request
> Add support for parallel requests
Hope we can get these two great features soon!
I am also highly interested in this, would be really really great! 😀
Hey @abetlen, this would be huge if you're still working on it.
Right now I'm using 100% of the VRAM on an A40, and getting like 3% utilization for the FLOPS and Memory Bandwidth 😆
I just need to be able to throw more inference at it, but running the Python file twice simultaneously will take up twice the VRAM (not viable; I'm already at the VRAM limit).
I will happily sponsor with some of the cloud compute cost this will save :pray:
I'm not totally sure I understand the code, but from my reading of this PR, with this feature:
- A "context" will have its own specific kv_cache
- A context's state can be saved and loaded (via something like the existing save_state/load_state), which will save/load all state (including the kv_cache?)
- And then I can have N threads: each one loads the model to initialize a context, calls .eval, and then saves the state to continue that evaluation at a later time, keeping a dictionary of llm states for all clients who are waiting on evals to progress (see the sketch after this list)
- Target N to be whatever gives me 100% utilization of FLOPs or memory bandwidth (probably bandwidth)
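A rough sketch of that per-client pattern using the current single-context API, assuming the existing `Llama.save_state()`/`load_state()`/`eval()` methods; the model path, the lock, and the `client_states` dict are placeholders, not part of llama-cpp-python:

```python
# Rough sketch: juggle per-client KV-cache state on one shared Llama instance.
# Assumes Llama.save_state()/load_state()/eval(); the model path, the lock and
# the client_states dict are placeholders/illustrative.
import threading
from llama_cpp import Llama

llm = Llama(model_path="../models/llama-2-70b/ggml-model.gguf")  # placeholder path
lock = threading.Lock()
client_states: dict[int, object] = {}  # client_id -> LlamaState saved between steps

def step(client_id: int, new_tokens: list[int]) -> None:
    with lock:  # Llama is not thread safe today, so access must be serialized
        if client_id in client_states:
            llm.load_state(client_states[client_id])  # resume this client's sequence
        else:
            llm.reset()                                # start a fresh sequence
        llm.eval(new_tokens)                           # advance by the new tokens
        client_states[client_id] = llm.save_state()    # park the state for later
```

Note that the lock serializes everything onto a single context, so this doesn't actually raise GPU utilization; that's exactly the gap the parallel-sequence support in this PR is meant to close.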
Or, alternatively, will the kv_cache be global? That could save RAM by letting the parallel sequences share the kv_cache, but maybe that's harder to implement, and it wouldn't matter in cases where the parallel threads don't share any substrings. Not sure.
A totally high-level idea would be to allow Llama() to be initialized multiple times but implicitly share VRAM when the instances come from the same underlying model file. That would be easiest for the user, though it might get weird with the underlying implementation. Interesting ideas though:
```python
model = Model("../models/llama-2-70b")
llama1 = Llama(model)
llama2 = Llama(model)
```
Hey @abetlen, any updates on this one? Looking to add support for this in instructlab/sdg and instructlab/instructlab! Really hoping for this functionality 🙏