llama-cpp-python
Add batch inference support (WIP)
Closes #771
This is going to be a big PR as it requires refactoring a good deal of the `Llama` class to make it thread safe and support multiple parallel sequences. The goal is to introduce no breaking changes as long as you don't use the new functionality. Some of the lower-level methods like `eval`, `sample`, and `generate` may have undefined behaviour when the KV cache holds multiple sequences, so asserts will need to be raised accordingly (a rough sketch of such a guard follows the task list).
- [ ] Refactor the `Llama._create_completion` spaghetti-ball (should be able to fix #914 as well)
- [ ] Add support for multiple completions per request
- [ ] Add support for parallel requests
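For illustration only, a minimal sketch of the kind of guard described above; the class and attribute names here are assumptions, not the PR's actual implementation:

```python
# Illustrative sketch: refuse single-sequence low-level ops when the KV cache
# holds more than one sequence. Names are hypothetical, not llama-cpp-python API.
class SequenceGuard:
    def __init__(self) -> None:
        self.active_seq_ids: set[int] = set()

    def assert_single_sequence(self, method_name: str) -> None:
        if len(self.active_seq_ids) > 1:
            raise AssertionError(
                f"{method_name}() has undefined behaviour while the KV cache "
                f"holds {len(self.active_seq_ids)} sequences"
            )
```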
@abetlen any progress on this? I am very interested in this feature
@abetlen I'm also curious to know if this is still a planned feature. Thank you
same here
Hey guys, yes it is, it's just taking longer than expected because I need to refactor a lot of the `Llama` class internals while avoiding breaking changes. At the same time I also don't want to hold up bug fixes and llama.cpp updates.
Next steps right now are:
- Refactoring `create_completion` so it can be used for parallel sequence completions
- Introduce a sampling context to manage parallel-sequence state like grammars, mirostat params, etc.
- Add multi-completion (i.e. the `n` parameter in the OpenAI API) support
- Add parallel completions support through a slots API similar to llama.cpp
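Once the `n` parameter is supported, usage through the OpenAI-compatible server would presumably look something like this sketch (the base URL, model name, and prompt are placeholders; nothing here is final API):

```python
# Sketch: requesting several completions in one call against the
# llama-cpp-python OpenAI-compatible server, assuming `n` is supported.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-required")

response = client.completions.create(
    model="llama-2-70b",                      # placeholder model name
    prompt="Write a haiku about batching.",   # placeholder prompt
    n=3,                                      # three completions for one request
    max_tokens=64,
)

for i, choice in enumerate(response.choices):
    print(f"--- completion {i} ---")
    print(choice.text)
```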
+1 on this, would really love to see this feature. Right now I can't use llama-cpp-python in production because of it :(
> Add support for multiple completions per request
> Add support for parallel requests
Hope we can get these two great features soon!
I am also highly interested in this, would be really really great! 😀
Hey @abetlen, this would be huge if you're still working on it.
Right now I'm using 100% of the VRAM on an A40, and getting like 3% utilization for the FLOPS and Memory Bandwidth 😆
I just need to be able to throw more inference at it, but running the Python file twice simultaneously will take up twice the VRAM (not viable; I'm already at the VRAM limit).
I will happily sponsor with some of the cloud compute cost this will save :pray:
I'm not totally sure I understand the code, but from my reading of this PR, with this feature:
- A "context" will have its own specific kv_cache
- A context's state can be saved and loaded (via something like the existing save_state/load_state), which will save/load all state (including the kv_cache?)
- And then I can have N threads: each one loads the model to initialize a context, calls .eval, and then saves the state to continue that evaluation at a later time, keeping a dictionary of llm states for all clients who are waiting on evals to progress (see the sketch after this list)
- Target N to be whatever gives me 100% utilization of FLOPs or memory bandwidth (probably bandwidth)
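A rough sketch of that per-client pattern using the current single-context API, assuming the existing `Llama.save_state()`/`load_state()`/`eval()` methods; the model path, the lock, and the `client_states` dict are placeholders, not part of llama-cpp-python:

```python
# Rough sketch: juggle per-client KV-cache state on one shared Llama instance.
# Assumes Llama.save_state()/load_state()/eval(); the model path, the lock and
# the client_states dict are placeholders/illustrative.
import threading
from llama_cpp import Llama

llm = Llama(model_path="../models/llama-2-70b/ggml-model.gguf")  # placeholder path
lock = threading.Lock()
client_states: dict[int, object] = {}  # client_id -> LlamaState saved between steps

def step(client_id: int, new_tokens: list[int]) -> None:
    with lock:  # Llama is not thread safe today, so access must be serialized
        if client_id in client_states:
            llm.load_state(client_states[client_id])  # resume this client's sequence
        else:
            llm.reset()                                # start a fresh sequence
        llm.eval(new_tokens)                           # advance by the new tokens
        client_states[client_id] = llm.save_state()    # park the state for later
```

Note that the lock serializes everything onto a single context, so this doesn't actually raise GPU utilization; that's exactly the gap the parallel-sequence support in this PR is meant to close.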
Or, alternatively, will the kv_cache be global? That could save RAM by letting the parallel sequences share the kv_cache, but maybe that's harder to implement, and it wouldn't matter in cases where the parallel threads don't share any substrings. Not sure.
A totally high-level idea would be to allow Llama() to be initialized multiple times but implicitly share VRAM when the instances come from the same underlying model file. That would be easiest for the user, though it might get weird with the underlying implementation. Interesting ideas though:
```python
model = Model("../models/llama-2-70b")
llama1 = Llama(model)
llama2 = Llama(model)
```
Hey @abetlen, any updates on this one? Looking to add support for this in instructlab/sdg and instructlab/instructlab! Really hoping for this functionality 🙏