worker-vllm

[feat] ability to set max_num_seqs

Open · kalocide opened this issue 6 months ago · 1 comment

The memory usage of vLLM's KV cache is directly proportional to the model's maximum batch size (`max_num_seqs`). vLLM's default is 256, but many users don't need nearly that many concurrent sequences. For example, someone running a personal model that serves one request at a time only needs a batch size of 1. Unfortunately, the default is tuned for large-scale parallel inference, which makes it prohibitively expensive to run models quickly on anything but the largest GPUs. Being able to adjust this value would be an easy win for the performance and usefulness of this repo.
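For reference, this is a minimal sketch of how the setting is passed to vLLM directly when constructing an engine (the model name here is just an example, not something specific to this worker):

```python
from vllm import LLM, SamplingParams

# Cap the number of sequences the scheduler will batch per iteration.
# vLLM's default is 256; a single-user deployment can get away with 1.
llm = LLM(
    model="facebook/opt-125m",  # example model
    max_num_seqs=1,
)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```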

I can write up a PR for this if that works better; I think I know what needs to be done (rough sketch below), I'm just not very familiar with RunPod serverless right now.
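Roughly, the worker would read an environment variable and forward it into the engine arguments. A hedged sketch, assuming the worker builds its `AsyncEngineArgs` from env vars (the `MAX_NUM_SEQS` variable name is just my suggestion, not an existing option):

```python
import os

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Hypothetical env var; fall back to vLLM's default of 256 when unset.
max_num_seqs = int(os.getenv("MAX_NUM_SEQS", "256"))

engine_args = AsyncEngineArgs(
    model=os.getenv("MODEL_NAME", "facebook/opt-125m"),  # example default
    max_num_seqs=max_num_seqs,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```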

kalocide · Jul 30 '24 06:07