worker-vllm
The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
I'm trying to build `Llama 3.1` and `Llama 3.1 Instruct`, but the build always fails (on latest main or v1.2.0). Are these models not supported yet? `Llama 3` and `Llama 3...
### Description 1. 🌟 **Upgrade vLLM**: We need to rocket [vLLM to version 0.5.0.post1](https://github.com/vllm-project/vllm/releases/tag/v0.5.0.post1) or beyond! 🚀 2. 🤖 **Tensorize Awesomeness**: The `tensorize` feature is like giving vLLM a turbo...
Gemma-2 no longer requires `flashinfer` - in fact, the newest version of vLLM has a bug in its `flashinfer` usage, which makes the LLM return wrong tokens. This pull request makes it...
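For anyone hitting the wrong-token issue in the meantime, here is a minimal sketch of steering vLLM away from flashinfer, assuming `VLLM_ATTENTION_BACKEND` is the environment variable vLLM reads when choosing its attention backend:

```python
import os

# Assumption: VLLM_ATTENTION_BACKEND is the variable vLLM consults when picking
# its attention implementation. Set it before the engine is constructed to avoid
# the flashinfer path described as buggy above.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"  # rather than "FLASHINFER"
```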
The purpose of this issue is to provide full support for `tools` and `tool_choice="auto"` in worker-vllm. ## ToDo - [ ] vLLM only supports `tool_choice="some_tool_name"` and `tool_choice="none"`, but hopefully soon...
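As a rough illustration of what `tool_choice="auto"` support would enable, here is a sketch of a client request against the worker's OpenAI-compatible route. The endpoint ID, API key, model name, and weather tool below are placeholders, not part of this issue:

```python
from openai import OpenAI

# Placeholders: substitute your own RunPod endpoint ID, API key, and model.
client = OpenAI(
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",
    api_key="<runpod_api_key>",
)

# A single illustrative tool definition in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",  # the mode this issue asks worker-vllm to support
)
print(resp.choices[0].message.tool_calls)
```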
```
Traceback (most recent call last):
2024-08-01T21:29:17.880522621Z   File "/src/handler.py", line 6, in <module>
2024-08-01T21:29:17.880527641Z     vllm_engine = vLLMEngine()
2024-08-01T21:29:17.880533331Z   File "/src/engine.py", line 25, in __init__
2024-08-01T21:29:17.880543011Z     self.llm = self._initialize_llm() if engine is...
```
## Description This PR introduces UV (https://github.com/astral-sh/uv) as a replacement for pip in the Dockerfile. > An extremely fast Python package installer and resolver, written in Rust. Designed as a...
Hello all, I keep scratching my head over why I can sometimes deploy everything on the list, but other things I try run into issues. Anyway, these are my logs from just trying to use...
The memory usage of vLLM's KV cache is directly proportional to the batch size of the model. vLLM's default is 256 but many users don't need nearly that many. For...
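By way of illustration, here is a minimal sketch of capping that batch size through vLLM's `max_num_seqs` engine argument when constructing the engine yourself; the model name and values are placeholders, and worker-vllm would presumably surface this through its own configuration rather than code like this:

```python
from vllm import LLM, SamplingParams

# Sketch only: cap how many sequences the scheduler batches at once.
# vLLM's default is 256; the issue above argues this drives KV-cache memory use,
# and most deployments never need that much concurrency.
llm = LLM(
    model="facebook/opt-125m",     # placeholder model for illustration
    max_num_seqs=64,               # far fewer concurrent sequences than the default
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM is allowed to claim
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```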
A new version (0.5.1) of vLLM has been released; could you please update worker-vllm to use it with RunPod serverless? https://github.com/vllm-project/vllm/releases
Please update for Gemma-2.