vllm_backend
Currently, relative paths to local models are resolved relative to the Triton server process. However, when deploying models to a central model registry, one may not know in advance where...
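One way to address this, sketched below under assumed names (the helper and its arguments are illustrative, not the backend's actual API), is to anchor relative paths at the model's own directory rather than the server process's working directory:

```python
# Hypothetical sketch: resolve a relative model path against the model's
# own directory instead of the Triton server process's cwd.
# `resolve_model_path` and its arguments are illustrative names.
from pathlib import Path


def resolve_model_path(model_path: str, model_dir: str) -> str:
    """Return an absolute path; relative paths are anchored at model_dir."""
    p = Path(model_path)
    if p.is_absolute():
        return str(p)
    # Anchor at the model directory, then normalize.
    return str((Path(model_dir) / p).resolve())


# A relative path stays usable no matter where the server was launched from.
print(resolve_model_path("weights/llama", "/opt/registry/models/m1"))
```

With this approach, moving the model repository does not break relative references, since they travel with the model directory.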
Like the `usage` field of OpenAI's chat completion object: https://platform.openai.com/docs/api-reference/chat/object#chat/object-usage
Please note: this PR should be reviewed and merged after the server's PR: https://github.com/triton-inference-server/server/pull/7500
#### What does the PR do?
Report more counter, histogram, and gauge metrics from vLLM to the Triton metrics server.

**Checklist**:
- [x] PR title reflects the change and is of format...
# Issue
The only multi-modal input supported in the vLLM backend is Llama 3.2.
# Contribution
- Add support for Qwen2.5 multi-modal input
- Refactor code to easily add other multi-modal...
# Add Priority Request Support for vLLM Async Engine
## Description
This PR adds support for priority-based request scheduling in the vLLM async engine. When the engine is configured with...
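The scheduling idea behind such a change can be illustrated with a minimal sketch (not the actual backend code): with a priority policy, requests carrying a lower priority value are served first, and ties fall back to arrival order. The class and names below are assumptions for illustration only.

```python
# Illustrative priority queue for request scheduling: lower priority value
# is served first; arrival order breaks ties. Not the vLLM implementation.
import heapq
import itertools


class PriorityRequestQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # monotonically increasing tie-breaker

    def add(self, request_id: str, priority: int = 0) -> None:
        # (priority, arrival_seq, id): heapq orders by priority, then arrival.
        heapq.heappush(self._heap, (priority, next(self._seq), request_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]


q = PriorityRequestQueue()
q.add("req-a", priority=5)
q.add("req-b", priority=1)
q.add("req-c", priority=1)
print([q.pop(), q.pop(), q.pop()])  # → ['req-b', 'req-c', 'req-a']
```

The `(priority, seq, id)` tuple ordering keeps the heap stable for equal priorities, which matters so that a burst of same-priority requests is not reordered arbitrarily.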
TODO:
- [x] Fix non-graceful shutdown
- [ ] Re-implement `build_async_engine_client_from_engine_args` for our use case
- [ ] Implement `ProxyStatLogger(VllmStatLoggerBase)`, which will be attached to a `MQLLMEngine` process and pass metrics updates via...
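The last TODO item could be sketched roughly as follows, assuming the updates are passed over a `multiprocessing.Queue` (the class body, method names, and metric names here are placeholders, not the planned implementation):

```python
# Hedged sketch of the ProxyStatLogger idea: a logger living inside the
# engine process forwards raw metric updates over a multiprocessing.Queue
# so the parent process can publish them to the Triton metrics server.
import multiprocessing as mp


class ProxyStatLogger:
    """Engine-process side: ships (name, value) metric updates out."""

    def __init__(self, queue: mp.Queue):
        self._queue = queue

    def log(self, name: str, value: float) -> None:
        self._queue.put((name, value))


def drain(queue, n: int):
    """Parent-process side: collect exactly n pending updates."""
    return [queue.get() for _ in range(n)]


queue = mp.Queue()
logger = ProxyStatLogger(queue)
logger.log("num_requests_running", 3)
logger.log("time_to_first_token", 0.042)
print(drain(queue, 2))  # → [('num_requests_running', 3), ('time_to_first_token', 0.042)]
```

Keeping the logger a thin proxy means the engine process never touches the metrics server directly; all aggregation stays in the parent, which avoids duplicating Prometheus state across processes.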