Nathan Price
This PR adds documentation for converting LoRA adapters from a Hugging Face checkpoint into a warmup that can be used with the triton-inference-server TensorRT-LLM backend. This approach allows for the...
### System Info

Debian 11

`nvidia-smi`

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC...
```
I would like to have labels applied which will be populated from the content of the request body. I tried something like:

```
async def get_label_value(request: Request):
    return request.json().get("label", None)

app.add_middleware(
    PrometheusMiddleware,
...
```
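One likely problem in the snippet above is that Starlette's `Request.json()` is a coroutine, so the extractor has to `await` it. A minimal sketch of the corrected callback, using a hypothetical `FakeRequest` stand-in (not part of starlette_exporter) so it runs without a server:

```python
import asyncio
import json

async def get_label_value(request):
    # Request.json() is async in Starlette, so it must be awaited
    body = await request.json()
    return body.get("label", None)

class FakeRequest:
    """Minimal stand-in for a Starlette Request, for demonstration only."""
    def __init__(self, raw: bytes):
        self._raw = raw

    async def json(self):
        return json.loads(self._raw)

label = asyncio.run(get_label_value(FakeRequest(b'{"label": "checkout"}')))
print(label)  # checkout
```

Whether a coroutine is accepted as a label callback depends on the middleware's API; if it only takes sync callables, the body would need to be parsed in an earlier middleware and stashed on `request.state`.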
Working to allow custom histogram binning to be applied to Prometheus metrics. Ideally this could be applied to activities as well as workflows. The current implementation...
Currently, the default bins for activity metrics top out at 60 seconds. This limits my observability into activities that take a long time, or even take more than...
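For illustration, the kind of wider binning I am after can be sketched in plain Python; the bucket edges here are hypothetical examples, not actual defaults (beyond the 60-second cap mentioned above):

```python
import bisect

# Hypothetical wider bucket edges (seconds). When the stock bins stop at
# 60 s, every long-running activity collapses into the +Inf bucket.
BUCKETS = [1.0, 5.0, 15.0, 60.0, 300.0, 900.0, 3600.0]

def bucket_for(duration_s: float) -> float:
    """Return the upper edge (le) of the histogram bucket a duration falls into."""
    i = bisect.bisect_left(BUCKETS, duration_s)
    return BUCKETS[i] if i < len(BUCKETS) else float("inf")

print(bucket_for(45.0))    # 60.0  -- visible even with default-style bins
print(bucket_for(1200.0))  # 3600.0 -- only distinguishable with wider bins
```

The `bisect_left` lookup matches Prometheus `le` (less-than-or-equal) bucket semantics: a duration exactly on an edge is counted in that edge's bucket.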
# Add Priority Request Support for vLLM Async Engine

## Description

This PR adds support for priority-based request scheduling in the vLLM async engine. When the engine is configured with...
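As a rough illustration of the scheduling idea (a sketch, not vLLM's actual implementation), a priority queue that serves lower-numbered priorities first and breaks ties by arrival order might look like:

```python
import heapq
import itertools

class PriorityRequestQueue:
    """Sketch of priority-based request scheduling: a lower priority value
    is served first; ties fall back to FIFO arrival order."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # monotonic counter as a tiebreaker

    def add(self, request_id: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), request_id))

    def next_request(self) -> str:
        _, _, request_id = heapq.heappop(self._heap)
        return request_id

q = PriorityRequestQueue()
q.add("req-a", priority=1)
q.add("req-b", priority=0)  # more urgent, jumps the queue
q.add("req-c", priority=1)
order = [q.next_request() for _ in range(3)]
print(order)  # ['req-b', 'req-a', 'req-c']
```

The sequence counter matters: without it, two requests at the same priority would be ordered by comparing request IDs rather than by arrival time.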
### 🐛 Describe the bug

When I scale up my deployment that uses a LoRA adapter, I see that all the traffic to the LoRA adapter always goes to the pod...
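The behavior I would expect instead is something closer to least-outstanding-requests routing across replicas. A toy sketch of that policy (the `InFlightRouter` class and pod names are hypothetical, purely to illustrate why new pods should receive traffic):

```python
class InFlightRouter:
    """Toy least-outstanding-requests router: always pick the replica
    with the fewest in-flight requests, so newly scaled pods get traffic."""

    def __init__(self, replicas):
        self.in_flight = {r: 0 for r in replicas}

    def pick(self) -> str:
        replica = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[replica] += 1
        return replica

    def done(self, replica: str) -> None:
        self.in_flight[replica] -= 1

router = InFlightRouter(["pod-0", "pod-1"])
first, second = router.pick(), router.pick()
print(sorted([first, second]))  # ['pod-0', 'pod-1'] -- load spreads out
```

If routing is instead sticky on the adapter ID (e.g. a hash of the LoRA name), every request for one adapter maps to the same pod regardless of scale, which would explain the behavior above.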