Nrepesh Joshi
I have an ONNX model that I converted using the symbolic_shape_infer.py script [here](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/symbolic_shape_infer.py), as referenced in the TensorRT execution provider documentation [here](https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html). I then added the code below to the config...
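The config snippet referred to above is truncated in this excerpt. Separately from that config, a minimal sketch of the shape-inference conversion step itself, assuming the linked script is importable as `onnxruntime.tools.symbolic_shape_infer` and using hypothetical file names:

```python
# Sketch: run ONNX Runtime's symbolic shape inference before loading the model
# with the TensorRT execution provider. File names are hypothetical.
import onnx
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference

model = onnx.load("model.onnx")                       # original exported ONNX model
inferred = SymbolicShapeInference.infer_shapes(model, auto_merge=True)
onnx.save(inferred, "model_shape_inferred.onnx")      # model to hand to the TensorRT EP
```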
**Description** We start the Triton model server, which loads our models with warmups. GPU memory usage sits at around 20958MiB / 81920MiB once the Triton model server is stable, healthy, and ready...
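For reference, "healthy and ready" here corresponds to Triton's liveness/readiness endpoints; a minimal sketch of polling them with the Triton HTTP client (server URL and model name are assumptions for illustration):

```python
# Sketch: check Triton server liveness/readiness after startup and warmup.
# The URL and model name below are placeholders, not from the original report.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

print("live:", client.is_server_live())                   # server process is up
print("ready:", client.is_server_ready())                 # server is ready to serve inference
print("model ready:", client.is_model_ready("my_model"))  # a specific model is loaded
```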
### Your current environment

vllm/vllm-openai:latest docker image or [v0.7.2](https://hub.docker.com/layers/vllm/vllm-openai/v0.7.2/images/sha256-65009b48651a8bc216ab57ed64d7c3d0b0ee8cec77674ccdbcb5f0e8362793a1)

### 🐛 Describe the bug

```
version: '3.8'
services:
  vllm-vllama-api:
    image: vllm/vllm-openai:latest
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - ./vllm-llama3.2:/root/.cache/huggingface
    command: [
      "--model",...
```