server
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
I'd like to implement a repository_agent that does a checksum on a remote repository with artifact_type == TRITONREPOAGENT_ARTIFACT_REMOTE_FILESYSTEM, but checksum_repository_agent only supports TRITONREPOAGENT_ARTIFACT_FILESYSTEM. My idea is to download the files...
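A minimal sketch of the checksum step itself, assuming the remote artifacts have already been downloaded to a local directory; the path handling and the choice of SHA-256 are illustrative and not part of the existing checksum_repository_agent:

```python
import hashlib
import os


def checksum_local_copy(model_dir):
    """Hash every file in a locally downloaded copy of the model
    repository in a deterministic order (sketch; paths are illustrative)."""
    digest = hashlib.sha256()
    for root, _, files in sorted(os.walk(model_dir)):
        for name in sorted(files):
            with open(os.path.join(root, name), "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
    return digest.hexdigest()
```

The agent could then compare this digest against an expected value passed in through its parameters, mirroring what the filesystem-only agent already does.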
How can I release the GPU memory used by triton_python_backend_stub when using the Python backend?
When I use the Python backend, I find there is a process named "triton_python_backend_stub" which holds a lot of GPU memory, and after some inferences it keeps getting bigger and bigger, so I...
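If the growth comes from a framework cache inside the stub (for example PyTorch's caching allocator), a minimal model.py sketch that returns cached blocks to the driver might look like the following; whether it actually helps depends on what is holding the memory, so treat it as an assumption to verify:

```python
import gc

import torch
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # ... run the model on the GPU and build real output tensors here ...
            responses.append(pb_utils.InferenceResponse(output_tensors=[]))
        # Releasing cached blocks after each batch trades some speed for a
        # smaller resident footprint in the stub process.
        torch.cuda.empty_cache()
        return responses

    def finalize(self):
        # Called when the model is unloaded; drop remaining references and
        # return cached GPU memory to the driver.
        gc.collect()
        torch.cuda.empty_cache()
```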
As the title says, I plan to encapsulate the output of an API in the Triton server. How can I implement streaming output?
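Streaming from the Python backend is normally done with a decoupled model: the model config sets model_transaction_policy { decoupled: true }, and execute() can send any number of responses per request through a response sender. A minimal sketch, where the output name and the chunking of the wrapped API's result are illustrative:

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # Decoupled mode: each request gets a response sender, so the model
        # can emit many responses per request (i.e. stream them out).
        for request in requests:
            sender = request.get_response_sender()
            for chunk in ["partial ", "answer ", "from the API"]:  # illustrative
                out = pb_utils.Tensor(
                    "OUTPUT", np.array([chunk.encode()], dtype=np.object_))
                sender.send(pb_utils.InferenceResponse(output_tensors=[out]))
            # Signal that no more responses will be sent for this request.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models return None from execute().
        return None
```

A gRPC streaming client then consumes the responses as they arrive until it sees the final flag.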
Is your feature request related to a problem? Please describe. I'm experiencing challenges with Triton's dynamic batching for audio processing tasks. Currently, dynamic batching only works with inputs of identical...
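For reference, one existing mechanism for batching inputs of different lengths is ragged batching: inputs marked allow_ragged_batch are concatenated without padding and the backend receives the per-request element counts as a separate batch input. Only backends that understand ragged input can use it, and the tensor names below are illustrative:

```
input [
  {
    name: "AUDIO"
    data_type: TYPE_FP32
    dims: [ -1 ]
    allow_ragged_batch: true
  }
]
batch_input [
  {
    kind: BATCH_ACCUMULATED_ELEMENT_COUNT
    target_name: "SAMPLE_OFFSETS"
    data_type: TYPE_FP32
    source_input: "AUDIO"
  }
]
dynamic_batching {
  max_queue_delay_microseconds: 100000
}
```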
#### What does the PR do? This PR adds support for using multiple tokenizers in the OpenAI-compatible frontend, allowing different models to use their own specific tokenizers. This is crucial...
#### What does the PR do? Build using the PA binaries and whl if available. The PR additionally removes the PA tests that are not maintained from the server repo....
In ensemble mode, is it possible to set instance_group gpus: [0,1,2,3] in the config.pbtxt so that the pipeline runs in the following manner: preprocessing (GPU0) ---> inference (GPU0), preprocessing (GPU1)...
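An ensemble itself has no instances; placement is controlled in each composing model's config.pbtxt. The sketch below (names illustrative) creates one instance on each of GPUs 0-3 for a composing model; note that Triton's ensemble scheduler picks an instance for each step independently, so it does not guarantee that preprocessing and inference for the same request stay on the same GPU:

```
# config.pbtxt of a composing model (e.g. the preprocessing model)
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1, 2, 3 ]
  }
]
```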
Hi, I have been going through the Triton Inference Server vLLM backend; one of the important features that has been added is asynchronous inference, which is critical for text generation and...
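On the client side, asynchronous/streaming inference against a decoupled vLLM model is typically driven through the gRPC streaming API. A sketch, where the model name and the input names ("text_input", "stream") are assumptions to check against the deployed model's config:

```python
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient


def callback(results, result, error):
    # Collect streamed (decoupled) responses as they arrive.
    results.put(error if error else result)


results = queue.Queue()
client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=partial(callback, results))

text = grpcclient.InferInput("text_input", [1], "BYTES")
text.set_data_from_numpy(
    np.array([b"Explain Triton in one sentence."], dtype=np.object_))
stream_flag = grpcclient.InferInput("stream", [1], "BOOL")
stream_flag.set_data_from_numpy(np.array([True]))

client.async_stream_infer(model_name="vllm_model", inputs=[text, stream_flag])
# ... drain `results` until the final response arrives, then:
client.stop_stream()
```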
**Description** Hello! I am running out of VRAM (OOM) when I use vllm_backend. I run [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) via Triton Inference Server with vllm_backend and start bombarding the Triton Inference Server API with requests...
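The vLLM backend reads its engine arguments from the model.json placed next to model.py, so bounding the KV cache and concurrency there is the usual first step against OOM under bursty load. The values below are illustrative and not tuned for Qwen2.5-32B-Instruct:

```json
{
  "model": "Qwen/Qwen2.5-32B-Instruct",
  "gpu_memory_utilization": 0.85,
  "max_model_len": 8192,
  "max_num_seqs": 64,
  "enforce_eager": false
}
```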
**Description** When running an ONNX model on Triton with TensorRT, the first requests always time out, and inference requests work as expected only 1-2 minutes after that. Configuring...
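The delay is usually the TensorRT engine being built on the first real request. A model_warmup section in config.pbtxt makes Triton send synthetic requests at load time, so the build cost is paid before the model is reported ready; the input name, shape and accelerator block below are illustrative for an ONNX model using the TensorRT execution provider:

```
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ { name : "tensorrt" } ]
  }
}
model_warmup [
  {
    name: "warmup_sample"
    batch_size: 1
    inputs {
      key: "INPUT__0"
      value {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        zero_data: true
      }
    }
  }
]
```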