server
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
I'd like to implement a repository_agent that does a checksum on a remote repository with artifact_type == TRITONREPOAGENT_ARTIFACT_REMOTE_FILESYSTEM, but checksum_repository_agent only supports TRITONREPOAGENT_ARTIFACT_FILESYSTEM. My idea is to download the files...
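A minimal sketch of the checksum step itself, assuming the remote artifacts have already been downloaded to a local directory; the path handling and the choice of SHA-256 are illustrative and not part of the existing checksum_repository_agent:

```python
import hashlib
import os


def checksum_local_copy(model_dir):
    """Hash every file in a locally downloaded copy of the model
    repository in a deterministic order (sketch; paths are illustrative)."""
    digest = hashlib.sha256()
    for root, _, files in sorted(os.walk(model_dir)):
        for name in sorted(files):
            with open(os.path.join(root, name), "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
    return digest.hexdigest()
```

The agent could then compare this digest against an expected value passed in through its parameters, mirroring what the filesystem-only agent already does.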
How can I release the GPU memory used by triton_python_backend_stub when using the Python backend?
When I use the Python backend, I find there is a process named "triton_python_backend_stub" which holds a lot of GPU memory, and after some inferences it keeps getting bigger and bigger, so I...
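If the growth comes from a framework cache inside the stub (for example PyTorch's caching allocator), a minimal model.py sketch that returns cached blocks to the driver might look like the following; whether it actually helps depends on what is holding the memory, so treat it as an assumption to verify:

```python
import gc

import torch
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # ... run the model on the GPU and build real output tensors here ...
            responses.append(pb_utils.InferenceResponse(output_tensors=[]))
        # Releasing cached blocks after each batch trades some speed for a
        # smaller resident footprint in the stub process.
        torch.cuda.empty_cache()
        return responses

    def finalize(self):
        # Called when the model is unloaded; drop remaining references and
        # return cached GPU memory to the driver.
        gc.collect()
        torch.cuda.empty_cache()
```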
As the title says, I plan to encapsulate the output of an API in the Triton server. How can I implement streaming output?
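Streaming from the Python backend is normally done with a decoupled model: the model config sets model_transaction_policy { decoupled: true }, and execute() can send any number of responses per request through a response sender. A minimal sketch, where the output name and the chunking of the wrapped API's result are illustrative:

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # Decoupled mode: each request gets a response sender, so the model
        # can emit many responses per request (i.e. stream them out).
        for request in requests:
            sender = request.get_response_sender()
            for chunk in ["partial ", "answer ", "from the API"]:  # illustrative
                out = pb_utils.Tensor(
                    "OUTPUT", np.array([chunk.encode()], dtype=np.object_))
                sender.send(pb_utils.InferenceResponse(output_tensors=[out]))
            # Signal that no more responses will be sent for this request.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models return None from execute().
        return None
```

A gRPC streaming client then consumes the responses as they arrive until it sees the final flag.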
Is your feature request related to a problem? Please describe. I'm experiencing challenges with Triton's dynamic batching for audio processing tasks. Currently, dynamic batching only works with inputs of identical...
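For reference, one existing mechanism for batching inputs of different lengths is ragged batching: inputs marked allow_ragged_batch are concatenated without padding and the backend receives the per-request element counts as a separate batch input. Only backends that understand ragged input can use it, and the tensor names below are illustrative:

```
input [
  {
    name: "AUDIO"
    data_type: TYPE_FP32
    dims: [ -1 ]
    allow_ragged_batch: true
  }
]
batch_input [
  {
    kind: BATCH_ACCUMULATED_ELEMENT_COUNT
    target_name: "SAMPLE_OFFSETS"
    data_type: TYPE_FP32
    source_input: "AUDIO"
  }
]
dynamic_batching {
  max_queue_delay_microseconds: 100000
}
```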
#### What does the PR do? This PR adds support for using multiple tokenizers in the OpenAI-compatible frontend, allowing different models to use their own specific tokenizers. This is crucial...
#### What does the PR do? Build using the PA binaries and whl if available. The PR additionally removes the PA tests that are not maintained from the server repo....
In ensemble mode, is it possible to set instance_group gpus: [0,1,2,3] in the config.pbtxt so that the pipeline runs in the following manner: preprocessing (GPU0) ---> inference (GPU0), preprocessing (GPU1)...
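An ensemble itself has no instances; placement is controlled in each composing model's config.pbtxt. The sketch below (names illustrative) creates one instance on each of GPUs 0-3 for a composing model; note that Triton's ensemble scheduler picks an instance for each step independently, so it does not guarantee that preprocessing and inference for the same request stay on the same GPU:

```
# config.pbtxt of a composing model (e.g. the preprocessing model)
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1, 2, 3 ]
  }
]
```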
Hi, I have been going through the Triton Inference Server vLLM backend; one of the important features that has been added is asynchronous inference, which is critical for text generation and...
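On the client side, asynchronous/streaming inference against a decoupled vLLM model is typically driven through the gRPC streaming API. A sketch, where the model name and the input names ("text_input", "stream") are assumptions to check against the deployed model's config:

```python
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient


def callback(results, result, error):
    # Collect streamed (decoupled) responses as they arrive.
    results.put(error if error else result)


results = queue.Queue()
client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=partial(callback, results))

text = grpcclient.InferInput("text_input", [1], "BYTES")
text.set_data_from_numpy(
    np.array([b"Explain Triton in one sentence."], dtype=np.object_))
stream_flag = grpcclient.InferInput("stream", [1], "BOOL")
stream_flag.set_data_from_numpy(np.array([True]))

client.async_stream_infer(model_name="vllm_model", inputs=[text, stream_flag])
# ... drain `results` until the final response arrives, then:
client.stop_stream()
```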
**Description** Hello! I am running out of VRAM (OOM) when I use vllm_backend. I run [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) via Triton Inference Server with vllm_backend and start bombarding the Triton Inference Server API with requests...
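The vLLM backend reads its engine arguments from the model.json placed next to model.py, so bounding the KV cache and concurrency there is the usual first step against OOM under bursty load. The values below are illustrative and not tuned for Qwen2.5-32B-Instruct:

```json
{
  "model": "Qwen/Qwen2.5-32B-Instruct",
  "gpu_memory_utilization": 0.85,
  "max_model_len": 8192,
  "max_num_seqs": 64,
  "enforce_eager": false
}
```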
**Description** When running an ONNX model on Triton with TensorRT, the first requests always time out, and inference requests work as expected only 1-2 minutes after that. Configuring...
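The delay is usually the TensorRT engine being built on the first real request. A model_warmup section in config.pbtxt makes Triton send synthetic requests at load time, so the build cost is paid before the model is reported ready; the input name, shape and accelerator block below are illustrative for an ONNX model using the TensorRT execution provider:

```
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ { name : "tensorrt" } ]
  }
}
model_warmup [
  {
    name: "warmup_sample"
    batch_size: 1
    inputs {
      key: "INPUT__0"
      value {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        zero_data: true
      }
    }
  }
]
```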