tensorrtllm_backend issues

Example `gpu_device_ids` for multi-model usage?

1

### System Info P4D (A100 40 GB x 8) ### Who can help? @juney-nvidia @byshiue ### Information - [X] The official example scripts - [ ] My own modified scripts...

vnkc1

question

`max_batch_size` seems to have no impact on model performance

8

### System Info - CPU architecture: x86_64 - GPU: 1 x Nvidia A100 - Docker image for LLM serialization: nvidia/cuda:12.1.0-devel-ubuntu22.04 - Docker image for triton server launch: nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 - TensorRT...

VitalyPetrov

bug

triaged

Results from sending multiple LoRA weights with inflight_batcher_llm_client differ from results with tensorrtllm

3

Case 1: use tensorrtllm python3 /tensorrtllm_backend/tensorrt_llm/examples/run.py --engine_dir "/data512/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/" \ --max_output_len 2048 \ --tokenizer_dir "/tensorrtllm_backend/tokenizer" \ --input_text "system\nYou are a helpful assistant.\nuser\nWhat is the intention of the following user questions? \Can you help...

stifles

triaged

[MINOR] Fix typo in README

This PR fixes the typo and wrong reference link in README.md.

kooyunmo

Fix batch manager stats link

The link from backend metrics to TRT-LLM batch manager stats is broken, so fixing it on public facing side for user viz.

rmccorm4

Under the main branch, stress testing the in-flight Triton Server with multiple threads can result in the Triton Server getting stuck.

17

As indicated by the title, on the main branch, I used 40 threads to simultaneously send inference requests to the in-flight Triton Server, resulting in the Triton Server getting stuck....

StarrickLiu

Feature request: support multiple model instances on TensorRT LLM triton backend.

15

I used the Baichuan2 13B model with weight-only INT8 and launched a Triton server on a single GPU. Now I have a node that has 2 GPUs and want to multiple model...

wengsnow

triaged

feature request

dynamic batching not working properly with tensorrtllm_backend

3

**Description** priority not working properly with tensorrtllm_backend **Triton Information** Triton: 2.43.0 tensorrtllm_backend: 0.8.0 Are you using the Triton container or did you build it yourself? container **To Reproduce** vim `tensorrtllm_backend/inflight_batcher_llm/client/end_to_end_grpc_client.py`...

gavinzb

Can't launch triton server following docs, expecting [TensorRT] library version 9.2.0.5 got 9.3.0.1

5

### System Info - CPU architecture x86_64 - Nvidia H100 GPU - docker image `nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3` - TensorRT-LLM tag v0.9.0 - tensorrtllm_backend tag v0.9.0 - Ubuntu 22.04 ### Who can help?...

conway-abacus

bug

Encountered an error in forward function: std::bad_cast

1

### System Info - CPU architecture x86_64 - GPU NVIDIA A100 - TensorRT-LLM branch main - TensorRT-LLM commit 71d8d4d3dc655671f32535d6d2b60cab87f36e87 - ### Who can help? @juney-nvidia @kaiyux ### Information - [x]...

wangqy1216

bug
