jayakommuru
Hi @byshiue, is sequence classification with T5 models not supported yet?
@taozhang9527 @byshiue I am running `make -C docker release_run LOCAL_USER=1` but am still facing this error:

```
pull access denied for tensorrt_llm/release, repository does not exist or may require 'docker login': denied:...
```
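In case it helps anyone hitting the same message: a minimal sketch of one likely cause, assuming the `tensorrt_llm/release` image was never built locally. That image is not published on a registry, so `release_run` can only find it in the local Docker cache; building it first with the repo's `release_build` target (per the TensorRT-LLM `docker/Makefile`; target names may differ across versions) should avoid the pull attempt.

```
# Assumption: the pull fails because tensorrt_llm/release only exists as a
# locally built image, not on Docker Hub. Build it first, then run it.
make -C docker release_build LOCAL_USER=1   # builds tensorrt_llm/release into the local cache
make -C docker release_run LOCAL_USER=1     # now resolves the image locally instead of pulling
```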
@LoverLost were you able to figure this out?
@gabriel-peracio @hcnhcn012 @MrD005 were you able to find a fix for this?
@byshiue @schetlur-nv can you help with this? I am not able to deploy the basic t5-small model following the instructions given in https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/encoder_decoder.md
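For context, a condensed sketch of the sort of build steps that guide describes, assuming the `enc_dec` example layout in TensorRT-LLM (script names and flags vary between versions, and the paths here are placeholders; the linked doc is authoritative):

```
# 1. Convert the HF checkpoint into TRT-LLM format (produces encoder/ and decoder/).
python3 examples/enc_dec/convert_checkpoint.py --model_type t5 \
    --model_dir t5-small --output_dir /tmp/trt_models/t5-small --dtype float16
# 2. Build one engine per component; extra plugin flags may be required by your version.
trtllm-build --checkpoint_dir /tmp/trt_models/t5-small/encoder \
    --output_dir /tmp/trt_engines/t5-small/encoder
trtllm-build --checkpoint_dir /tmp/trt_models/t5-small/decoder \
    --output_dir /tmp/trt_engines/t5-small/decoder
```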
@oandreeva-nv can you help with this ^^?
@oandreeva-nv OK. Can there be any throughput/performance benefit from running an FP8 TRT engine file with FP16 I/O? Which Triton data type should be used with an FP8 TRT engine file in...
@oandreeva-nv can you confirm whether using FP16 I/O Triton data types with an FP8 TRT engine gives any benefit? Thanks
@oandreeva-nv Sure, I will explore perf_analyzer. Any idea whether to use the FP32 or FP16 I/O data type in Triton for TensorRT FP8 models?
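In case it is useful, one way to answer this is to inspect the engine's actual I/O bindings and mirror them in `config.pbtxt` — FP8 is typically internal to the engine (weights/compute), while the input/output tensors stay FP16 or FP32. A minimal sketch using Polygraphy and trtexec, assuming they are installed and `model.plan` is a placeholder name for the FP8 engine file:

```
# Print the engine's I/O tensors with their dtypes and shapes; the Triton
# datatypes in config.pbtxt (e.g. TYPE_FP16 vs TYPE_FP32) should match these.
polygraphy inspect model model.plan
# Alternatively, trtexec can load the engine and log its bindings verbosely.
trtexec --loadEngine=model.plan --verbose
```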