[Critical] Triton stops processing requests and crashes
Description: When running the latest Triton Inference Server, everything works fine. It can run normally for multiple hours, but then the server suddenly lags: GPU utilization sits at 100%, the pending request count grows until RAM is exhausted, and the server crashes.
Pending requests: [graph omitted]
GPU Utilization: [graph omitted]
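One way to track a growing backlog like the one shown above is to poll Triton's Prometheus metrics endpoint (by default `http://localhost:8002/metrics`). A minimal parsing sketch, assuming the metric is named `nv_inference_pending_request_count` with `model`/`version` labels (check your Triton version's metrics documentation; names may differ):

```python
import re

def pending_requests(metrics_text: str) -> dict:
    """Extract per-model pending request counts from Prometheus text output."""
    counts = {}
    for line in metrics_text.splitlines():
        # Assumed metric/label names; verify against your server's /metrics output.
        m = re.match(
            r'nv_inference_pending_request_count\{.*?model="([^"]+)".*?\}\s+([0-9.]+)',
            line,
        )
        if m:
            counts[m.group(1)] = float(m.group(2))
    return counts

sample = '''
nv_inference_pending_request_count{model="face-detector",version="17"} 42
nv_inference_pending_request_count{model="still-classifier",version="5"} 0
'''
print(pending_requests(sample))
```

Polling this once a minute and alerting when any model's count grows monotonically would catch the hang long before RAM fills up.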
Triton Information
What version of Triton are you using? 24.08
Are you using the Triton container or did you build it yourself? Triton Container
To Reproduce
This is hard to reproduce because it takes multiple hours or days: just multiple models being inferenced on the TensorRT backend. Possibly a race condition or a deadlock?
Describe the models: I have ~30 TensorRT models running. Inference requests arrive at random times. The server has 2 A5000 GPUs.
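To approximate this workload for reproduction, one could hammer the server with randomly interleaved requests to several models. A rough sketch using the `tritonclient` Python package (the model names, versions, input names, and shapes below are taken from the log excerpt later in this issue; the server URL `localhost:8000` is an assumption):

```python
import random
import numpy as np

# (model_name, version, input_name, shape, numpy dtype) -- from the posted logs
MODELS = [
    ("face-detector", "17", "data", (8, 1024, 1024, 3), np.uint8),
    ("camera-segmentation", "24", "input1", (1, 3, 1408, 1408), np.float32),
    ("still-classifier", "5", "input.1", (1, 3, 224, 224), np.float32),
]

def random_request():
    """Pick a random model and build a matching dummy input tensor."""
    name, version, inp, shape, dtype = random.choice(MODELS)
    data = (np.random.rand(*shape) * 255).astype(dtype)
    return name, version, inp, data

def run(n_requests=1000):
    """Fire n_requests random inferences at a live server (requires
    pip install tritonclient[http] and a Triton server on localhost:8000)."""
    import tritonclient.http as httpclient
    from tritonclient.utils import np_to_triton_dtype

    client = httpclient.InferenceServerClient(url="localhost:8000")
    for _ in range(n_requests):
        name, version, inp, data = random_request()
        tensor = httpclient.InferInput(inp, list(data.shape),
                                       np_to_triton_dtype(data.dtype))
        tensor.set_data_from_numpy(data)
        client.infer(name, [tensor], model_version=version)

name, version, inp, data = random_request()
print(name, data.shape)
```

Running `run()` from several threads or processes for hours would come closer to the reported traffic pattern than a single sequential loop.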
Here is a log from when the server hangs. In the log files (verbosity level 2), you can see that at one point the state no longer changes.
This is a normal scenario:
```
triton-server | I0924 12:01:22.876361 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED"
triton-server | I0924 12:01:22.876369 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feac75430] request id: , model: lama-inpaint-512, requested version: 1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0\noriginal inputs:\n[0x0x7f9feac9b0f8] input: input0, type: FP32, original shape: [1,4,512,512], batch + shape: [1,4,512,512], shape: [4,512,512]\noverride inputs:\ninputs:\n[0x0x7f9feac9b0f8] input: input0, type: FP32, original shape: [1,4,512,512], batch + shape: [1,4,512,512], shape: [4,512,512]\noriginal requested outputs:\nrequested outputs:\noutput0\n"
triton-server | I0924 12:01:22.876382 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
triton-server | I0924 12:01:22.876407 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from PENDING to EXECUTING"
```
Here you can see that most inference requests only reach the PENDING state and never transition to EXECUTING:
triton-server | I0924 12:01:23.263097 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:23.263104 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feac945b0] request id: , model: camera-segmentation, requested version: 24, actual version: 24, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feac94d88] input: input1, type: FP32, original shape: [1,3,1408,1408], batch + shape: [1,3,1408,1408], shape: [3,1408,1408]\noverride inputs:\ninputs:\n[0x0x7f9feac94d88] input: input1, type: FP32, original shape: [1,3,1408,1408], batch + shape: [1,3,1408,1408], shape: [3,1408,1408]\noriginal requested outputs:\nrequested outputs:\noutput1\n" triton-server | I0924 12:01:23.263117 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:23.432602 1 http_server.cc:4580] "HTTP request: 2 /v2/models/pano-pairing-matcher/versions/38/infer" triton-server | I0924 12:01:23.432618 1 model_lifecycle.cc:339] "GetModel() 'pano-pairing-matcher' version 38" triton-server | I0924 12:01:23.432625 1 model_lifecycle.cc:339] "GetModel() 'pano-pairing-matcher' version 38" triton-server | I0924 12:01:23.432650 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:23.432659 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feac76dc0] request id: , model: pano-pairing-matcher, requested version: 38, actual version: 38, flags: 0x0, correlation id: 0, batch size: 5, priority: 0, timeout (us): 0\noriginal inputs:\n[0x0x7f9feac73438] input: img1, type: FP32, original shape: [5,1,512,512], batch + shape: [5,1,512,512], shape: [1,512,512]\n[0x0x7f9feac94918] input: img0, type: FP32, original shape: [5,1,512,512], batch + shape: [5,1,512,512], shape: [1,512,512]\noverride 
inputs:\ninputs:\n[0x0x7f9feac94918] input: img0, type: FP32, original shape: [5,1,512,512], batch + shape: [5,1,512,512], shape: [1,512,512]\n[0x0x7f9feac73438] input: img1, type: FP32, original shape: [5,1,512,512], batch + shape: [5,1,512,512], shape: [1,512,512]\noriginal requested outputs:\nrequested outputs:\nmconf\nmkpts0_c\nmkpts1_c\n" triton-server | I0924 12:01:23.432675 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:23.432736 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from PENDING to EXECUTING" triton-server | I0924 12:01:23.432752 1 tensorrt.cc:390] "model pano-pairing-matcher, instance pano-pairing-matcher_0_0, executing 1 requests" triton-server | I0924 12:01:23.432758 1 instance_state.cc:359] "TRITONBACKEND_ModelExecute: Issuing pano-pairing-matcher_0_0 with 1 requests" triton-server | I0924 12:01:23.432764 1 instance_state.cc:408] "TRITONBACKEND_ModelExecute: Running pano-pairing-matcher_0_0 with 1 requests" triton-server | I0924 12:01:23.432790 1 instance_state.cc:1431] "Optimization profile default [0] is selected for pano-pairing-matcher_0_0" triton-server | I0924 12:01:23.432806 1 pinned_memory_manager.cc:198] "pinned memory allocation: size 5242880, addr 0x7fa0ae5a00d0" triton-server | I0924 12:01:23.433085 1 pinned_memory_manager.cc:198] "pinned memory allocation: size 5242880, addr 0x7fa0aeaa00e0" triton-server | I0924 12:01:23.433348 1 instance_state.cc:914] "Context with profile default [0] is being executed for pano-pairing-matcher_0_0" triton-server | I0924 12:01:23.992016 1 http_server.cc:4580] "HTTP request: 2 /v2/models/still-sky-segmentation/versions/3/infer" triton-server | I0924 12:01:23.992033 1 model_lifecycle.cc:339] "GetModel() 'still-sky-segmentation' version 3" triton-server | I0924 12:01:23.992043 1 model_lifecycle.cc:339] "GetModel() 'still-sky-segmentation' version 3" triton-server | I0924 12:01:23.992069 1 
infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:23.992076 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feac768f0] request id: , model: still-sky-segmentation, requested version: 3, actual version: 3, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feac99d28] input: scaled_input_uint8, type: UINT8, original shape: [1,3,2304,3072], batch + shape: [1,3,2304,3072], shape: [3,2304,3072]\noverride inputs:\ninputs:\n[0x0x7f9feac99d28] input: scaled_input_uint8, type: UINT8, original shape: [1,3,2304,3072], batch + shape: [1,3,2304,3072], shape: [3,2304,3072]\noriginal requested outputs:\nrequested outputs:\n3003\n" triton-server | I0924 12:01:23.992089 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:24.021985 1 http_server.cc:4580] "HTTP request: 2 /v2/models/camera-segmentation/versions/24/infer" triton-server | I0924 12:01:24.021999 1 model_lifecycle.cc:339] "GetModel() 'camera-segmentation' version 24" triton-server | I0924 12:01:24.022005 1 model_lifecycle.cc:339] "GetModel() 'camera-segmentation' version 24" triton-server | I0924 12:01:24.022024 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:24.022033 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feac6fcd0] request id: , model: camera-segmentation, requested version: 24, actual version: 24, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feac75aa8] input: input1, type: FP32, original shape: [1,3,1408,1408], batch + shape: [1,3,1408,1408], shape: [3,1408,1408]\noverride inputs:\ninputs:\n[0x0x7f9feac75aa8] input: input1, type: FP32, original shape: [1,3,1408,1408], batch + shape: [1,3,1408,1408], shape: [3,1408,1408]\noriginal 
requested outputs:\nrequested outputs:\noutput1\n" triton-server | I0924 12:01:24.022046 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:24.477916 1 http_server.cc:4580] "HTTP request: 2 /v2/models/window-crop-segmentation/versions/2/infer" triton-server | I0924 12:01:24.477932 1 model_lifecycle.cc:339] "GetModel() 'window-crop-segmentation' version 2" triton-server | I0924 12:01:24.477940 1 model_lifecycle.cc:339] "GetModel() 'window-crop-segmentation' version 2" triton-server | I0924 12:01:24.477962 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:24.477970 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feac98dd0] request id: , model: window-crop-segmentation, requested version: 2, actual version: 2, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feac75be8] input: input, type: FP32, original shape: [1,1408,1408,3], batch + shape: [1,1408,1408,3], shape: [1408,1408,3]\noverride inputs:\ninputs:\n[0x0x7f9feac75be8] input: input, type: FP32, original shape: [1,1408,1408,3], batch + shape: [1,1408,1408,3], shape: [1408,1408,3]\noriginal requested outputs:\nrequested outputs:\nIdentity:0\n" triton-server | I0924 12:01:24.477983 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:24.583332 1 http_server.cc:4580] "HTTP request: 2 /v2/models/face-detector/versions/17/infer" triton-server | I0924 12:01:24.583345 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:24.583352 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:24.583375 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:24.583382 1 
infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feac98450] request id: , model: face-detector, requested version: 17, actual version: 17, flags: 0x0, correlation id: 0, batch size: 8, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feac759a8] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noverride inputs:\ninputs:\n[0x0x7f9feac759a8] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noriginal requested outputs:\nrequested outputs:\nIdentity:0\nIdentity_1:0\nIdentity_2:0\nIdentity_3:0\nIdentity_4:0\nIdentity_5:0\nIdentity_6:0\nIdentity_7:0\nIdentity_8:0\n" triton-server | I0924 12:01:24.583396 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:24.583452 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from PENDING to EXECUTING" triton-server | I0924 12:01:24.583490 1 tensorrt.cc:390] "model face-detector, instance face-detector_0_0, executing 1 requests" triton-server | I0924 12:01:24.583496 1 instance_state.cc:359] "TRITONBACKEND_ModelExecute: Issuing face-detector_0_0 with 1 requests" triton-server | I0924 12:01:24.583501 1 instance_state.cc:408] "TRITONBACKEND_ModelExecute: Running face-detector_0_0 with 1 requests" triton-server | I0924 12:01:24.583536 1 instance_state.cc:1431] "Optimization profile default [0] is selected for face-detector_0_0" triton-server | I0924 12:01:24.583555 1 pinned_memory_manager.cc:198] "pinned memory allocation: size 25165824, addr 0x7fa0aefa00f0" triton-server | I0924 12:01:24.588746 1 instance_state.cc:914] "Context with profile default [0] is being executed for face-detector_0_0" triton-server | I0924 12:01:24.937789 1 http_server.cc:4580] "HTTP request: 2 /v2/models/face-detector/versions/17/infer" triton-server | I0924 12:01:24.937804 1 model_lifecycle.cc:339] "GetModel() 
'face-detector' version 17" triton-server | I0924 12:01:24.937811 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:24.937833 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:24.937838 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feac96b60] request id: , model: face-detector, requested version: 17, actual version: 17, flags: 0x0, correlation id: 0, batch size: 8, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feac97488] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noverride inputs:\ninputs:\n[0x0x7f9feac97488] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noriginal requested outputs:\nrequested outputs:\nIdentity:0\nIdentity_1:0\nIdentity_2:0\nIdentity_3:0\nIdentity_4:0\nIdentity_5:0\nIdentity_6:0\nIdentity_7:0\nIdentity_8:0\n" triton-server | I0924 12:01:24.937852 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:25.140624 1 http_server.cc:4580] "HTTP request: 0 /v2/models/room-layout-estimation-decoder/versions/3/ready" triton-server | I0924 12:01:25.140639 1 model_lifecycle.cc:339] "GetModel() 'room-layout-estimation-decoder' version 3" triton-server | I0924 12:01:25.266410 1 http_server.cc:4580] "HTTP request: 2 /v2/models/face-detector/versions/17/infer" triton-server | I0924 12:01:25.266424 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:25.266433 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:25.266460 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:25.266470 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: 
[0x0x7f9feac9c030] request id: , model: face-detector, requested version: 17, actual version: 17, flags: 0x0, correlation id: 0, batch size: 8, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feac88938] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noverride inputs:\ninputs:\n[0x0x7f9feac88938] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noriginal requested outputs:\nrequested outputs:\nIdentity:0\nIdentity_1:0\nIdentity_2:0\nIdentity_3:0\nIdentity_4:0\nIdentity_5:0\nIdentity_6:0\nIdentity_7:0\nIdentity_8:0\n" triton-server | I0924 12:01:25.266489 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:25.447365 1 http_server.cc:4580] "HTTP request: 2 /v2/models/face-detector/versions/17/infer" triton-server | I0924 12:01:25.447378 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:25.447386 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:25.447411 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:25.447421 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feac89190] request id: , model: face-detector, requested version: 17, actual version: 17, flags: 0x0, correlation id: 0, batch size: 8, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feac886f8] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noverride inputs:\ninputs:\n[0x0x7f9feac886f8] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noriginal requested outputs:\nrequested 
outputs:\nIdentity:0\nIdentity_1:0\nIdentity_2:0\nIdentity_3:0\nIdentity_4:0\nIdentity_5:0\nIdentity_6:0\nIdentity_7:0\nIdentity_8:0\n" triton-server | I0924 12:01:25.447439 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:25.546114 1 http_server.cc:4580] "HTTP request: 2 /v2/models/face-detector/versions/17/infer" triton-server | I0924 12:01:25.546128 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:25.546135 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:25.546155 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:25.546160 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feac8a5c0] request id: , model: face-detector, requested version: 17, actual version: 17, flags: 0x0, correlation id: 0, batch size: 8, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feac8b0b8] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noverride inputs:\ninputs:\n[0x0x7f9feac8b0b8] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noriginal requested outputs:\nrequested outputs:\nIdentity:0\nIdentity_1:0\nIdentity_2:0\nIdentity_3:0\nIdentity_4:0\nIdentity_5:0\nIdentity_6:0\nIdentity_7:0\nIdentity_8:0\n" triton-server | I0924 12:01:25.546174 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:25.674756 1 http_server.cc:4580] "HTTP request: 2 /v2/models/face-detector/versions/17/infer" triton-server | I0924 12:01:25.674769 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:25.674776 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:25.674795 
1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:25.674801 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feac9dcb0] request id: , model: face-detector, requested version: 17, actual version: 17, flags: 0x0, correlation id: 0, batch size: 8, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feac8c9a8] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noverride inputs:\ninputs:\n[0x0x7f9feac8c9a8] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noriginal requested outputs:\nrequested outputs:\nIdentity:0\nIdentity_1:0\nIdentity_2:0\nIdentity_3:0\nIdentity_4:0\nIdentity_5:0\nIdentity_6:0\nIdentity_7:0\nIdentity_8:0\n" triton-server | I0924 12:01:25.674814 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:25.768096 1 http_server.cc:4580] "HTTP request: 2 /v2/models/face-detector/versions/17/infer" triton-server | I0924 12:01:25.768109 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:25.768116 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:25.768136 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:25.768142 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feac9e930] request id: , model: face-detector, requested version: 17, actual version: 17, flags: 0x0, correlation id: 0, batch size: 8, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feac9e718] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noverride inputs:\ninputs:\n[0x0x7f9feac9e718] input: data, type: UINT8, original shape: [8,1024,1024,3], batch 
+ shape: [8,1024,1024,3], shape: [1024,1024,3]\noriginal requested outputs:\nrequested outputs:\nIdentity:0\nIdentity_1:0\nIdentity_2:0\nIdentity_3:0\nIdentity_4:0\nIdentity_5:0\nIdentity_6:0\nIdentity_7:0\nIdentity_8:0\n" triton-server | I0924 12:01:25.768156 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:25.933164 1 http_server.cc:4580] "HTTP request: 2 /v2/models/face-detector/versions/17/infer" triton-server | I0924 12:01:25.933177 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:25.933184 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:25.933203 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:25.933208 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feaca0690] request id: , model: face-detector, requested version: 17, actual version: 17, flags: 0x0, correlation id: 0, batch size: 8, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feac9f398] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noverride inputs:\ninputs:\n[0x0x7f9feac9f398] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noriginal requested outputs:\nrequested outputs:\nIdentity:0\nIdentity_1:0\nIdentity_2:0\nIdentity_3:0\nIdentity_4:0\nIdentity_5:0\nIdentity_6:0\nIdentity_7:0\nIdentity_8:0\n" triton-server | I0924 12:01:25.933221 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:26.139904 1 http_server.cc:4580] "HTTP request: 2 /v2/models/still-classifier/versions/5/infer" triton-server | I0924 12:01:26.139919 1 model_lifecycle.cc:339] "GetModel() 'still-classifier' version 5" triton-server | I0924 12:01:26.139927 1 
model_lifecycle.cc:339] "GetModel() 'still-classifier' version 5" triton-server | I0924 12:01:26.139946 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:26.139952 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feaca1ab0] request id: , model: still-classifier, requested version: 5, actual version: 5, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feaca1348] input: input.1, type: FP32, original shape: [1,3,224,224], batch + shape: [1,3,224,224], shape: [3,224,224]\noverride inputs:\ninputs:\n[0x0x7f9feaca1348] input: input.1, type: FP32, original shape: [1,3,224,224], batch + shape: [1,3,224,224], shape: [3,224,224]\noriginal requested outputs:\nrequested outputs:\nsigmoid\n" triton-server | I0924 12:01:26.139964 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:26.139987 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from PENDING to EXECUTING" triton-server | I0924 12:01:26.140004 1 tensorrt.cc:390] "model still-classifier, instance still-classifier_0_0, executing 1 requests" triton-server | I0924 12:01:26.140010 1 instance_state.cc:359] "TRITONBACKEND_ModelExecute: Issuing still-classifier_0_0 with 1 requests" triton-server | I0924 12:01:26.140016 1 instance_state.cc:408] "TRITONBACKEND_ModelExecute: Running still-classifier_0_0 with 1 requests" triton-server | I0924 12:01:26.140046 1 instance_state.cc:1431] "Optimization profile default [0] is selected for still-classifier_0_0" triton-server | I0924 12:01:26.140066 1 pinned_memory_manager.cc:198] "pinned memory allocation: size 602112, addr 0x7fa0b07a0100" triton-server | I0924 12:01:26.140144 1 instance_state.cc:914] "Context with profile default [0] is being executed for still-classifier_0_0" triton-server | I0924 12:01:26.231847 1 http_server.cc:4580] "HTTP 
request: 2 /v2/models/face-detector/versions/17/infer" triton-server | I0924 12:01:26.231861 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:26.231869 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:26.231895 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:26.231904 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feaca39d0] request id: , model: face-detector, requested version: 17, actual version: 17, flags: 0x0, correlation id: 0, batch size: 8, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feaca2358] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noverride inputs:\ninputs:\n[0x0x7f9feaca2358] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noriginal requested outputs:\nrequested outputs:\nIdentity:0\nIdentity_1:0\nIdentity_2:0\nIdentity_3:0\nIdentity_4:0\nIdentity_5:0\nIdentity_6:0\nIdentity_7:0\nIdentity_8:0\n" triton-server | I0924 12:01:26.231923 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:26.428465 1 http_server.cc:4580] "HTTP request: 2 /v2/models/lama-inpaint-1200/versions/1/infer" triton-server | I0924 12:01:26.428479 1 model_lifecycle.cc:339] "GetModel() 'lama-inpaint-1200' version 1" triton-server | I0924 12:01:26.428486 1 model_lifecycle.cc:339] "GetModel() 'lama-inpaint-1200' version 1" triton-server | I0924 12:01:26.428508 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:26.428517 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feaca50f0] request id: , model: lama-inpaint-1200, requested version: 1, actual version: 1, flags: 0x0, correlation id: 
0, batch size: 1, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feaca4668] input: input0, type: FP32, original shape: [1,4,1200,1200], batch + shape: [1,4,1200,1200], shape: [4,1200,1200]\noverride inputs:\ninputs:\n[0x0x7f9feaca4668] input: input0, type: FP32, original shape: [1,4,1200,1200], batch + shape: [1,4,1200,1200], shape: [4,1200,1200]\noriginal requested outputs:\nrequested outputs:\noutput0\n" triton-server | I0924 12:01:26.428535 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:26.428570 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from PENDING to EXECUTING" triton-server | I0924 12:01:26.428596 1 tensorrt.cc:390] "model lama-inpaint-1200, instance lama-inpaint-1200_0_0, executing 1 requests" triton-server | I0924 12:01:26.428604 1 instance_state.cc:359] "TRITONBACKEND_ModelExecute: Issuing lama-inpaint-1200_0_0 with 1 requests" triton-server | I0924 12:01:26.428611 1 instance_state.cc:408] "TRITONBACKEND_ModelExecute: Running lama-inpaint-1200_0_0 with 1 requests" triton-server | I0924 12:01:26.428639 1 instance_state.cc:1431] "Optimization profile default [0] is selected for lama-inpaint-1200_0_0" triton-server | I0924 12:01:26.428658 1 pinned_memory_manager.cc:198] "pinned memory allocation: size 23040000, addr 0x7fa0b0833110" triton-server | I0924 12:01:26.433693 1 instance_state.cc:914] "Context with profile default [0] is being executed for lama-inpaint-1200_0_0" triton-server | I0924 12:01:26.503961 1 http_server.cc:4580] "HTTP request: 2 /v2/models/face-detector/versions/17/infer" triton-server | I0924 12:01:26.503975 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:26.503982 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17" triton-server | I0924 12:01:26.504002 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | 
I0924 12:01:26.504008 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feaca6e20] request id: , model: face-detector, requested version: 17, actual version: 17, flags: 0x0, correlation id: 0, batch size: 8, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feaca5af8] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noverride inputs:\ninputs:\n[0x0x7f9feaca5af8] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noriginal requested outputs:\nrequested outputs:\nIdentity:0\nIdentity_1:0\nIdentity_2:0\nIdentity_3:0\nIdentity_4:0\nIdentity_5:0\nIdentity_6:0\nIdentity_7:0\nIdentity_8:0\n" triton-server | I0924 12:01:26.504022 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING" triton-server | I0924 12:01:26.523967 1 infer_response.cc:174] "add response output: output: output0, type: FP32, shape: [1,3,1200,1200]" triton-server | I0924 12:01:26.523984 1 http_server.cc:1259] "HTTP: unable to provide 'output0' in GPU, will use CPU" triton-server | I0924 12:01:26.524016 1 http_server.cc:1279] "HTTP using buffer for: 'output0', size: 17280000, addr: 0x7f9189fff040" triton-server | I0924 12:01:26.524029 1 pinned_memory_manager.cc:198] "pinned memory allocation: size 17280000, addr 0x7fa0b1e2c120" triton-server | I0924 12:01:26.534385 1 http_server.cc:4580] "HTTP request: 2 /v2/models/camera-segmentation/versions/24/infer" triton-server | I0924 12:01:26.534397 1 model_lifecycle.cc:339] "GetModel() 'camera-segmentation' version 24" triton-server | I0924 12:01:26.534403 1 model_lifecycle.cc:339] "GetModel() 'camera-segmentation' version 24" triton-server | I0924 12:01:26.534417 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED" triton-server | I0924 12:01:26.534422 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: 
[0x0x7f9feaca6720] request id: , model: camera-segmentation, requested version: 24, actual version: 24, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feaca7be8] input: input1, type: FP32, original shape: [1,3,1408,1408], batch + shape: [1,3,1408,1408], shape: [3,1408,1408]\noverride inputs:\ninputs:\n[0x0x7f9feaca7be8] input: input1, type: FP32, original shape: [1,3,1408,1408], batch + shape: [1,3,1408,1408], shape: [3,1408,1408]\noriginal requested outputs:\nrequested outputs:\noutput1\n"
triton-server | I0924 12:01:26.534433 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
triton-server | I0924 12:01:26.652851 1 http_server.cc:1353] "HTTP release: size 17280000, addr 0x7f9189fff040"
triton-server | I0924 12:01:26.652884 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from EXECUTING to RELEASED"
triton-server | I0924 12:01:26.652901 1 instance_state.cc:1314] "TRITONBACKEND_ModelExecute: model lama-inpaint-1200_0_0 released 1 requests"
triton-server | I0924 12:01:26.652909 1 pinned_memory_manager.cc:226] "pinned memory deallocation: addr 0x7fa0b1e2c120"
triton-server | I0924 12:01:26.652918 1 pinned_memory_manager.cc:226] "pinned memory deallocation: addr 0x7fa0b0833110"
triton-server | I0924 12:01:26.811605 1 http_server.cc:4580] "HTTP request: 2 /v2/models/window-crop-segmentation/versions/2/infer"
triton-server | I0924 12:01:26.811620 1 model_lifecycle.cc:339] "GetModel() 'window-crop-segmentation' version 2"
triton-server | I0924 12:01:26.811628 1 model_lifecycle.cc:339] "GetModel() 'window-crop-segmentation' version 2"
triton-server | I0924 12:01:26.811647 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED"
triton-server | I0924 12:01:26.811652 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feac7a130] request id: , model: window-crop-segmentation, requested version: 2, actual version: 2, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feaca2598] input: input, type: FP32, original shape: [1,1408,1408,3], batch + shape: [1,1408,1408,3], shape: [1408,1408,3]\noverride inputs:\ninputs:\n[0x0x7f9feaca2598] input: input, type: FP32, original shape: [1,1408,1408,3], batch + shape: [1,1408,1408,3], shape: [1408,1408,3]\noriginal requested outputs:\nrequested outputs:\nIdentity:0\n"
triton-server | I0924 12:01:26.811665 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
triton-server | I0924 12:01:27.032049 1 http_server.cc:4580] "HTTP request: 2 /v2/models/face-detector/versions/17/infer"
triton-server | I0924 12:01:27.032063 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17"
triton-server | I0924 12:01:27.032070 1 model_lifecycle.cc:339] "GetModel() 'face-detector' version 17"
triton-server | I0924 12:01:27.032089 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED"
triton-server | I0924 12:01:27.032094 1 infer_request.cc:905] "[request id: <id_unknown>] prepared: [0x0x7f9feac787d0] request id: , model: face-detector, requested version: 17, actual version: 17, flags: 0x0, correlation id: 0, batch size: 8, priority: 0, timeout (us): 60000\noriginal inputs:\n[0x0x7f9feaca56d8] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noverride inputs:\ninputs:\n[0x0x7f9feaca56d8] input: data, type: UINT8, original shape: [8,1024,1024,3], batch + shape: [8,1024,1024,3], shape: [1024,1024,3]\noriginal requested outputs:\nrequested outputs:\nIdentity:0\nIdentity_1:0\nIdentity_2:0\nIdentity_3:0\nIdentity_4:0\nIdentity_5:0\nIdentity_6:0\nIdentity_7:0\nIdentity_8:0\n"
triton-server | I0924 12:01:27.032107 1 infer_request.cc:132] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
triton-server | I0924 12:01:27.135319 1 http_server.cc:4580] "HTTP request: 2 /v2/models/lama-inpaint-1200/versions/1/infer"
triton-server | I0924 12:01:27.135333 1 model_lifecycle.cc:339] "GetModel() 'lama-inpaint-1200' version 1"
triton-server | I0924 12:01:27.135340 1 model_lifecycle.cc:339] "GetModel() 'lama-inpaint-1200' version 1"
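Since every request in this log carries the placeholder id `<id_unknown>`, individual requests cannot be tracked by id. One can still quantify the hang by counting state transitions: in a healthy server, every PENDING to EXECUTING transition is eventually matched by an EXECUTING to RELEASED transition, so a growing gap between the two counts signals stuck requests. A minimal sketch for scanning a verbosity-2 log (this is an illustration, not a Triton tool; the regex only targets the `infer_request.cc` state-transition lines shown above):

```python
import re
from collections import Counter

# Matches Triton verbose-log lines such as:
#   infer_request.cc:132] "[request id: <id_unknown>] Setting state from PENDING to EXECUTING"
STATE_RE = re.compile(r'Setting state from (?P<src>[A-Z]+) to (?P<dst>[A-Z]+)')

def transition_counts(log_lines):
    """Count every request state transition seen in the log."""
    counts = Counter()
    for line in log_lines:
        m = STATE_RE.search(line)
        if m:
            counts[(m.group("src"), m.group("dst"))] += 1
    return counts

def stuck_in_flight(counts):
    """Requests that entered EXECUTING but were never RELEASED.

    A value that keeps growing as the log progresses matches the
    symptom described in this issue.
    """
    return counts[("PENDING", "EXECUTING")] - counts[("EXECUTING", "RELEASED")]
```

Run over a full log file (not just the short excerpt above), a steadily increasing `stuck_in_flight` value pinpoints roughly when the server stopped releasing requests.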
Expected behavior The server should keep processing inference requests indefinitely, without hanging, pinning GPU utilization at 100%, or accumulating pending requests until RAM is exhausted.
NVIDIA - please fix.
Thanks @appearancefnp, I've created an internal ticket for the team to investigate. What types of models do you use? That would help us reproduce the issue more easily.
@appearancefnp, if possible, could you please share models that reproduce the issue? Could you also try running Triton Server 24.04 (before the TensorRT 10 upgrade) and confirm whether the issue still occurs? Thank you.
@oandreeva-nv I use mostly CNN models, plus 2 Transformer-based models, and quite a lot of them in total. These issues were not present in Triton Server 24.01 (I haven't tested other versions). @pskiran1 How can we share the models securely?
These issues were not present in Triton Server 24.01 (haven't tested other versions).
@appearancefnp, thanks for letting us know. We believe this issue is similar to a known TensorRT issue introduced in 24.05; we will try to confirm this on our end.
How can we share the models securely?
I believe you can share them via Google Drive with no public access. We can request access, and once we notify you that we have downloaded the models, you can delete them from Google Drive to avoid further access requests. Please share the models that reproduce the issue, the client, and the necessary steps with us. Thank you.
Hello,
We have a similar issue starting with Triton 24.05. It's reassuring to see we are not the only ones with this problem. We have already filed an NVBug, but I will post some details about our setup here in case they are of use to the author of this issue.
NVIDIA Bug ID: 4765405
Bug Description We are encountering an inference hang issue when deploying models using TensorRT with Triton Server. The key details are:
Affected Models:
- VAE encoder of Stable Diffusion v2 (SDv2)
- Denoiser of SDv2
- Other non-diffusion models also experience similar issues.
Symptoms:
- After handling a few successful inference requests, GPU utilization spikes to 100% and the server becomes unresponsive.
- Inference requests remain in the PENDING/EXECUTING state without progressing.
- The issue is reproducible with Triton Server versions 24.05 and later (incorporating TensorRT 10.3 and 10.5); earlier versions do not exhibit this problem.
- Deploying multiple instances of the same Python model on a single GPU worsens the issue.
- With Triton Server 24.07 and TensorRT 10.3, the problem manifests more quickly.
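A deployment-side way to catch this condition before RAM fills up is to scrape Triton's Prometheus metrics endpoint (port 8002 by default) and alert when the pending-request gauge grows without bound. A small parsing sketch for the Prometheus text exposition format, assuming the `nv_inference_pending_request_count` gauge exposed by recent Triton versions (the host, port, and any alert threshold are deployment-specific and not part of this issue):

```python
import re

# Extracts nv_inference_pending_request_count per model from the body of a
# GET http://<host>:8002/metrics scrape, summing over model versions.
PENDING_RE = re.compile(
    r'nv_inference_pending_request_count\{[^}]*model="(?P<model>[^"]+)"[^}]*\}'
    r'\s+(?P<value>[0-9.eE+-]+)'
)

def pending_counts(metrics_text):
    """Return {model_name: total pending requests} from one metrics scrape."""
    totals = {}
    for m in PENDING_RE.finditer(metrics_text):
        totals[m.group("model")] = (
            totals.get(m.group("model"), 0.0) + float(m.group("value"))
        )
    return totals
```

Sampling this every few seconds and alerting when a model's count only ever increases would flag the hang long before the server runs out of memory and crashes.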
Reproducible Steps Environment Setup:
- Utilize Triton Server version 24.05 or later (with TensorRT versions 10.3 or 10.5).
- Deploy the SDv2 VAE encoder and denoiser as TensorRT models using Triton Server containers (e.g., NGC 24.07).
- Implement two simple BLS models: one that encodes images to latents and one that runs the diffusion steps.
Deployment Configuration:
- Run one Triton container per GPU.
- Deploy multiple instances of the problematic Python models on each GPU.
Model Invocation:
- Send inference requests to the deployed models.
- Observe that after 2-3 retries or several minutes, the server hangs with GPU utilization maxed out.
Workaround Attempts:
- Reduce --builderOptimizationLevel to 1: lowering the optimization level mitigates the issue, but the resulting performance regression leads to unacceptable latencies in production.
- Set CUDA_LAUNCH_BLOCKING=1: enabling synchronous CUDA operations helps identify CUDA errors, but, as above, the performance degradation leads to unacceptable latencies.
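For reference, both workarounds are applied outside of any model code: `CUDA_LAUNCH_BLOCKING` is read by the CUDA runtime at process start (it cannot be toggled after initialization), and the optimization level is a TensorRT engine build-time flag. A minimal launcher sketch showing how the environment variable could be injected (the `tritonserver` invocation and model-repository path are illustrative assumptions):

```python
import os
import subprocess  # used only in the commented-out launch example below

def triton_env(base_env=None):
    """Build an environment for a Triton process with synchronous CUDA launches."""
    env = dict(base_env if base_env is not None else os.environ)
    # CUDA_LAUNCH_BLOCKING must be set before the CUDA runtime initializes,
    # i.e. before the server process starts.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env

# Hypothetical launch (not executed here):
# subprocess.run(["tritonserver", "--model-repository=/models"], env=triton_env())
```

Because the variable serializes every kernel launch, this is a debugging aid rather than a production configuration, which is exactly why it was rejected as a workaround above.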
In our case the issue seems to come from TensorRT, and we have been told that NVIDIA is working on a fix.
@appearancefnp, @MatthieuToulemont, Triton Server 24.11 is now available and includes TensorRT 10.6, which addresses the above issue. Please let us know if you continue to experience the same problem.
@pskiran1 Hey! We updated our production servers to 24.11, but the issue still remains :(
@appearancefnp @pskiran1 @MatthieuToulemont Can you please confirm whether this issue is resolved by upgrading to 24.11? We are facing the same issue in 24.07.