CUDA runtime error in cudaDeviceGetDefaultMemPool
Description When deploying a Hugging Face model in Triton Server with tensorrtllm_backend, I always get a CUDA runtime error in cudaDeviceGetDefaultMemPool. The model was successfully converted with TensorRT-LLM (i.e. inference with the model engines works in the TRT-LLM container).
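As far as I can tell, cudaDeviceGetDefaultMemPool returns "operation not supported" when the device does not report support for stream-ordered memory pools, which some virtualized GPUs do not expose. A minimal diagnostic sketch (my own, not part of TensorRT-LLM) to check this per device; compile with nvcc check_mempool.cu -o check_mempool:

// check_mempool.cu: report whether each visible GPU supports the
// stream-ordered allocator that cudaDeviceGetDefaultMemPool relies on.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        printf("cudaGetDeviceCount failed\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        int pools = 0;
        // 0 on devices (e.g. some vGPU profiles) where the default
        // memory pool is unavailable; 1 where it is supported.
        cudaDeviceGetAttribute(&pools, cudaDevAttrMemoryPoolsSupported, dev);
        printf("device %d: cudaDevAttrMemoryPoolsSupported = %d\n", dev, pools);
    }
    return 0;
}

If the vGPU configuration is the culprit, I would expect this to print 0 for all four GRID devices listed below.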
System Ubuntu 22.04.4 LTS, Driver Version: 535.104.05, CUDA Version: 12.2
➜ tensorrtllm_backend git:(release/0.5.0) ✗ nvidia-smi
Tue Mar 5 19:47:15 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 GRID A100DX-80C On | 00000000:02:04.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 GRID A100DX-80C On | 00000000:02:05.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 GRID A100DX-80C On | 00000000:02:06.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 GRID A100DX-80C On | 00000000:02:07.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Triton Information nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3. I tried both the container from NGC and one built from the tensorrtllm_backend repo; same behavior. I also tried newer versions and different driver/CUDA combinations, always with the same behavior.
To Reproduce I followed this tutorial exactly, using Llama-2-13b-chat-hf with the configuration from the tutorial: https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/. Running the model engines in the TensorRT-LLM container works fine (I can see activity on all 4 GPUs when I call nvidia-smi during inference). When I try to run Triton Server, I get the error shown under Actual behavior below.
Expected behavior Something like this:
+----------------------+---------+--------+
| Model | Version | Status |
+----------------------+---------+--------+
| <model_name> | <v> | READY |
| .. | . | .. |
| .. | . | .. |
+----------------------+---------+--------+
...
Actual behavior
root@hal3000:/opt/tritonserver# python /opt/scripts/launch_triton_server.py --model_repo /all_models/inflight_batcher_llm --world_size 4
root@hal3000:/opt/tritonserver# I0305 19:29:53.282831 117 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x10020000000' with size 268435456
I0305 19:29:53.289096 116 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x10020000000' with size 268435456
I0305 19:29:53.295001 119 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x10020000000' with size 268435456
I0305 19:29:53.300931 118 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x10020000000' with size 268435456
I0305 19:29:53.306811 117 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0305 19:29:53.306831 117 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0305 19:29:53.306834 117 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0305 19:29:53.306836 117 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0305 19:29:53.306827 116 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0305 19:29:53.306834 116 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0305 19:29:53.306836 116 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0305 19:29:53.306838 116 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0305 19:29:53.306867 119 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0305 19:29:53.306876 119 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0305 19:29:53.306878 119 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0305 19:29:53.306880 119 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0305 19:29:53.307080 118 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0305 19:29:53.307099 118 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0305 19:29:53.307101 118 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0305 19:29:53.307103 118 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0305 19:29:54.379254 117 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0305 19:29:54.379321 117 model_lifecycle.cc:461] loading: preprocessing:1
I0305 19:29:54.379370 117 model_lifecycle.cc:461] loading: postprocessing:1
I0305 19:29:54.393911 116 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0305 19:29:54.393956 116 model_lifecycle.cc:461] loading: preprocessing:1
I0305 19:29:54.393986 116 model_lifecycle.cc:461] loading: postprocessing:1
I0305 19:29:54.398459 119 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0305 19:29:54.398457 118 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0305 19:29:54.398534 119 model_lifecycle.cc:461] loading: preprocessing:1
I0305 19:29:54.398537 118 model_lifecycle.cc:461] loading: preprocessing:1
I0305 19:29:54.398585 118 model_lifecycle.cc:461] loading: postprocessing:1
I0305 19:29:54.398615 119 model_lifecycle.cc:461] loading: postprocessing:1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I0305 19:29:54.423351 117 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0305 19:29:54.423376 117 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I0305 19:29:54.433686 116 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0305 19:29:54.433683 116 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I0305 19:29:54.447732 119 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I0305 19:29:54.447770 119 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0305 19:29:54.448162 118 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I0305 19:29:54.448289 118 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
[TensorRT-LLM][INFO] MPI size: 4, rank: 3
[TensorRT-LLM][INFO] MPI size: 4, rank: 0
[TensorRT-LLM][INFO] MPI size: 4, rank: 1
[TensorRT-LLM][INFO] MPI size: 4, rank: 2
Downloading tokenizer_config.json: 100%|██████████| 1.62k/1.62k [00:00<00:00, 7.54MB/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 8.91MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 414/414 [00:00<00:00, 3.33MB/s]
Downloading tokenizer.json: 100%|██████████| 1.84M/1.84M [00:00<00:00, 10.1MB/s]
I0305 19:29:56.596386 117 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I0305 19:29:56.613390 118 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I0305 19:29:56.616850 119 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I0305 19:29:56.627085 117 model_lifecycle.cc:818] successfully loaded 'preprocessing'
I0305 19:29:56.628228 116 model_lifecycle.cc:818] successfully loaded 'preprocessing'
I0305 19:29:56.666116 118 model_lifecycle.cc:818] successfully loaded 'preprocessing'
I0305 19:29:56.732012 119 model_lifecycle.cc:818] successfully loaded 'preprocessing'
I0305 19:29:56.793750 116 model_lifecycle.cc:818] successfully loaded 'postprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 4
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
E0305 19:29:57.029880 119 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/bufferManager.cpp:170)
1 0x7f981281e045 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x36045) [0x7f981281e045]
2 0x7f9812873925 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8b925) [0x7f9812873925]
3 0x7f98128747bf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8c7bf) [0x7f98128747bf]
4 0x7f98128d5545 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xed545) [0x7f98128d5545]
5 0x7f981284ef4e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66f4e) [0x7f981284ef4e]
6 0x7f981283ec0c /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x56c0c) [0x7f981283ec0c]
7 0x7f98128395f5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x515f5) [0x7f98128395f5]
8 0x7f98128374db /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4f4db) [0x7f98128374db]
9 0x7f981281b182 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x33182) [0x7f981281b182]
10 0x7f981281b235 TRITONBACKEND_ModelInstanceInitialize + 101
11 0x7f989ab9aa86 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a4a86) [0x7f989ab9aa86]
12 0x7f989ab9bcc6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a5cc6) [0x7f989ab9bcc6]
13 0x7f989ab7ec15 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188c15) [0x7f989ab7ec15]
14 0x7f989ab7f256 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x189256) [0x7f989ab7f256]
15 0x7f989ab8b27d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19527d) [0x7f989ab8b27d]
16 0x7f989a1f9ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f989a1f9ee8]
17 0x7f989ab7597b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17f97b) [0x7f989ab7597b]
18 0x7f989ab85695 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18f695) [0x7f989ab85695]
19 0x7f989ab8a50b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19450b) [0x7f989ab8a50b]
20 0x7f989ac73610 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x27d610) [0x7f989ac73610]
21 0x7f989ac76d03 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x280d03) [0x7f989ac76d03]
22 0x7f989adc38b2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3cd8b2) [0x7f989adc38b2]
23 0x7f989a464253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f989a464253]
24 0x7f989a1f4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f989a1f4ac3]
25 0x7f989a285bf4 clone + 68
E0305 19:29:57.030002 119 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/bufferManager.cpp:170)
1 0x7f981281e045 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x36045) [0x7f981281e045]
2 0x7f9812873925 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8b925) [0x7f9812873925]
3 0x7f98128747bf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8c7bf) [0x7f98128747bf]
4 0x7f98128d5545 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xed545) [0x7f98128d5545]
5 0x7f981284ef4e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66f4e) [0x7f981284ef4e]
6 0x7f981283ec0c /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x56c0c) [0x7f981283ec0c]
7 0x7f98128395f5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x515f5) [0x7f98128395f5]
8 0x7f98128374db /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4f4db) [0x7f98128374db]
9 0x7f981281b182 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x33182) [0x7f981281b182]
10 0x7f981281b235 TRITONBACKEND_ModelInstanceInitialize + 101
11 0x7f989ab9aa86 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a4a86) [0x7f989ab9aa86]
12 0x7f989ab9bcc6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a5cc6) [0x7f989ab9bcc6]
13 0x7f989ab7ec15 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188c15) [0x7f989ab7ec15]
14 0x7f989ab7f256 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x189256) [0x7f989ab7f256]
15 0x7f989ab8b27d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19527d) [0x7f989ab8b27d]
16 0x7f989a1f9ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f989a1f9ee8]
17 0x7f989ab7597b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17f97b) [0x7f989ab7597b]
18 0x7f989ab85695 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18f695) [0x7f989ab85695]
19 0x7f989ab8a50b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19450b) [0x7f989ab8a50b]
20 0x7f989ac73610 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x27d610) [0x7f989ac73610]
21 0x7f989ac76d03 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x280d03) [0x7f989ac76d03]
22 0x7f989adc38b2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3cd8b2) [0x7f989adc38b2]
23 0x7f989a464253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f989a464253]
24 0x7f989a1f4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f989a1f4ac3]
25 0x7f989a285bf4 clone + 68
I0305 19:29:57.030034 119 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
E0305 19:29:57.030114 119 model_repository_manager.cc:563] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/bufferManager.cpp:170)
1 0x7f981281e045 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x36045) [0x7f981281e045]
2 0x7f9812873925 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8b925) [0x7f9812873925]
3 0x7f98128747bf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8c7bf) [0x7f98128747bf]
4 0x7f98128d5545 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xed545) [0x7f98128d5545]
5 0x7f981284ef4e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66f4e) [0x7f981284ef4e]
6 0x7f981283ec0c /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x56c0c) [0x7f981283ec0c]
7 0x7f98128395f5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x515f5) [0x7f98128395f5]
8 0x7f98128374db /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4f4db) [0x7f98128374db]
9 0x7f981281b182 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x33182) [0x7f981281b182]
10 0x7f981281b235 TRITONBACKEND_ModelInstanceInitialize + 101
11 0x7f989ab9aa86 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a4a86) [0x7f989ab9aa86]
12 0x7f989ab9bcc6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a5cc6) [0x7f989ab9bcc6]
13 0x7f989ab7ec15 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188c15) [0x7f989ab7ec15]
14 0x7f989ab7f256 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x189256) [0x7f989ab7f256]
15 0x7f989ab8b27d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19527d) [0x7f989ab8b27d]
16 0x7f989a1f9ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f989a1f9ee8]
17 0x7f989ab7597b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17f97b) [0x7f989ab7597b]
18 0x7f989ab85695 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18f695) [0x7f989ab85695]
19 0x7f989ab8a50b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19450b) [0x7f989ab8a50b]
20 0x7f989ac73610 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x27d610) [0x7f989ac73610]
21 0x7f989ac76d03 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x280d03) [0x7f989ac76d03]
22 0x7f989adc38b2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3cd8b2) [0x7f989adc38b2]
23 0x7f989a464253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f989a464253]
24 0x7f989a1f4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f989a1f4ac3]
25 0x7f989a285bf4 clone + 68;
I0305 19:29:57.030195 119 server.cc:592]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0305 19:29:57.030246 119 server.cc:619]
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.0000 |
| | | 00","default-max-batch-size":"4"}} |
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.0000 |
| | | 00","shm-region-prefix-name":"prefix3_","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
I0305 19:29:57.030338 119 server.cc:662]
+----------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+----------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation no |
| | | t supported (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/bufferManager.cpp:170) |
| | | 1 0x7f981281e045 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x36045) [0x7f981281e045] |
| | | 2 0x7f9812873925 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8b925) [0x7f9812873925] |
| | | 3 0x7f98128747bf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8c7bf) [0x7f98128747bf] |
| | | 4 0x7f98128d5545 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xed545) [0x7f98128d5545] |
| | | 5 0x7f981284ef4e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66f4e) [0x7f981284ef4e] |
| | | 6 0x7f981283ec0c /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x56c0c) [0x7f981283ec0c] |
| | | 7 0x7f98128395f5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x515f5) [0x7f98128395f5] |
| | | 8 0x7f98128374db /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4f4db) [0x7f98128374db] |
| | | 9 0x7f981281b182 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x33182) [0x7f981281b182] |
| | | 10 0x7f981281b235 TRITONBACKEND_ModelInstanceInitialize + 101 |
| | | 11 0x7f989ab9aa86 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a4a86) [0x7f989ab9aa86] |
| | | 12 0x7f989ab9bcc6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a5cc6) [0x7f989ab9bcc6] |
| | | 13 0x7f989ab7ec15 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188c15) [0x7f989ab7ec15] |
| | | 14 0x7f989ab7f256 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x189256) [0x7f989ab7f256] |
| | | 15 0x7f989ab8b27d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19527d) [0x7f989ab8b27d] |
| | | 16 0x7f989a1f9ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f989a1f9ee8] |
| | | 17 0x7f989ab7597b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17f97b) [0x7f989ab7597b] |
| | | 18 0x7f989ab85695 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18f695) [0x7f989ab85695] |
| | | 19 0x7f989ab8a50b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19450b) [0x7f989ab8a50b] |
| | | 20 0x7f989ac73610 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x27d610) [0x7f989ac73610] |
| | | 21 0x7f989ac76d03 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x280d03) [0x7f989ac76d03] |
| | | 22 0x7f989adc38b2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3cd8b2) [0x7f989adc38b2] |
| | | 23 0x7f989a464253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f989a464253] |
| | | 24 0x7f989a1f4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f989a1f4ac3] |
| | | 25 0x7f989a285bf4 clone + 68 |
+----------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0305 19:29:57.082829 119 metrics.cc:817] Collecting metrics for GPU 0: GRID A100DX-80C
I0305 19:29:57.082854 119 metrics.cc:817] Collecting metrics for GPU 1: GRID A100DX-80C
I0305 19:29:57.082861 119 metrics.cc:817] Collecting metrics for GPU 2: GRID A100DX-80C
I0305 19:29:57.082865 119 metrics.cc:817] Collecting metrics for GPU 3: GRID A100DX-80C
I0305 19:29:57.083066 119 metrics.cc:710] Collecting CPU metrics
I0305 19:29:57.083220 119 tritonserver.cc:2458]
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.39.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_d |
| | ata parameters statistics trace logging |
| model_repository_path[0] | /all_models/inflight_batcher_llm |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0305 19:29:57.083232 119 server.cc:293] Waiting for in-flight requests to complete.
I0305 19:29:57.083237 119 server.cc:309] Timeout 30: Found 0 model versions that have in-flight inferences
I0305 19:29:57.083326 119 server.cc:324] All models are stopped, unloading models
I0305 19:29:57.083333 119 server.cc:331] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 4
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 4
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
E0305 19:29:57.373556 116 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/bufferManager.cpp:170)
1 0x7f17be81e045 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x36045) [0x7f17be81e045]
2 0x7f17be873925 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8b925) [0x7f17be873925]
3 0x7f17be8747bf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8c7bf) [0x7f17be8747bf]
4 0x7f17be8d5545 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xed545) [0x7f17be8d5545]
5 0x7f17be84ef4e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66f4e) [0x7f17be84ef4e]
6 0x7f17be83ec0c /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x56c0c) [0x7f17be83ec0c]
7 0x7f17be8395f5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x515f5) [0x7f17be8395f5]
8 0x7f17be8374db /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4f4db) [0x7f17be8374db]
9 0x7f17be81b182 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x33182) [0x7f17be81b182]
10 0x7f17be81b235 TRITONBACKEND_ModelInstanceInitialize + 101
11 0x7f183559aa86 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a4a86) [0x7f183559aa86]
12 0x7f183559bcc6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a5cc6) [0x7f183559bcc6]
13 0x7f183557ec15 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188c15) [0x7f183557ec15]
14 0x7f183557f256 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x189256) [0x7f183557f256]
15 0x7f183558b27d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19527d) [0x7f183558b27d]
16 0x7f1834bf9ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f1834bf9ee8]
17 0x7f183557597b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17f97b) [0x7f183557597b]
18 0x7f1835585695 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18f695) [0x7f1835585695]
19 0x7f183558a50b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19450b) [0x7f183558a50b]
20 0x7f1835673610 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x27d610) [0x7f1835673610]
21 0x7f1835676d03 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x280d03) [0x7f1835676d03]
22 0x7f18357c38b2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3cd8b2) [0x7f18357c38b2]
23 0x7f1834e64253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f1834e64253]
24 0x7f1834bf4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f1834bf4ac3]
25 0x7f1834c85bf4 clone + 68
E0305 19:29:57.373760 116 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/bufferManager.cpp:170)
1 0x7f17be81e045 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x36045) [0x7f17be81e045]
2 0x7f17be873925 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8b925) [0x7f17be873925]
3 0x7f17be8747bf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8c7bf) [0x7f17be8747bf]
4 0x7f17be8d5545 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xed545) [0x7f17be8d5545]
5 0x7f17be84ef4e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66f4e) [0x7f17be84ef4e]
6 0x7f17be83ec0c /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x56c0c) [0x7f17be83ec0c]
7 0x7f17be8395f5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x515f5) [0x7f17be8395f5]
8 0x7f17be8374db /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4f4db) [0x7f17be8374db]
9 0x7f17be81b182 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x33182) [0x7f17be81b182]
10 0x7f17be81b235 TRITONBACKEND_ModelInstanceInitialize + 101
11 0x7f183559aa86 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a4a86) [0x7f183559aa86]
12 0x7f183559bcc6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a5cc6) [0x7f183559bcc6]
13 0x7f183557ec15 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188c15) [0x7f183557ec15]
14 0x7f183557f256 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x189256) [0x7f183557f256]
15 0x7f183558b27d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19527d) [0x7f183558b27d]
16 0x7f1834bf9ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f1834bf9ee8]
17 0x7f183557597b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17f97b) [0x7f183557597b]
18 0x7f1835585695 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18f695) [0x7f1835585695]
19 0x7f183558a50b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19450b) [0x7f183558a50b]
20 0x7f1835673610 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x27d610) [0x7f1835673610]
21 0x7f1835676d03 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x280d03) [0x7f1835676d03]
22 0x7f18357c38b2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3cd8b2) [0x7f18357c38b2]
23 0x7f1834e64253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f1834e64253]
24 0x7f1834bf4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f1834bf4ac3]
25 0x7f1834c85bf4 clone + 68
I0305 19:29:57.373813 116 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
E0305 19:29:57.373955 116 model_repository_manager.cc:563] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/bufferManager.cpp:170)
1 0x7f17be81e045 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x36045) [0x7f17be81e045]
2 0x7f17be873925 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8b925) [0x7f17be873925]
3 0x7f17be8747bf /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x8c7bf) [0x7f17be8747bf]
4 0x7f17be8d5545 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xed545) [0x7f17be8d5545]
5 0x7f17be84ef4e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66f4e) [0x7f17be84ef4e]
6 0x7f17be83ec0c /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x56c0c) [0x7f17be83ec0c]
7 0x7f17be8395f5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x515f5) [0x7f17be8395f5]
8 0x7f17be8374db /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4f4db) [0x7f17be8374db]
9 0x7f17be81b182 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x33182) [0x7f17be81b182]
10 0x7f17be81b235 TRITONBACKEND_ModelInstanceInitialize + 101
11 0x7f183559aa86 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a4a86) [0x7f183559aa86]
12 0x7f183559bcc6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a5cc6) [0x7f183559bcc6]
13 0x7f183557ec15 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188c15) [0x7f183557ec15]
14 0x7f183557f256 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x189256) [0x7f183557f256]
15 0x7f183558b27d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19527d) [0x7f183558b27d]
16 0x7f1834bf9ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f1834bf9ee8]
17 0x7f183557597b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17f97b) [0x7f183557597b]
18 0x7f1835585695 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18f695) [0x7f1835585695]
19 0x7f183558a50b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19450b) [0x7f183558a50b]
20 0x7f1835673610 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x27d610) [0x7f1835673610]
21 0x7f1835676d03 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x280d03) [0x7f1835676d03]
22 0x7f18357c38b2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3cd8b2) [0x7f18357c38b2]
23 0x7f1834e64253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f1834e64253]
24 0x7f1834bf4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f1834bf4ac3]
25 0x7f1834c85bf4 clone + 68;
I0305 19:29:57.374069 116 server.cc:592]
...
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.39.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_d |
| | ata parameters statistics trace logging |
| model_repository_path[0] | /all_models/inflight_batcher_llm |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0305 19:29:58.617065 118 server.cc:293] Waiting for in-flight requests to complete.
I0305 19:29:58.617072 118 server.cc:309] Timeout 30: Found 0 model versions that have in-flight inferences
I0305 19:29:58.617154 118 server.cc:324] All models are stopped, unloading models
I0305 19:29:58.617163 118 server.cc:331] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
I0305 19:29:58.639565 119 model_lifecycle.cc:603] successfully unloaded 'preprocessing' version 1
I0305 19:29:58.732184 116 model_lifecycle.cc:603] successfully unloaded 'postprocessing' version 1
I0305 19:29:58.736228 117 model_lifecycle.cc:603] successfully unloaded 'postprocessing' version 1
I0305 19:29:58.915008 117 model_lifecycle.cc:603] successfully unloaded 'preprocessing' version 1
I0305 19:29:58.922722 116 model_lifecycle.cc:603] successfully unloaded 'preprocessing' version 1
I0305 19:29:59.083530 119 server.cc:331] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
W0305 19:29:59.089428 119 metrics.cc:582] Unable to get power limit for GPU 0. Status:Success, value:0.000000
W0305 19:29:59.089462 119 metrics.cc:600] Unable to get power usage for GPU 0. Status:Success, value:0.000000
W0305 19:29:59.089469 119 metrics.cc:624] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0305 19:29:59.089479 119 metrics.cc:582] Unable to get power limit for GPU 1. Status:Success, value:0.000000
W0305 19:29:59.089485 119 metrics.cc:600] Unable to get power usage for GPU 1. Status:Success, value:0.000000
W0305 19:29:59.089490 119 metrics.cc:624] Unable to get energy consumption for GPU 1. Status:Success, value:0
W0305 19:29:59.089497 119 metrics.cc:582] Unable to get power limit for GPU 2. Status:Success, value:0.000000
W0305 19:29:59.089502 119 metrics.cc:600] Unable to get power usage for GPU 2. Status:Success, value:0.000000
W0305 19:29:59.089506 119 metrics.cc:624] Unable to get energy consumption for GPU 2. Status:Success, value:0
W0305 19:29:59.089513 119 metrics.cc:582] Unable to get power limit for GPU 3. Status:Success, value:0.000000
W0305 19:29:59.089518 119 metrics.cc:600] Unable to get power usage for GPU 3. Status:Success, value:0.000000
W0305 19:29:59.089523 119 metrics.cc:624] Unable to get energy consumption for GPU 3. Status:Success, value:0
error: creating server: Internal - failed to load all models
I0305 19:29:59.427537 116 server.cc:331] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
W0305 19:29:59.433040 116 metrics.cc:582] Unable to get power limit for GPU 0. Status:Success, value:0.000000
W0305 19:29:59.433087 116 metrics.cc:600] Unable to get power usage for GPU 0. Status:Success, value:0.000000
W0305 19:29:59.433095 116 metrics.cc:624] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0305 19:29:59.433105 116 metrics.cc:582] Unable to get power limit for GPU 1. Status:Success, value:0.000000
W0305 19:29:59.433115 116 metrics.cc:600] Unable to get power usage for GPU 1. Status:Success, value:0.000000
W0305 19:29:59.433121 116 metrics.cc:624] Unable to get energy consumption for GPU 1. Status:Success, value:0
W0305 19:29:59.433129 116 metrics.cc:582] Unable to get power limit for GPU 2. Status:Success, value:0.000000
W0305 19:29:59.433135 116 metrics.cc:600] Unable to get power usage for GPU 2. Status:Success, value:0.000000
W0305 19:29:59.433141 116 metrics.cc:624] Unable to get energy consumption for GPU 2. Status:Success, value:0
W0305 19:29:59.433148 116 metrics.cc:582] Unable to get power limit for GPU 3. Status:Success, value:0.000000
W0305 19:29:59.433154 116 metrics.cc:600] Unable to get power usage for GPU 3. Status:Success, value:0.000000
W0305 19:29:59.433161 116 metrics.cc:624] Unable to get energy consumption for GPU 3. Status:Success, value:0
I0305 19:29:59.470397 117 server.cc:331] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
error: creating server: Internal - failed to load all models
W0305 19:29:59.524316 117 metrics.cc:582] Unable to get power limit for GPU 0. Status:Success, value:0.000000
W0305 19:29:59.524353 117 metrics.cc:600] Unable to get power usage for GPU 0. Status:Success, value:0.000000
W0305 19:29:59.524357 117 metrics.cc:624] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0305 19:29:59.524364 117 metrics.cc:582] Unable to get power limit for GPU 1. Status:Success, value:0.000000
W0305 19:29:59.524369 117 metrics.cc:600] Unable to get power usage for GPU 1. Status:Success, value:0.000000
W0305 19:29:59.524372 117 metrics.cc:624] Unable to get energy consumption for GPU 1. Status:Success, value:0
W0305 19:29:59.524378 117 metrics.cc:582] Unable to get power limit for GPU 2. Status:Success, value:0.000000
W0305 19:29:59.524380 117 metrics.cc:600] Unable to get power usage for GPU 2. Status:Success, value:0.000000
W0305 19:29:59.524383 117 metrics.cc:624] Unable to get energy consumption for GPU 2. Status:Success, value:0
W0305 19:29:59.524390 117 metrics.cc:582] Unable to get power limit for GPU 3. Status:Success, value:0.000000
W0305 19:29:59.524394 117 metrics.cc:600] Unable to get power usage for GPU 3. Status:Success, value:0.000000
W0305 19:29:59.524396 117 metrics.cc:624] Unable to get energy consumption for GPU 3. Status:Success, value:0
I0305 19:29:59.617248 118 server.cc:331] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
Cleaning up...
Cleaning up...
W0305 19:29:59.625442 118 metrics.cc:582] Unable to get power limit for GPU 0. Status:Success, value:0.000000
W0305 19:29:59.625467 118 metrics.cc:600] Unable to get power usage for GPU 0. Status:Success, value:0.000000
W0305 19:29:59.625472 118 metrics.cc:624] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0305 19:29:59.625481 118 metrics.cc:582] Unable to get power limit for GPU 1. Status:Success, value:0.000000
W0305 19:29:59.625485 118 metrics.cc:600] Unable to get power usage for GPU 1. Status:Success, value:0.000000
W0305 19:29:59.625488 118 metrics.cc:624] Unable to get energy consumption for GPU 1. Status:Success, value:0
W0305 19:29:59.625492 118 metrics.cc:582] Unable to get power limit for GPU 2. Status:Success, value:0.000000
W0305 19:29:59.625496 118 metrics.cc:600] Unable to get power usage for GPU 2. Status:Success, value:0.000000
W0305 19:29:59.625499 118 metrics.cc:624] Unable to get energy consumption for GPU 2. Status:Success, value:0
W0305 19:29:59.625503 118 metrics.cc:582] Unable to get power limit for GPU 3. Status:Success, value:0.000000
W0305 19:29:59.625506 118 metrics.cc:600] Unable to get power usage for GPU 3. Status:Success, value:0.000000
W0305 19:29:59.625509 118 metrics.cc:624] Unable to get energy consumption for GPU 3. Status:Success, value:0
I0305 19:29:59.954424 118 model_lifecycle.cc:603] successfully unloaded 'postprocessing' version 1
W0305 19:30:00.094796 119 metrics.cc:582] Unable to get power limit for GPU 0. Status:Success, value:0.000000
W0305 19:30:00.094820 119 metrics.cc:600] Unable to get power usage for GPU 0. Status:Success, value:0.000000
W0305 19:30:00.094824 119 metrics.cc:624] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0305 19:30:00.094829 119 metrics.cc:582] Unable to get power limit for GPU 1. Status:Success, value:0.000000
W0305 19:30:00.094832 119 metrics.cc:600] Unable to get power usage for GPU 1. Status:Success, value:0.000000
W0305 19:30:00.094834 119 metrics.cc:624] Unable to get energy consumption for GPU 1. Status:Success, value:0
W0305 19:30:00.094840 119 metrics.cc:582] Unable to get power limit for GPU 2. Status:Success, value:0.000000
W0305 19:30:00.094843 119 metrics.cc:600] Unable to get power usage for GPU 2. Status:Success, value:0.000000
W0305 19:30:00.094846 119 metrics.cc:624] Unable to get energy consumption for GPU 2. Status:Success, value:0
W0305 19:30:00.094849 119 metrics.cc:582] Unable to get power limit for GPU 3. Status:Success, value:0.000000
W0305 19:30:00.094852 119 metrics.cc:600] Unable to get power usage for GPU 3. Status:Success, value:0.000000
W0305 19:30:00.094855 119 metrics.cc:624] Unable to get energy consumption for GPU 3. Status:Success, value:0
I0305 19:30:00.368031 118 model_lifecycle.cc:603] successfully unloaded 'preprocessing' version 1
W0305 19:30:00.438892 116 metrics.cc:582] Unable to get power limit for GPU 0. Status:Success, value:0.000000
W0305 19:30:00.438937 116 metrics.cc:600] Unable to get power usage for GPU 0. Status:Success, value:0.000000
W0305 19:30:00.438941 116 metrics.cc:624] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0305 19:30:00.438946 116 metrics.cc:582] Unable to get power limit for GPU 1. Status:Success, value:0.000000
W0305 19:30:00.438949 116 metrics.cc:600] Unable to get power usage for GPU 1. Status:Success, value:0.000000
W0305 19:30:00.438953 116 metrics.cc:624] Unable to get energy consumption for GPU 1. Status:Success, value:0
W0305 19:30:00.438959 116 metrics.cc:582] Unable to get power limit for GPU 2. Status:Success, value:0.000000
W0305 19:30:00.438965 116 metrics.cc:600] Unable to get power usage for GPU 2. Status:Success, value:0.000000
W0305 19:30:00.438968 116 metrics.cc:624] Unable to get energy consumption for GPU 2. Status:Success, value:0
W0305 19:30:00.438973 116 metrics.cc:582] Unable to get power limit for GPU 3. Status:Success, value:0.000000
W0305 19:30:00.438976 116 metrics.cc:600] Unable to get power usage for GPU 3. Status:Success, value:0.000000
W0305 19:30:00.438979 116 metrics.cc:624] Unable to get energy consumption for GPU 3. Status:Success, value:0
W0305 19:30:00.529680 117 metrics.cc:582] Unable to get power limit for GPU 0. Status:Success, value:0.000000
W0305 19:30:00.529750 117 metrics.cc:600] Unable to get power usage for GPU 0. Status:Success, value:0.000000
W0305 19:30:00.529754 117 metrics.cc:624] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0305 19:30:00.529762 117 metrics.cc:582] Unable to get power limit for GPU 1. Status:Success, value:0.000000
W0305 19:30:00.529765 117 metrics.cc:600] Unable to get power usage for GPU 1. Status:Success, value:0.000000
W0305 19:30:00.529769 117 metrics.cc:624] Unable to get energy consumption for GPU 1. Status:Success, value:0
W0305 19:30:00.529774 117 metrics.cc:582] Unable to get power limit for GPU 2. Status:Success, value:0.000000
W0305 19:30:00.529778 117 metrics.cc:600] Unable to get power usage for GPU 2. Status:Success, value:0.000000
W0305 19:30:00.529785 117 metrics.cc:624] Unable to get energy consumption for GPU 2. Status:Success, value:0
W0305 19:30:00.529790 117 metrics.cc:582] Unable to get power limit for GPU 3. Status:Success, value:0.000000
W0305 19:30:00.529792 117 metrics.cc:600] Unable to get power usage for GPU 3. Status:Success, value:0.000000
W0305 19:30:00.529795 117 metrics.cc:624] Unable to get energy consumption for GPU 3. Status:Success, value:0
I0305 19:30:00.617350 118 server.cc:331] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
W0305 19:30:00.625876 118 metrics.cc:582] Unable to get power limit for GPU 0. Status:Success, value:0.000000
W0305 19:30:00.625898 118 metrics.cc:600] Unable to get power usage for GPU 0. Status:Success, value:0.000000
W0305 19:30:00.625902 118 metrics.cc:624] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0305 19:30:00.625906 118 metrics.cc:582] Unable to get power limit for GPU 1. Status:Success, value:0.000000
W0305 19:30:00.625908 118 metrics.cc:600] Unable to get power usage for GPU 1. Status:Success, value:0.000000
W0305 19:30:00.625910 118 metrics.cc:624] Unable to get energy consumption for GPU 1. Status:Success, value:0
W0305 19:30:00.625915 118 metrics.cc:582] Unable to get power limit for GPU 2. Status:Success, value:0.000000
W0305 19:30:00.625918 118 metrics.cc:600] Unable to get power usage for GPU 2. Status:Success, value:0.000000
W0305 19:30:00.625921 118 metrics.cc:624] Unable to get energy consumption for GPU 2. Status:Success, value:0
W0305 19:30:00.625930 118 metrics.cc:582] Unable to get power limit for GPU 3. Status:Success, value:0.000000
W0305 19:30:00.625932 118 metrics.cc:600] Unable to get power usage for GPU 3. Status:Success, value:0.000000
W0305 19:30:00.625936 118 metrics.cc:624] Unable to get energy consumption for GPU 3. Status:Success, value:0
error: creating server: Internal - failed to load all models
W0305 19:30:01.631565 118 metrics.cc:582] Unable to get power limit for GPU 0. Status:Success, value:0.000000
W0305 19:30:01.631621 118 metrics.cc:600] Unable to get power usage for GPU 0. Status:Success, value:0.000000
W0305 19:30:01.631625 118 metrics.cc:624] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0305 19:30:01.631632 118 metrics.cc:582] Unable to get power limit for GPU 1. Status:Success, value:0.000000
W0305 19:30:01.631635 118 metrics.cc:600] Unable to get power usage for GPU 1. Status:Success, value:0.000000
W0305 19:30:01.631639 118 metrics.cc:624] Unable to get energy consumption for GPU 1. Status:Success, value:0
W0305 19:30:01.631643 118 metrics.cc:582] Unable to get power limit for GPU 2. Status:Success, value:0.000000
W0305 19:30:01.631647 118 metrics.cc:600] Unable to get power usage for GPU 2. Status:Success, value:0.000000
W0305 19:30:01.631649 118 metrics.cc:624] Unable to get energy consumption for GPU 2. Status:Success, value:0
W0305 19:30:01.631653 118 metrics.cc:582] Unable to get power limit for GPU 3. Status:Success, value:0.000000
W0305 19:30:01.631657 118 metrics.cc:600] Unable to get power usage for GPU 3. Status:Success, value:0.000000
W0305 19:30:01.631660 118 metrics.cc:624] Unable to get energy consumption for GPU 3. Status:Success, value:0
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[20086,1],3]
Exit code: 1
--------------------------------------------------------------------------
Running into the exact same issue.
In our case it was solved by setting vGPU plugin parameters in VMware: https://docs.nvidia.com/grid/13.0/grid-vgpu-user-guide/index.html#setting-vgpu-plugin-parameters-on-vmware-vsphere (see also https://kb.vmware.com/s/article/2142307).
We're seeing this on Azure with A10 GPUs. Does MIG prevent using this CUDA function?
Facing similar issues as well, using an L40 vGPU (l40_48c profile) on an Ubuntu 22.04 VM running on VMware.
In our case, it was solved as well by setting the right VM params: https://docs.nvidia.com/grid/13.0/grid-vgpu-user-guide/index.html#setting-vgpu-plugin-parameters-on-vmware-vsphere
May I know which VM params you changed to fix this? I've set pciPassthru.use64bitMMIO to TRUE and pciPassthru.64bitMMIOSizeGB to 128, and it still didn't work.
Same issue on A40 using vGPU on ESXi 7. Can someone let us know which parameter fixes it? The two mentioned by @yxchia98 are not enough. @vhojan @tobernat
I guess you need to set enable_uvm to 1? (edit: worked for me)
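For reference, the parameters mentioned in this thread are set as advanced configuration options on the VM. A sketch of the resulting entries (per the vGPU user guide linked above, plugin parameters take the form pciPassthru<n>.cfg.<name>; the device index 0 is an assumption and may differ in your setup):

pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "128"
pciPassthru0.cfg.enable_uvm = "1"

The first two address 64-bit MMIO sizing for large-memory GPUs; enable_uvm is the vGPU plugin parameter that made cudaDeviceGetDefaultMemPool work for me.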
Setting enable_uvm to 1 wasn't sufficient in our case. Does anyone have another solution?
Setting enable_uvm to 1 wasn't sufficient in my case either.