TritonModelException: Failed to open the cudaIpcHandle when calling a PyTorch model on WSL2 Ubuntu 22.04
Hello everyone! I am sending inference requests to a PyTorch model from a Python backend model (BLS), and I hit the error TritonModelException: Failed to open the cudaIpcHandle. I also tried Ubuntu 20.04 LTS and the same error occurs. The environment:
- GPU: GeForce RTX 3060 (single GPU)
- CUDA: 12.0
- container: nvcr.io/nvidia/tritonserver:23.07-py3
- Ubuntu 22.04 LTS / Ubuntu 20.04 LTS (both have been tried)
PyTorch model config file (fc_model_pt)
name: "fc_model_pt"
platform: "pytorch_libtorch"
max_batch_size : 64
input [
{
name: "input__0"
data_type: TYPE_INT64
dims: [ -1 ]
}
]
output [
{
name: "output__0" # 命名规范同输入
data_type: TYPE_FP32
dims: [ -1, -1, 4 ]
},
{
name: "output__1"
data_type: TYPE_FP32
dims: [ -1, -1, 8 ]
}
]
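(For context, the thread does not show how fc_model_pt was exported. A minimal sketch of a TorchScript model matching this config might look like the following; the architecture, layer sizes, and file path are assumptions for illustration, not the actual model from the thread.)
import torch
import torch.nn as nn

class FCModel(nn.Module):
    """Toy model: int64 token ids [batch, seq] -> two float32 heads."""
    def __init__(self, vocab_size=128, hidden=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head0 = nn.Linear(hidden, 4)   # output__0: [batch, seq, 4]
        self.head1 = nn.Linear(hidden, 8)   # output__1: [batch, seq, 8]

    def forward(self, x):
        h = self.embed(x)
        return self.head0(h), self.head1(h)

model = FCModel().cuda().eval()
example = torch.tensor([[1, 2]], dtype=torch.int64, device="cuda")
traced = torch.jit.trace(model, example)
traced.save("model_repository/fc_model_pt/1/model.pt")  # repository path assumed, directory must exist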
Python backend model file (model.py)
import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        # You must parse model_config. JSON string is not parsed here
        self.model_config = model_config = json.loads(args['model_config'])

        # Get output__0 configuration
        output0_config = pb_utils.get_output_config_by_name(
            model_config, "output__0")
        # Get output__1 configuration
        output1_config = pb_utils.get_output_config_by_name(
            model_config, "output__1")

        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(output0_config['data_type'])
        self.output1_dtype = pb_utils.triton_string_to_numpy(output1_config['data_type'])

    def execute(self, requests):
        """
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        output0_dtype = self.output0_dtype
        output1_dtype = self.output1_dtype
        responses = []
        # Every Python backend must iterate over every one of the requests
        # and create a pb_utils.InferenceResponse for each of them.
        for request in requests:
            in_0 = pb_utils.get_input_tensor_by_name(request, "input__0")

            # Fake data for output__0 (fixed values for convenience)
            out_0 = np.array([1, 2, 3, 4, 5, 6, 7, 8])
            out_tensor_0 = pb_utils.Tensor("output__0", out_0.astype(output0_dtype))

            # output__1 is obtained from fc_model_pt via a BLS request
            inference_request = pb_utils.InferenceRequest(
                model_name='fc_model_pt',
                requested_output_names=['output__0', 'output__1'],
                inputs=[in_0])
inference_response = inference_request.exec() # the bug is reported in this line
out_tensor_1 = pb_utils.get_output_tensor_by_name(inference_response, 'output__1')
inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor_0, out_tensor_1],
error=pb_utils.TritonError(' Inference Request occur error'))
responses.append(inference_response)
return responses
    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')
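(Not part of the thread's code: if a BLS output tensor comes back on the GPU rather than the CPU, it can be copied to host memory via DLPack before being re-wrapped for the final response. The sketch below assumes PyTorch is installed in the Python backend environment; the helper name and usage are hypothetical. Note that the exception reported in this thread is raised inside exec() itself, so this is general hardening rather than a fix for the cudaIpcHandle error.)
import torch
import triton_python_backend_utils as pb_utils

def as_cpu_tensor(bls_tensor, name):
    """Re-wrap a BLS output tensor so the final response tensor lives on the CPU."""
    if bls_tensor.is_cpu():
        return bls_tensor
    # GPU-resident output: copy to host through DLPack, then build a CPU pb_utils.Tensor
    host = torch.utils.dlpack.from_dlpack(bls_tensor.to_dlpack()).cpu().numpy()
    return pb_utils.Tensor(name, host)

# usage inside execute():
#   out_tensor_1 = as_cpu_tensor(
#       pb_utils.get_output_tensor_by_name(inference_response, 'output__1'), 'output__1')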
Python backend config file (custom_model)
name: "custom_model"
backend: "python"
input [
{
name: "input__0"
data_type: TYPE_INT64
dims: [ -1, -1 ]
}
]
output [
{
name: "output__0"
data_type: TYPE_FP32
dims: [ -1, -1, 4 ]
},
{
name: "output__1"
data_type: TYPE_FP32
dims: [ -1, -1, 8 ]
}
]
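(For completeness, the model repository layout implied by the two configs, the error path /models/custom_model/1/model.py, and the -v $PWD/model_repository:/models mount later in the thread would be roughly the following; the model.pt file name for the libtorch model is assumed.)
model_repository/
├── custom_model/
│   ├── config.pbtxt
│   └── 1/
│       └── model.py
└── fc_model_pt/
    ├── config.pbtxt
    └── 1/
        └── model.pt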
The client file
import requests

if __name__ == "__main__":
    request_data = {
        "inputs": [{
            "name": "input__0",
            "shape": [1, 2],
            "datatype": "INT64",
            "data": [[1, 2]]
        }],
        "outputs": [{"name": "output__0"}, {"name": "output__1"}]
    }
    res = requests.post(url="http://localhost:8000/v2/models/fc_model_pt/versions/1/infer", json=request_data)
    print(res)       # this line gets the result <Response [200]>
    res = requests.post(url="http://localhost:8000/v2/models/custom_model/versions/1/infer", json=request_data)
    print(res)       # this line gets <Response [400]>
    print(res.text)  # {"error":"Failed to process the request(s) for model instance 'custom_model', message: TritonModelException: Failed to open the cudaIpcHandle. error: unknown error\n\nAt:\n /models/custom_model/1/model.py(96): execute\n"}
Another client file (using the tritonclient library)
import sys

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import *


def main():
    verbose = False
    url = 'localhost:8000'
    request_count = 1000
    model_name = 'fc_model_pt'
    model_name1 = 'custom_model'
    try:
        # Need to specify large enough concurrency to issue all the
        # inference requests to the server in parallel.
        triton_client = httpclient.InferenceServerClient(
            url=url, verbose=verbose)
        # triton_client1 = httpclient.InferenceServerClient(
        #     url=url, verbose=verbose, concurrency=request_count, max_greenlets=None)
    except Exception as e:
        print("context creation failed: " + str(e))
        sys.exit()

    input0_data = np.array([[1, 2]]).astype(np.int64)
    inputs = [httpclient.InferInput(name='input__0',
                                    shape=[1, 2],
                                    datatype="INT64")]
    inputs[0].set_data_from_numpy(input0_data)

    outputs = []
    outputs.append(httpclient.InferRequestedOutput(name='output__0',
                                                   binary_data=True))
    outputs.append(httpclient.InferRequestedOutput(name='output__1',
                                                   binary_data=True))

    response = triton_client.infer(model_name=model_name1,
                                   inputs=inputs,
                                   outputs=outputs)
    # print(response.get_output('output__0'))
    output__0 = response.as_numpy('output__0')
    output__1 = response.as_numpy('output__1')
    print('output 0 :', output__0)
    print('output 1', output__1)
    triton_client.close()
    return


if __name__ == "__main__":
    main()
Bug
tritonclient.utils.InferenceServerException: [400] Failed to process the request(s) for model instance 'custom_model', message: TritonModelException: Failed to open the cudaIpcHandle. error: unknown error
I suspect this bug is related to the environment, but I don't know the concrete reason. Any help would be appreciated, thanks a lot.
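(Aside: since the failure is in opening a CUDA IPC handle, one way to check whether CUDA IPC works at all inside the WSL2 container, independent of Triton, is to share a CUDA tensor between two processes with PyTorch, which uses cudaIpc memory handles under the hood. This is a diagnostic sketch and not something that was run in the thread; it assumes PyTorch with CUDA is installed in the container, and if CUDA IPC is not functional it typically fails with a similar IPC error.)
import torch
import torch.multiprocessing as mp

def consumer(q):
    t = q.get()                       # receiving the CUDA tensor opens a cudaIpc handle
    print("consumer got:", t.cpu())

if __name__ == "__main__":
    mp.set_start_method("spawn")
    q = mp.Queue()
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    t = torch.arange(4, device="cuda", dtype=torch.float32)
    q.put(t)                          # sharing a CUDA tensor across processes uses CUDA IPC
    p.join()                          # keep `t` alive until the consumer has finished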
Hi @youwan114, can you share the output of nvidia-smi to check the CUDA driver version? I remember we have seen a similar issue before, and upgrading the CUDA driver version helped.
Hello @krishung5, sure, I am very glad that you commented on my issue. This is my output of nvidia-smi:
I am grateful for your attention. Would you mind telling me how you solved this issue?
This is my tritonserver log with the command tritonserver --log-verbose=1 ...:
I0823 03:10:20.056693 14719 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
I0823 03:10:33.767570 14719 http_server.cc:3452] HTTP request: 2 /v2/models/custom_model/infer
I0823 03:10:33.767715 14719 infer_request.cc:751] [request id: 1] prepared: [0x0x7f5d2c002fb0] request id: 1, model: custom_model, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f5d2c002a78] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [1,2]
override inputs:
inputs:
[0x0x7f5d2c002a78] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [1,2]
original requested outputs:
output__0
output__1
requested outputs:
output__0
output__1
I0823 03:10:33.767812 14719 python_be.cc:1263] model custom_model, instance custom_model, executing 1 requests
I0823 03:10:33.788719 14719 infer_request.cc:751] [request id: 2] prepared: [0x0x7f5f180015e0] request id: 2, model: fc_model_pt, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f5f18001b18] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
override inputs:
inputs:
[0x0x7f5f18001b18] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
original requested outputs:
output__0
output__1
requested outputs:
output__0
output__1
I0823 03:10:33.788848 14719 libtorch.cc:2666] model fc_model_pt, instance fc_model_pt, executing 1 requests
I0823 03:10:33.788875 14719 libtorch.cc:1224] TRITONBACKEND_ModelExecute: Running fc_model_pt with 1 requests
I0823 03:10:33.788986 14719 pinned_memory_manager.cc:162] pinned memory allocation: size 16, addr 0x205000090
I0823 03:10:35.424447 14719 infer_response.cc:167] add response output: output: output__0, type: FP32, shape: [1,2,4]
I0823 03:10:35.424560 14719 infer_response.cc:167] add response output: output: output__1, type: FP32, shape: [1,2,8]
I0823 03:10:35.425885 14719 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x205000090
I0823 03:10:35.635865 14719 python_be.cc:2230] TRITONBACKEND_ModelInstanceExecute: model instance name custom_model released 1 requests
I also tried better hardware, and the same result as above is obtained. In my opinion, the server side calls the PyTorch model (fc_model_pt) successfully but cannot return the results. I am looking forward to any reply!
Continuing to test, I tried the CPU version of the PyTorch model (fc_model_pt), and the results are returned correctly, so I do think the problem is located in CUDA.
For further information:
- CPU info:
I0823 09:45:06.263810 72295 http_server.cc:3372] HTTP request: 2 /v2/models/custom_model/infer
I0823 09:45:06.263942 72295 infer_request.cc:729] [request id: 1] prepared: [0x0x7f211c009050] request id: 1, model: custom_model, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f211c008f78] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [1,2]
override inputs:
inputs:
[0x0x7f211c008f78] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [1,2]
original requested outputs:
output__0
output__1
requested outputs:
output__0
output__1
I0823 09:45:06.264078 72295 python_be.cc:1094] model custom_model, instance custom_model_0, executing 1 requests
hello world
input data <c_python_backend_utils.Tensor object at 0x7f107ea66ef0> <class 'c_python_backend_utils.Tensor'>
there is data
I0823 09:45:06.264829 72295 infer_request.cc:729] [request id: 2] prepared: [0x0x7f24300013f0] request id: 2, model: fc_model_pt, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f24300016b8] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
override inputs:
inputs:
[0x0x7f24300016b8] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
original requested outputs:
output__0
output__1
requested outputs:
output__0
output__1
I0823 09:45:06.264912 72295 libtorch.cc:2129] model fc_model_pt, instance fc_model_pt_0, executing 1 requests
I0823 09:45:06.264932 72295 libtorch.cc:988] TRITONBACKEND_ModelExecute: Running fc_model_pt_0 with 1 requests
I0823 09:45:06.265569 72295 infer_response.cc:167] add response output: output: output__0, type: FP32, shape: [1,2,4]
I0823 09:45:06.265598 72295 infer_response.cc:167] add response output: output: output__1, type: FP32, shape: [1,2,8]
I0823 09:45:06.265996 72295 infer_response.cc:167] add response output: output: output__0, type: FP32, shape: [8]
I0823 09:45:06.266024 72295 http_server.cc:1118] HTTP using buffer for: 'output__0', size: 32, addr: 0x7f231c005450
I0823 09:45:06.266040 72295 infer_response.cc:167] add response output: output: output__1, type: FP32, shape: [1,2,8]
I0823 09:45:06.266054 72295 http_server.cc:1118] HTTP using buffer for: 'output__1', size: 64, addr: 0x7f231c005e50
I0823 09:45:06.266089 72295 http_server.cc:1192] HTTP release: size 32, addr 0x7f231c005450
I0823 09:45:06.266103 72295 http_server.cc:1192] HTTP release: size 64, addr 0x7f231c005e50
I0823 09:45:06.266132 72295 python_be.cc:1980] TRITONBACKEND_ModelInstanceExecute: model instance name custom_model_0 released 1 requests
- GPU info:
I0823 09:48:50.309827 76851 http_server.cc:3372] HTTP request: 2 /v2/models/custom_model/infer
I0823 09:48:50.309920 76851 infer_request.cc:729] [request id: 1] prepared: [0x0x7f835c0047d0] request id: 1, model: custom_model, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f835c004298] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [1,2]
override inputs:
inputs:
[0x0x7f835c004298] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [1,2]
original requested outputs:
output__0
output__1
requested outputs:
output__0
output__1
I0823 09:48:50.310087 76851 python_be.cc:1094] model custom_model, instance custom_model_0, executing 1 requests
hello world
input data <c_python_backend_utils.Tensor object at 0x7fdd74a8f830> <class 'c_python_backend_utils.Tensor'>
there is data
I0823 09:48:50.356416 76851 infer_request.cc:729] [request id: 2] prepared: [0x0x7f88180013f0] request id: 2, model: fc_model_pt, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f88180016b8] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
override inputs:
inputs:
[0x0x7f88180016b8] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
original requested outputs:
output__0
output__1
requested outputs:
output__0
output__1
I0823 09:48:50.356578 76851 libtorch.cc:2129] model fc_model_pt, instance fc_model_pt_0, executing 1 requests
I0823 09:48:50.356608 76851 libtorch.cc:988] TRITONBACKEND_ModelExecute: Running fc_model_pt_0 with 1 requests
I0823 09:48:50.356763 76851 pinned_memory_manager.cc:161] pinned memory allocation: size 16, addr 0x204e00090
I0823 09:48:53.166461 76851 infer_response.cc:167] add response output: output: output__0, type: FP32, shape: [1,2,4]
I0823 09:48:53.166598 76851 infer_response.cc:167] add response output: output: output__1, type: FP32, shape: [1,2,8]
I0823 09:48:53.166788 76851 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x204e00090
I0823 09:48:53.333064 76851 python_be.cc:1980] TRITONBACKEND_ModelInstanceExecute: model instance name custom_model_0 released 1 requests
Thanks for providing the logs. Can you share the docker run ... command that you use for running the container? I wonder if adding the --pid host flag to the command helps.
Hello @krishung5, thank you for your reply. This is my command to run the Docker container, with which the above results were obtained.
docker run --gpus all -itd --pid=host --net bridge --name triton-serve --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $PWD/model_repository:/models nvcr.io/nvidia/tritonserver:23.07-py3
In addition, I continued to test the PyTorch model (fc_model_pt) directly. The following are the logs:
I0824 05:50:11.696933 5921 http_server.cc:3452] HTTP request: 2 /v2/models/fc_model_pt/versions/1/infer
I0824 05:50:11.697026 5921 infer_request.cc:751] [request id: <id_unknown>] prepared: [0x0x7f0850007570] request id: , model: fc_model_pt, requested version: 1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f0850002a48] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
override inputs:
inputs:
[0x0x7f0850002a48] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
original requested outputs:
output__0
output__1
requested outputs:
output__0
output__1
I0824 05:50:11.697385 5921 libtorch.cc:2666] model fc_model_pt, instance fc_model_pt_0, executing 1 requests
I0824 05:50:11.697411 5921 libtorch.cc:1224] TRITONBACKEND_ModelExecute: Running fc_model_pt_0 with 1 requests
I0824 05:50:11.697553 5921 pinned_memory_manager.cc:162] pinned memory allocation: size 16, addr 0x205000090
I0824 05:50:11.703377 5921 infer_response.cc:167] add response output: output: output__0, type: FP32, shape: [1,2,4]
I0824 05:50:11.703413 5921 http_server.cc:1103] HTTP: unable to provide 'output__0' in GPU, will use CPU
I0824 05:50:11.703433 5921 http_server.cc:1123] HTTP using buffer for: 'output__0', size: 32, addr: 0x7f080b976120
I0824 05:50:11.703449 5921 pinned_memory_manager.cc:162] pinned memory allocation: size 32, addr 0x2050000c0
I0824 05:50:11.703511 5921 infer_response.cc:167] add response output: output: output__1, type: FP32, shape: [1,2,8]
I0824 05:50:11.703528 5921 http_server.cc:1103] HTTP: unable to provide 'output__1' in GPU, will use CPU
I0824 05:50:11.703551 5921 http_server.cc:1123] HTTP using buffer for: 'output__1', size: 64, addr: 0x7f080b978760
I0824 05:50:11.703569 5921 pinned_memory_manager.cc:162] pinned memory allocation: size 64, addr 0x2050000f0
I0824 05:50:11.704716 5921 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x2050000c0
I0824 05:50:11.704754 5921 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x2050000f0
I0824 05:50:11.704885 5921 http_server.cc:1197] HTTP release: size 32, addr 0x7f080b976120
I0824 05:50:11.704903 5921 http_server.cc:1197] HTTP release: size 64, addr 0x7f080b978760
I0824 05:50:11.704947 5921 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x205000090
I notice the following messages. It looks like my model cannot provide its outputs on the GPU, and I don't know the reason for this phenomenon:
- I0824 05:50:11.703413 5921 http_server.cc:1103] HTTP: unable to provide 'output__0' in GPU, will use CPU
- I0824 05:50:11.703528 5921 http_server.cc:1103] HTTP: unable to provide 'output__1' in GPU, will use CPU
Having the same issue, any solution?
Apologies for the delayed reply. Regarding
I0824 05:50:11.703413 5921 http_server.cc:1103] HTTP: unable to provide 'output__0' in GPU, will use CPU
I0824 05:50:11.703528 5921 http_server.cc:1103] HTTP: unable to provide 'output__1' in GPU, will use CPU
This is expected, as the final output buffers are allocated on the CPU so that they can be transported over the network.
I was wondering if the client and the server are run within the same container or separate ones? Besides, does the issue only occur when sending BLS requests to the PyTorch model, or does it also happen when sending requests directly to the PyTorch model?
I would also suggest trying a newer version of Triton, since later releases include lots of bug fixes; in particular, we made some optimizations for GPU tensors in the Python backend. Could you try Triton 24.03 if possible?
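(For reference, that would be the same docker run command quoted earlier in the thread with only the image tag switched to 24.03:)
docker run --gpus all -itd --pid=host --net bridge --name triton-serve --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $PWD/model_repository:/models nvcr.io/nvidia/tritonserver:24.03-py3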