TritonModelException: Failed to open the cudaIpcHandle when calling a PyTorch model on WSL2 Ubuntu 22.04
Hello everyone! I am sending inference requests to a PyTorch model from a Python backend model (BLS), and I hit the error TritonModelException: Failed to open the cudaIpcHandle. I also tried Ubuntu 20.04 LTS and the same error occurs. The environment:
- GPU: GeForce RTX 3060 (single GPU)
- CUDA: 12.0
- container: nvcr.io/nvidia/tritonserver:23.07-py3
- Ubuntu 22.04 LTS / Ubuntu 20.04 LTS (both have been tried)
PyTorch model config file (fc_model_pt)
name: "fc_model_pt"
platform: "pytorch_libtorch"
max_batch_size : 64
input [
{
name: "input__0"
data_type: TYPE_INT64
dims: [ -1 ]
}
]
output [
{
name: "output__0" # 命名规范同输入
data_type: TYPE_FP32
dims: [ -1, -1, 4 ]
},
{
name: "output__1"
data_type: TYPE_FP32
dims: [ -1, -1, 8 ]
}
]
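(For context, the thread does not show how fc_model_pt was exported. A minimal sketch of a TorchScript model matching this config might look like the following; the architecture, layer sizes, and file path are assumptions for illustration, not the actual model from the thread.)
import torch
import torch.nn as nn

class FCModel(nn.Module):
    """Toy model: int64 token ids [batch, seq] -> two float32 heads."""
    def __init__(self, vocab_size=128, hidden=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head0 = nn.Linear(hidden, 4)   # output__0: [batch, seq, 4]
        self.head1 = nn.Linear(hidden, 8)   # output__1: [batch, seq, 8]

    def forward(self, x):
        h = self.embed(x)
        return self.head0(h), self.head1(h)

model = FCModel().cuda().eval()
example = torch.tensor([[1, 2]], dtype=torch.int64, device="cuda")
traced = torch.jit.trace(model, example)
traced.save("model_repository/fc_model_pt/1/model.pt")  # repository path assumed, directory must exist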
Python backend model file (model.py)
import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        # You must parse model_config. JSON string is not parsed here
        self.model_config = model_config = json.loads(args['model_config'])

        # Get output__0 configuration
        output0_config = pb_utils.get_output_config_by_name(
            model_config, "output__0")
        # Get output__1 configuration
        output1_config = pb_utils.get_output_config_by_name(
            model_config, "output__1")

        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(output0_config['data_type'])
        self.output1_dtype = pb_utils.triton_string_to_numpy(output1_config['data_type'])

    def execute(self, requests):
        """
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        output0_dtype = self.output0_dtype
        output1_dtype = self.output1_dtype
        responses = []
        # Every Python backend must iterate over every one of the requests
        # and create a pb_utils.InferenceResponse for each of them.
        for request in requests:
            in_0 = pb_utils.get_input_tensor_by_name(request, "input__0")

            # Fake data for output__0 (fixed values for convenience)
            out_0 = np.array([1, 2, 3, 4, 5, 6, 7, 8])
            out_tensor_0 = pb_utils.Tensor("output__0", out_0.astype(output0_dtype))

            # output__1 is obtained from fc_model_pt via a BLS request
            inference_request = pb_utils.InferenceRequest(
                model_name='fc_model_pt',
                requested_output_names=['output__0', 'output__1'],
                inputs=[in_0])
inference_response = inference_request.exec() # the bug is reported in this line
out_tensor_1 = pb_utils.get_output_tensor_by_name(inference_response, 'output__1')
inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor_0, out_tensor_1],
error=pb_utils.TritonError(' Inference Request occur error'))
responses.append(inference_response)
return responses
    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')
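(Not part of the thread's code: if a BLS output tensor comes back on the GPU rather than the CPU, it can be copied to host memory via DLPack before being re-wrapped for the final response. The sketch below assumes PyTorch is installed in the Python backend environment; the helper name and usage are hypothetical. Note that the exception reported in this thread is raised inside exec() itself, so this is general hardening rather than a fix for the cudaIpcHandle error.)
import torch
import triton_python_backend_utils as pb_utils

def as_cpu_tensor(bls_tensor, name):
    """Re-wrap a BLS output tensor so the final response tensor lives on the CPU."""
    if bls_tensor.is_cpu():
        return bls_tensor
    # GPU-resident output: copy to host through DLPack, then build a CPU pb_utils.Tensor
    host = torch.utils.dlpack.from_dlpack(bls_tensor.to_dlpack()).cpu().numpy()
    return pb_utils.Tensor(name, host)

# usage inside execute():
#   out_tensor_1 = as_cpu_tensor(
#       pb_utils.get_output_tensor_by_name(inference_response, 'output__1'), 'output__1')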
Python backend config file (custom_model)
name: "custom_model"
backend: "python"
input [
{
name: "input__0"
data_type: TYPE_INT64
dims: [ -1, -1 ]
}
]
output [
{
name: "output__0"
data_type: TYPE_FP32
dims: [ -1, -1, 4 ]
},
{
name: "output__1"
data_type: TYPE_FP32
dims: [ -1, -1, 8 ]
}
]
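(For completeness, the model repository layout implied by the two configs, the error path /models/custom_model/1/model.py, and the -v $PWD/model_repository:/models mount later in the thread would be roughly the following; the model.pt file name for the libtorch model is assumed.)
model_repository/
├── custom_model/
│   ├── config.pbtxt
│   └── 1/
│       └── model.py
└── fc_model_pt/
    ├── config.pbtxt
    └── 1/
        └── model.pt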
The client file
import requests

if __name__ == "__main__":
    request_data = {
        "inputs": [{
            "name": "input__0",
            "shape": [1, 2],
            "datatype": "INT64",
            "data": [[1, 2]]
        }],
        "outputs": [{"name": "output__0"}, {"name": "output__1"}]
    }
    res = requests.post(url="http://localhost:8000/v2/models/fc_model_pt/versions/1/infer", json=request_data)
    print(res)       # this line gets the result <Response [200]>
    res = requests.post(url="http://localhost:8000/v2/models/custom_model/versions/1/infer", json=request_data)
    print(res)       # this line gets <Response [400]>
    print(res.text)  # {"error":"Failed to process the request(s) for model instance 'custom_model', message: TritonModelException: Failed to open the cudaIpcHandle. error: unknown error\n\nAt:\n /models/custom_model/1/model.py(96): execute\n"}
Another client file (using the tritonclient library)
import sys

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import *


def main():
    verbose = False
    url = 'localhost:8000'
    request_count = 1000
    model_name = 'fc_model_pt'
    model_name1 = 'custom_model'
    try:
        # Need to specify large enough concurrency to issue all the
        # inference requests to the server in parallel.
        triton_client = httpclient.InferenceServerClient(
            url=url, verbose=verbose)
        # triton_client1 = httpclient.InferenceServerClient(
        #     url=url, verbose=verbose, concurrency=request_count, max_greenlets=None)
    except Exception as e:
        print("context creation failed: " + str(e))
        sys.exit()

    input0_data = np.array([[1, 2]]).astype(np.int64)
    inputs = [httpclient.InferInput(name='input__0',
                                    shape=[1, 2],
                                    datatype="INT64")]
    inputs[0].set_data_from_numpy(input0_data)

    outputs = []
    outputs.append(httpclient.InferRequestedOutput(name='output__0',
                                                   binary_data=True))
    outputs.append(httpclient.InferRequestedOutput(name='output__1',
                                                   binary_data=True))

    response = triton_client.infer(model_name=model_name1,
                                   inputs=inputs,
                                   outputs=outputs)
    # print(response.get_output('output__0'))
    output__0 = response.as_numpy('output__0')
    output__1 = response.as_numpy('output__1')
    print('output 0 :', output__0)
    print('output 1', output__1)
    triton_client.close()
    return


if __name__ == "__main__":
    main()
Bug
tritonclient.utils.InferenceServerException: [400] Failed to process the request(s) for model instance 'custom_model', message: TritonModelException: Failed to open the cudaIpcHandle. error: unknown error
I suspect this bug is related to the environment, but I don't know the concrete reason. Any help would be appreciated, thanks a lot.
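(Aside: since the failure is in opening a CUDA IPC handle, one way to check whether CUDA IPC works at all inside the WSL2 container, independent of Triton, is to share a CUDA tensor between two processes with PyTorch, which uses cudaIpc memory handles under the hood. This is a diagnostic sketch and not something that was run in the thread; it assumes PyTorch with CUDA is installed in the container, and if CUDA IPC is not functional it typically fails with a similar IPC error.)
import torch
import torch.multiprocessing as mp

def consumer(q):
    t = q.get()                       # receiving the CUDA tensor opens a cudaIpc handle
    print("consumer got:", t.cpu())

if __name__ == "__main__":
    mp.set_start_method("spawn")
    q = mp.Queue()
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    t = torch.arange(4, device="cuda", dtype=torch.float32)
    q.put(t)                          # sharing a CUDA tensor across processes uses CUDA IPC
    p.join()                          # keep `t` alive until the consumer has finished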
Hi @youwan114, can you share the output of nvidia-smi to check the CUDA driver version? I remember we have seen a similar issue before, and upgrading the CUDA driver version helped.
Hello @krishung5, sure, I am very glad that you commented on my issue. This is my output of nvidia-smi:
I am grateful for your attention. Would you mind telling me how you solved this issue?
This is my tritonserver log with the command tritonserver --log-verbose=1 ...:
I0823 03:10:20.056693 14719 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
I0823 03:10:33.767570 14719 http_server.cc:3452] HTTP request: 2 /v2/models/custom_model/infer
I0823 03:10:33.767715 14719 infer_request.cc:751] [request id: 1] prepared: [0x0x7f5d2c002fb0] request id: 1, model: custom_model, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f5d2c002a78] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [1,2]
override inputs:
inputs:
[0x0x7f5d2c002a78] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [1,2]
original requested outputs:
output__0
output__1
requested outputs:
output__0
output__1
I0823 03:10:33.767812 14719 python_be.cc:1263] model custom_model, instance custom_model, executing 1 requests
I0823 03:10:33.788719 14719 infer_request.cc:751] [request id: 2] prepared: [0x0x7f5f180015e0] request id: 2, model: fc_model_pt, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f5f18001b18] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
override inputs:
inputs:
[0x0x7f5f18001b18] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
original requested outputs:
output__0
output__1
requested outputs:
output__0
output__1
I0823 03:10:33.788848 14719 libtorch.cc:2666] model fc_model_pt, instance fc_model_pt, executing 1 requests
I0823 03:10:33.788875 14719 libtorch.cc:1224] TRITONBACKEND_ModelExecute: Running fc_model_pt with 1 requests
I0823 03:10:33.788986 14719 pinned_memory_manager.cc:162] pinned memory allocation: size 16, addr 0x205000090
I0823 03:10:35.424447 14719 infer_response.cc:167] add response output: output: output__0, type: FP32, shape: [1,2,4]
I0823 03:10:35.424560 14719 infer_response.cc:167] add response output: output: output__1, type: FP32, shape: [1,2,8]
I0823 03:10:35.425885 14719 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x205000090
I0823 03:10:35.635865 14719 python_be.cc:2230] TRITONBACKEND_ModelInstanceExecute: model instance name custom_model released 1 requests
I also tried better hardware, and the same result as above is obtained. In my opinion, the server side calls the PyTorch model (fc_model_pt) successfully but cannot return the results. I am looking forward to any reply!
Continuing to test, I tried the CPU version of the PyTorch model (fc_model_pt), and the results are returned correctly, so I do think the problem is located in CUDA.
For further information:
- CPU info:
I0823 09:45:06.263810 72295 http_server.cc:3372] HTTP request: 2 /v2/models/custom_model/infer
I0823 09:45:06.263942 72295 infer_request.cc:729] [request id: 1] prepared: [0x0x7f211c009050] request id: 1, model: custom_model, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f211c008f78] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [1,2]
override inputs:
inputs:
[0x0x7f211c008f78] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [1,2]
original requested outputs:
output__0
output__1
requested outputs:
output__0
output__1
I0823 09:45:06.264078 72295 python_be.cc:1094] model custom_model, instance custom_model_0, executing 1 requests
hello world
input data <c_python_backend_utils.Tensor object at 0x7f107ea66ef0> <class 'c_python_backend_utils.Tensor'>
there is data
I0823 09:45:06.264829 72295 infer_request.cc:729] [request id: 2] prepared: [0x0x7f24300013f0] request id: 2, model: fc_model_pt, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f24300016b8] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
override inputs:
inputs:
[0x0x7f24300016b8] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
original requested outputs:
output__0
output__1
requested outputs:
output__0
output__1
I0823 09:45:06.264912 72295 libtorch.cc:2129] model fc_model_pt, instance fc_model_pt_0, executing 1 requests
I0823 09:45:06.264932 72295 libtorch.cc:988] TRITONBACKEND_ModelExecute: Running fc_model_pt_0 with 1 requests
I0823 09:45:06.265569 72295 infer_response.cc:167] add response output: output: output__0, type: FP32, shape: [1,2,4]
I0823 09:45:06.265598 72295 infer_response.cc:167] add response output: output: output__1, type: FP32, shape: [1,2,8]
I0823 09:45:06.265996 72295 infer_response.cc:167] add response output: output: output__0, type: FP32, shape: [8]
I0823 09:45:06.266024 72295 http_server.cc:1118] HTTP using buffer for: 'output__0', size: 32, addr: 0x7f231c005450
I0823 09:45:06.266040 72295 infer_response.cc:167] add response output: output: output__1, type: FP32, shape: [1,2,8]
I0823 09:45:06.266054 72295 http_server.cc:1118] HTTP using buffer for: 'output__1', size: 64, addr: 0x7f231c005e50
I0823 09:45:06.266089 72295 http_server.cc:1192] HTTP release: size 32, addr 0x7f231c005450
I0823 09:45:06.266103 72295 http_server.cc:1192] HTTP release: size 64, addr 0x7f231c005e50
I0823 09:45:06.266132 72295 python_be.cc:1980] TRITONBACKEND_ModelInstanceExecute: model instance name custom_model_0 released 1 requests
- GPU info:
I0823 09:48:50.309827 76851 http_server.cc:3372] HTTP request: 2 /v2/models/custom_model/infer
I0823 09:48:50.309920 76851 infer_request.cc:729] [request id: 1] prepared: [0x0x7f835c0047d0] request id: 1, model: custom_model, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7f835c004298] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [1,2]
override inputs:
inputs:
[0x0x7f835c004298] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [1,2]
original requested outputs:
output__0
output__1
requested outputs:
output__0
output__1
I0823 09:48:50.310087 76851 python_be.cc:1094] model custom_model, instance custom_model_0, executing 1 requests
hello world
input data <c_python_backend_utils.Tensor object at 0x7fdd74a8f830> <class 'c_python_backend_utils.Tensor'>
there is data
I0823 09:48:50.356416 76851 infer_request.cc:729] [request id: 2] prepared: [0x0x7f88180013f0] request id: 2, model: fc_model_pt, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f88180016b8] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
override inputs:
inputs:
[0x0x7f88180016b8] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
original requested outputs:
output__0
output__1
requested outputs:
output__0
output__1
I0823 09:48:50.356578 76851 libtorch.cc:2129] model fc_model_pt, instance fc_model_pt_0, executing 1 requests
I0823 09:48:50.356608 76851 libtorch.cc:988] TRITONBACKEND_ModelExecute: Running fc_model_pt_0 with 1 requests
I0823 09:48:50.356763 76851 pinned_memory_manager.cc:161] pinned memory allocation: size 16, addr 0x204e00090
I0823 09:48:53.166461 76851 infer_response.cc:167] add response output: output: output__0, type: FP32, shape: [1,2,4]
I0823 09:48:53.166598 76851 infer_response.cc:167] add response output: output: output__1, type: FP32, shape: [1,2,8]
I0823 09:48:53.166788 76851 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x204e00090
I0823 09:48:53.333064 76851 python_be.cc:1980] TRITONBACKEND_ModelInstanceExecute: model instance name custom_model_0 released 1 requests
Thanks for providing the logs. Can you share the docker run ... command that you use for running the container? I wonder if adding the --pid host flag to the command helps.
Hello @krishung5, thank you for your reply. This is my command to run the Docker container, with which the above results were obtained.
docker run --gpus all -itd --pid=host --net bridge --name triton-serve --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $PWD/model_repository:/models nvcr.io/nvidia/tritonserver:23.07-py3
In addition, I continued to test the PyTorch model (fc_model_pt) directly. The following are the logs:
I0824 05:50:11.696933 5921 http_server.cc:3452] HTTP request: 2 /v2/models/fc_model_pt/versions/1/infer
I0824 05:50:11.697026 5921 infer_request.cc:751] [request id: <id_unknown>] prepared: [0x0x7f0850007570] request id: , model: fc_model_pt, requested version: 1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f0850002a48] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
override inputs:
inputs:
[0x0x7f0850002a48] input: input__0, type: INT64, original shape: [1,2], batch + shape: [1,2], shape: [2]
original requested outputs:
output__0
output__1
requested outputs:
output__0
output__1
I0824 05:50:11.697385 5921 libtorch.cc:2666] model fc_model_pt, instance fc_model_pt_0, executing 1 requests
I0824 05:50:11.697411 5921 libtorch.cc:1224] TRITONBACKEND_ModelExecute: Running fc_model_pt_0 with 1 requests
I0824 05:50:11.697553 5921 pinned_memory_manager.cc:162] pinned memory allocation: size 16, addr 0x205000090
I0824 05:50:11.703377 5921 infer_response.cc:167] add response output: output: output__0, type: FP32, shape: [1,2,4]
I0824 05:50:11.703413 5921 http_server.cc:1103] HTTP: unable to provide 'output__0' in GPU, will use CPU
I0824 05:50:11.703433 5921 http_server.cc:1123] HTTP using buffer for: 'output__0', size: 32, addr: 0x7f080b976120
I0824 05:50:11.703449 5921 pinned_memory_manager.cc:162] pinned memory allocation: size 32, addr 0x2050000c0
I0824 05:50:11.703511 5921 infer_response.cc:167] add response output: output: output__1, type: FP32, shape: [1,2,8]
I0824 05:50:11.703528 5921 http_server.cc:1103] HTTP: unable to provide 'output__1' in GPU, will use CPU
I0824 05:50:11.703551 5921 http_server.cc:1123] HTTP using buffer for: 'output__1', size: 64, addr: 0x7f080b978760
I0824 05:50:11.703569 5921 pinned_memory_manager.cc:162] pinned memory allocation: size 64, addr 0x2050000f0
I0824 05:50:11.704716 5921 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x2050000c0
I0824 05:50:11.704754 5921 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x2050000f0
I0824 05:50:11.704885 5921 http_server.cc:1197] HTTP release: size 32, addr 0x7f080b976120
I0824 05:50:11.704903 5921 http_server.cc:1197] HTTP release: size 64, addr 0x7f080b978760
I0824 05:50:11.704947 5921 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x205000090
I notice the following messages. It looks like my model cannot provide its outputs on the GPU, and I don't know the reason for this phenomenon:
- I0824 05:50:11.703413 5921 http_server.cc:1103] HTTP: unable to provide 'output__0' in GPU, will use CPU
- I0824 05:50:11.703528 5921 http_server.cc:1103] HTTP: unable to provide 'output__1' in GPU, will use CPU
Having the same issue, any solution?
Apologies for the delayed reply. Regarding
I0824 05:50:11.703413 5921 http_server.cc:1103] HTTP: unable to provide 'output__0' in GPU, will use CPU
I0824 05:50:11.703528 5921 http_server.cc:1103] HTTP: unable to provide 'output__1' in GPU, will use CPU
This is expected, as the final output buffers are allocated on the CPU so that they can be transported over the network.
I was wondering if the client and the server are run within the same container or separate ones? Besides, does the issue only occur when sending BLS requests to the PyTorch model, or does it also happen when sending requests directly to the PyTorch model?
I would also suggest trying a newer version of Triton, since later releases include lots of bug fixes; in particular, we made some optimizations for GPU tensors in the Python backend. Could you try Triton 24.03 if possible?
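(For reference, that would be the same docker run command quoted earlier in the thread with only the image tag switched to 24.03:)
docker run --gpus all -itd --pid=host --net bridge --name triton-serve --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $PWD/model_repository:/models nvcr.io/nvidia/tritonserver:24.03-py3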