`tritonclient[grpc]==2.24.0` Produces OOMs When Async gRPC Calls Are Performed
Description
It seems that tritonclient does not de-allocate the memory it allocates while building a request to a Triton Inference Server instance until the response is received. As a result, the application consumes more and more memory with each concurrent call to Triton Server, leading to OOMs in production environments.
Triton Information
What version of Triton are you using?
tritonclient[grpc]==2.24.0
Are you using the Triton container or did you build it yourself?
Triton 22.06 from NGC.
To Reproduce
Example code:
import numpy as np
import tritonclient.grpc as grpcclient
# self._client is assumed to be a tritonclient.grpc.aio.InferenceServerClient
# (the async gRPC client); this snippet runs inside an async method of a
# service class.

# Describe the single input tensor; input_image is presumably an HxWxC
# float32 image array.
input = grpcclient.InferInput(
    name=TRITON_CRAFT_MODEL_INPUT_NAME,
    shape=[1, input_image.shape[0], input_image.shape[1], input_image.shape[2]],
    datatype="FP32",
)
output = grpcclient.InferRequestedOutput(TRITON_CRAFT_MODEL_OUTPUT_NAME)

# Copy the image (with a leading batch dimension) into the request buffer.
input.set_data_from_numpy(np.array([input_image]))

# Asynchronous inference call; the request buffer stays alive at least until
# the response is received.
response = await self._client.infer(
    model_name=TRITON_CRAFT_MODEL_NAME,
    model_version=TRITON_CRAFT_MODEL_VERSION,
    inputs=[input],
    outputs=[output],
    client_timeout=TRITON_GRPC_TIMEOUT,
)
result = response.as_numpy(name=TRITON_CRAFT_MODEL_OUTPUT_NAME)
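The snippet above issues a single request; the memory growth described in this issue appears when many such calls are awaited concurrently. Below is a minimal, hypothetical sketch of that concurrent pattern (the server URL, model name, tensor names, image size, and request count are all placeholder assumptions), using the async client from tritonclient.grpc.aio:

import asyncio
import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.grpc.aio as aio_grpcclient

# Placeholder names for illustration only; substitute the real values.
MODEL_NAME = "craft"
INPUT_NAME = "input"
OUTPUT_NAME = "output"

async def infer_one(client, image):
    # Build the request the same way as above; the serialized input stays in
    # client memory until the awaited response comes back.
    inp = grpcclient.InferInput(INPUT_NAME, [1, *image.shape], "FP32")
    inp.set_data_from_numpy(np.expand_dims(image, 0))
    out = grpcclient.InferRequestedOutput(OUTPUT_NAME)
    response = await client.infer(model_name=MODEL_NAME, inputs=[inp], outputs=[out])
    return response.as_numpy(OUTPUT_NAME)

async def main():
    client = aio_grpcclient.InferenceServerClient(url="localhost:8001")
    # Many concurrent calls mean many request buffers alive at the same time,
    # so peak client-side memory grows with the number of in-flight requests.
    images = [np.random.rand(720, 1280, 3).astype(np.float32) for _ in range(64)]
    await asyncio.gather(*(infer_one(client, img) for img in images))

asyncio.run(main())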
Expected behavior
The request memory should be de-allocated once the request has been sent to Triton Server, rather than held until the response arrives.
Hi @narolski, could you provide the memory usage numbers for your OOM use case? Also, could you provide more information about what kind of model you are running inference on?
@jbkyang-nvi Regarding memory deallocation on the client side, I assume it is expected that the memory is not deallocated until the response is received?
Hi @krishung5 👋
I am running the CRAFT model (https://github.com/clovaai/CRAFT-pytorch) converted to ONNX, using the ONNX Backend. I am also running other, proprietary models using the same ONNX Backend. For any model, the OOM issue persists.
The requests to Triton Inference Server are made from a FastAPI-based backend. Since a single backend instance can handle more concurrent requests after the introduction of async gRPC calls, it keeps more Triton client requests in memory while waiting for their responses, and OOM occurs.
It seems to me that the obvious solution would be to de-allocate the input matrix sent to Triton Inference Server as soon as it has been received by the server instance. For now, I have to limit the maximum number of concurrent requests per FastAPI app, which is not ideal (a sketch of such a cap follows below).
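For reference, here is a minimal sketch of that interim workaround, assuming an asyncio-based FastAPI service and the hypothetical infer_one() helper from the sketch above: an asyncio.Semaphore caps how many requests (and therefore request buffers) can be in flight at once.

import asyncio

# Hypothetical cap; tune it to the memory budget of the service.
MAX_IN_FLIGHT = 8
_inference_slots = asyncio.Semaphore(MAX_IN_FLIGHT)

async def bounded_infer(client, image):
    # Each request buffer lives from construction until the response arrives,
    # so capping in-flight requests also caps peak client-side memory.
    async with _inference_slots:
        return await infer_one(client, image)  # hypothetical helper from above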
Hi @narolski, sorry for the late response. I think it is true that the Triton client does not deallocate memory until the request has completed. If that is the problem, we can add it to our backlog of client features.
In the meantime, can you share an example of your client? Thanks
Closing this issue due to lack of activity. Please re-open it if you would like to follow up.