`tritonclient[grpc]==2.24.0` Produces OOMs When Async gRPC Calls Are Performed
Description
It seems that tritonclient does not de-allocate the memory it allocates while building a request to a Triton Inference Server instance until the response is received. As a result, the application consumes more and more memory with each concurrent call to Triton Server, leading to OOMs in production environments.
Triton Information
What version of Triton are you using?
tritonclient[grpc]==2.24.0
Are you using the Triton container or did you build it yourself?
Triton 22.06 from NGC.
To Reproduce
Example code:
import numpy as np
import tritonclient.grpc as grpcclient
# self._client is assumed to be a tritonclient.grpc.aio.InferenceServerClient
# (the async gRPC client); this snippet runs inside an async method of a
# service class.

# Describe the single input tensor; input_image is presumably an HxWxC
# float32 image array.
input = grpcclient.InferInput(
    name=TRITON_CRAFT_MODEL_INPUT_NAME,
    shape=[1, input_image.shape[0], input_image.shape[1], input_image.shape[2]],
    datatype="FP32",
)
output = grpcclient.InferRequestedOutput(TRITON_CRAFT_MODEL_OUTPUT_NAME)

# Copy the image (with a leading batch dimension) into the request buffer.
input.set_data_from_numpy(np.array([input_image]))

# Asynchronous inference call; the request buffer stays alive at least until
# the response is received.
response = await self._client.infer(
    model_name=TRITON_CRAFT_MODEL_NAME,
    model_version=TRITON_CRAFT_MODEL_VERSION,
    inputs=[input],
    outputs=[output],
    client_timeout=TRITON_GRPC_TIMEOUT,
)
result = response.as_numpy(name=TRITON_CRAFT_MODEL_OUTPUT_NAME)
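The snippet above issues a single request; the memory growth described in this issue appears when many such calls are awaited concurrently. Below is a minimal, hypothetical sketch of that concurrent pattern (the server URL, model name, tensor names, image size, and request count are all placeholder assumptions), using the async client from tritonclient.grpc.aio:

import asyncio
import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.grpc.aio as aio_grpcclient

# Placeholder names for illustration only; substitute the real values.
MODEL_NAME = "craft"
INPUT_NAME = "input"
OUTPUT_NAME = "output"

async def infer_one(client, image):
    # Build the request the same way as above; the serialized input stays in
    # client memory until the awaited response comes back.
    inp = grpcclient.InferInput(INPUT_NAME, [1, *image.shape], "FP32")
    inp.set_data_from_numpy(np.expand_dims(image, 0))
    out = grpcclient.InferRequestedOutput(OUTPUT_NAME)
    response = await client.infer(model_name=MODEL_NAME, inputs=[inp], outputs=[out])
    return response.as_numpy(OUTPUT_NAME)

async def main():
    client = aio_grpcclient.InferenceServerClient(url="localhost:8001")
    # Many concurrent calls mean many request buffers alive at the same time,
    # so peak client-side memory grows with the number of in-flight requests.
    images = [np.random.rand(720, 1280, 3).astype(np.float32) for _ in range(64)]
    await asyncio.gather(*(infer_one(client, img) for img in images))

asyncio.run(main())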
Expected behavior
The request memory should be de-allocated once the request has been sent to Triton Server, rather than held until the response arrives.
Hi @narolski, could you provide the memory usage numbers for your OOM use case? Also, could you provide more information about what kind of model you are running inference on?
@jbkyang-nvi Regarding memory deallocation on the client side, I assume it is expected that the memory is not deallocated until the response is received?
Hi @krishung5 👋
I am running the CRAFT model (https://github.com/clovaai/CRAFT-pytorch) converted to ONNX, using the ONNX Backend. I am also running other, proprietary models using the same ONNX Backend. For any model, the OOM issue persists.
The requests to Triton Inference Server are made from a FastAPI-based backend. Since a single backend instance can handle more concurrent requests after the introduction of async gRPC calls, it keeps more Triton client requests in memory while waiting for their responses, and OOM occurs.
It seems to me that the obvious solution would be to de-allocate the input matrix sent to Triton Inference Server as soon as it has been received by the server instance. For now, I have to limit the maximum number of concurrent requests per FastAPI app, which is not ideal (a sketch of such a cap follows below).
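For reference, here is a minimal sketch of that interim workaround, assuming an asyncio-based FastAPI service and the hypothetical infer_one() helper from the sketch above: an asyncio.Semaphore caps how many requests (and therefore request buffers) can be in flight at once.

import asyncio

# Hypothetical cap; tune it to the memory budget of the service.
MAX_IN_FLIGHT = 8
_inference_slots = asyncio.Semaphore(MAX_IN_FLIGHT)

async def bounded_infer(client, image):
    # Each request buffer lives from construction until the response arrives,
    # so capping in-flight requests also caps peak client-side memory.
    async with _inference_slots:
        return await infer_one(client, image)  # hypothetical helper from above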
Hi @narolski, sorry for the late response. I think it is true that the Triton client does not deallocate memory until the request has completed. If that is the problem, we can add it to our backlog of client features.
In the meantime, can you share an example of your client? Thanks
Closing this issue due to lack of activity. Please re-open it if you would like to follow up.