Memory leak with multiple GPUs and BLS
Description
I have multiple GPUs and a single Triton server pod running inside a Kubernetes cluster, serving multiple models including a BLS model and TensorRT engine models.
When my models run on a node with a single GPU there is no issue at all, but adding an additional GPU results in slowly increasing memory usage.
I also observed rapidly increasing memory while using two GPUs, but only on the first one (see the chart below).
Triton Information
Tested with official images:
nvcr.io/nvidia/tritonserver:23.12-py3
nvcr.io/nvidia/tritonserver:24.04-py3
To Reproduce
My sample BLS model looks like the one below:
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

import triton_python_backend_utils as pb_utils

# BBoxService, ImageService, _PREPROCESS_LETTERBOX_SIZE and the __preprocess /
# __postprocess helpers come from my own code (not shown here).


class TritonPythonModel:
    def __init__(self) -> None:
        self.__bbox_service = BBoxService()
        self.__image_service = ImageService()

    async def execute(self, requests):
        responses = []
        for request in requests:
            # Wrap the incoming Triton tensor as a torch tensor on the GPU.
            image_triton = pb_utils.get_input_tensor_by_name(request, "IMAGES")
            image_tensor = from_dlpack(image_triton.to_dlpack()).cuda().to(torch.float32)  # type: ignore
            image_tensor = self.__image_service.reverse_last_channel(image_tensor)
            preprocessed_image_tensor = self.__preprocess(image_tensor, _PREPROCESS_LETTERBOX_SIZE).to(torch.float16)

            # BLS call into the downstream model.
            inference_request_input = pb_utils.Tensor.from_dlpack("images", to_dlpack(preprocessed_image_tensor))  # type: ignore
            inference_request = pb_utils.InferenceRequest(  # type: ignore
                model_name="__model:0",
                requested_output_names=["output0"],
                inputs=[inference_request_input],
            )
            inference_response = await inference_request.async_exec()

            # Post-process the predictions and return the bounding boxes.
            prediction_triton = pb_utils.get_output_tensor_by_name(inference_response, name="output0")
            prediction_tensor = from_dlpack(prediction_triton.to_dlpack())  # type: ignore
            bboxes_tensor = self.__postprocess(prediction_tensor, image_tensor.shape, _PREPROCESS_LETTERBOX_SIZE)
            bboxes_tensor = bboxes_tensor.contiguous()
            bboxes_triton = pb_utils.Tensor.from_dlpack("BBOXES", to_dlpack(bboxes_tensor.to(torch.float16)))  # type: ignore
            inference_response = pb_utils.InferenceResponse(output_tensors=[bboxes_triton])  # type: ignore
            responses.append(inference_response)
        return responses
I am wondering whether the cuda() calls or device="cuda" used inside my preprocess / image service can cause issues while running on multiple GPUs; a device-pinning sketch I am experimenting with follows below.
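A bare .cuda() or device="cuda" always targets the current CUDA device (usually cuda:0), so an instance scheduled on the second GPU could end up copying every tensor back to the first one. Below is a minimal, hedged sketch of what I mean by pinning the work to the device Triton assigned to the instance; it assumes model_instance_device_id is available in the initialize args of the Python backend, and it only echoes the input instead of running the real preprocess / BLS / postprocess chain from the snippet above:

import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Device id assigned to this model instance by Triton (e.g. "0" or "1").
        device_id = args["model_instance_device_id"]
        self.device = torch.device(f"cuda:{device_id}")

    async def execute(self, requests):
        responses = []
        for request in requests:
            image_triton = pb_utils.get_input_tensor_by_name(request, "IMAGES")
            # Move onto this instance's own GPU instead of the default cuda:0.
            image_tensor = from_dlpack(image_triton.to_dlpack()).to(self.device, torch.float32)

            # ... preprocessing / BLS call / postprocessing would go here,
            # keeping every intermediate tensor on self.device ...

            # For the sketch, just echo the tensor back as the output.
            out_triton = pb_utils.Tensor.from_dlpack(
                "BBOXES", to_dlpack(image_tensor.to(torch.float16))
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_triton]))
        return responses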
Expected behavior: no memory leak and proper request load balancing.
@kbegiedza Thanks for reporting this issue. Can you share the code for BBoxService and ImageService as well so that we can repro this issue?
Closing due to inactivity.