Signal 6 or Signal 11 from python backend.

Open kbegiedza opened this issue 1 year ago • 2 comments

Description

I have a k8s cluster with multiple GPUs and a single Triton server pod serving multiple models, including BLS-based models.

Sometimes, under heavy load, Triton restarts with a Signal 6 or Signal 11 error (trace logs below).
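
For readers decoding the numbers: on Linux, signal 6 is SIGABRT (typically raised by `abort()`, for example after a failed assertion or an uncaught C++ exception) and signal 11 is SIGSEGV (an invalid memory access). A minimal Python sketch for looking the names up; the helper name `describe_signal` is made up for illustration and is not part of Triton:

```python
import signal


def describe_signal(signum: int) -> str:
    """Return the symbolic name of a numeric signal (hypothetical helper)."""
    # signal.Signals is the standard-library enum of signals known on this platform.
    return signal.Signals(signum).name


if __name__ == "__main__":
    # The two signal numbers reported in this issue.
    for signum in (6, 11):
        print(signum, describe_signal(signum))
```

The mapping is platform dependent in general, so this should be run on the same OS as the Triton container.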

Right before the crash, I can see the server allocating about 2x its usual RAM:

image
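
To correlate the memory growth with the crash, resident memory could be logged from inside the pod. The following is only a sketch under assumptions: it reads the Linux `/proc/<pid>/status` file, the helper names `parse_vmrss_kib` and `read_rss_kib` are hypothetical, and the PID of the `tritonserver` process has to be supplied by the caller. The figure it reports is per-process RSS, which may differ from what a k8s dashboard shows for the container.

```python
import re
from pathlib import Path
from typing import Optional, Union


def parse_vmrss_kib(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in KiB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid: Union[int, str] = "self") -> Optional[int]:
    """Read the resident set size of a process on Linux (assumes /proc is mounted)."""
    return parse_vmrss_kib(Path(f"/proc/{pid}/status").read_text())


if __name__ == "__main__":
    # "self" reports on this Python process; pass the tritonserver PID instead.
    print("RSS (KiB):", read_rss_kib("self"))
```

Calling `read_rss_kib` periodically and writing the value with a timestamp would give a trace that can be lined up against the time of the Signal 6 or Signal 11 restart.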

Triton Information

nvcr.io/nvidia/tritonserver:23.02-py3

To Reproduce

Unknown?

Full log below: triton-server.log

Expected behavior

Stable execution.

kbegiedza avatar Jan 17 '24 20:01 kbegiedza

Hi @kbegiedza, as a preliminary check, can you see if you can replicate this behavior on our latest container nvcr.io/nvidia/tritonserver:23.12-py3? Thanks.

nv-kmcgill53 avatar Jan 18 '24 20:01 nv-kmcgill53

We have the same problem with every 23.XX version of Triton Server. Interestingly, we get a signal 11 error with one GPU, and sometimes signal 6 with two GPUs.

Fleyderer avatar Jan 29 '24 08:01 Fleyderer