clearml-serving

triton model breaks serving instance

Open stephanbertl opened this issue 1 year ago • 4 comments

We have set up ClearML Serving on Kubernetes, including Triton support. Our Triton instance has no GPU, so deploying a model leads to the following error in the Triton instance:

E0718 07:41:21.083440 30 model_lifecycle.cc:596] failed to load 'distilbert-test2' version 1: Invalid argument: unable to load model 'distilbert-test2', TensorRT backend supports only GPU device
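The error above comes from the TensorRT backend, which only runs on a GPU. One way to catch this before deployment is to scan the model's Triton `config.pbtxt` for a TensorRT platform or backend setting and reject the model when the target instance is CPU-only. The sketch below is a rough heuristic under stated assumptions: the function name `requires_gpu_backend` and the sample config text are hypothetical, and the regex is not a full protobuf-text parser.

```python
import re

# Platform/backend values that Triton associates with the TensorRT backend.
_TENSORRT_VALUES = {"tensorrt_plan", "tensorrt"}

# Matches lines such as:  platform: "tensorrt_plan"   or   backend: "tensorrt"
_FIELD_RE = re.compile(r'^\s*(platform|backend)\s*:\s*"([^"]*)"', re.MULTILINE)

# Hypothetical config text, for illustration only.
SAMPLE_CONFIG = '''
name: "distilbert-test2"
platform: "tensorrt_plan"
'''


def requires_gpu_backend(config_pbtxt: str) -> bool:
    """Return True if the config text names a TensorRT platform or backend."""
    return any(
        value in _TENSORRT_VALUES
        for _field, value in _FIELD_RE.findall(config_pbtxt)
    )


if __name__ == "__main__":
    if requires_gpu_backend(SAMPLE_CONFIG):
        print("Model uses TensorRT; do not deploy it to a CPU-only Triton instance.")
```

A check like this would run on the client side before adding the endpoint, so a GPU-only model never reaches the CPU-only instance.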

Trying to remove the model again is not possible: `clearml-serving --id 5097f44fe9cb45f7be2a917c6fe8cad9 model remove --endpoint distilbert-test2`

yields the following:

`clearml-serving - CLI for launching ClearML serving engine
2023-07-18 09:47:59,260 - clearml.Task - ERROR - Failed reloading task 5097f44fe9cb45f7be2a917c6fe8cad9
2023-07-18 09:47:59,290 - clearml.Task - ERROR - Failed reloading task 5097f44fe9cb45f7be2a917c6fe8cad9

Error: Task ID "5097f44fe9cb45f7be2a917c6fe8cad9" could not be found`

In general, our observation is that ClearML Serving is not resilient against this kind of problem. A broken model should not break the instance.

stephanbertl, Jul 18 '23 07:07

Hi @stephanbertl, thanks for this report. We will look into it 🙂

jkhenning, Jul 18 '23 20:07

Any update? The serving module seems totally unstable: a model that does not work breaks the whole serving server. How is that supposed to work in production?

stephanbertl, Nov 20 '23 15:11

Hi @stephanbertl, I have not managed to reproduce this. Can you provide some more information? Specifically, I assume you're using the serving Helm chart, is that correct? Can you share how you configured it?

jkhenning, Nov 21 '23 07:11

@jkhenning sorry for not getting back to you earlier.

I would say the culprit is the `tritonserver` default value of `--exit-on-error=true`.

I quickly checked the code and could not find a way to set this in clearml-serving.
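For reference, here is a minimal sketch of what passing this flag would look like if the `tritonserver` command line were under direct control. It only builds the argument list and does not start anything. The helper name `build_triton_command` and the repository path `/models` are hypothetical, and whether clearml-serving can forward such a flag is not established in this thread. `--model-repository` and `--exit-on-error` are standard `tritonserver` options.

```python
from typing import List


def build_triton_command(model_repository: str, exit_on_error: bool = False) -> List[str]:
    """Build a tritonserver argument list with an explicit --exit-on-error value."""
    return [
        "tritonserver",
        f"--model-repository={model_repository}",
        # Per the thread, Triton defaults to true; "false" is meant to keep
        # the server running when a single model fails to load.
        f"--exit-on-error={'true' if exit_on_error else 'false'}",
    ]


if __name__ == "__main__":
    # "/models" is a placeholder path, not taken from the reported setup.
    print(" ".join(build_triton_command("/models")))
```

The resulting list could be handed to `subprocess.run` or placed in a container's `args`, depending on how the Triton process is launched in a given deployment.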

stephanbertl, May 03 '24 08:05