Health check should not say it's ready when a CUDA device-side assertion error is triggered
Is your feature request related to a problem? Please describe. If one of the models triggers a CUDA device-side assertion error, all the model instances in that trtis process are blocked from making any further CUDA API calls. In this case, the trtis health check API should report 'not ready', but right now it reports that the server and the model are ready.
# example response reporting a CUDA assertion error
{"error":"pinned input buffer H2D: failed to perform CUDA copy: device-side assert triggered"}
$ curl -v localhost:8000/v2/health/ready
* Trying 127.0.0.1:8000...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
<
* Connection #0 to host localhost left intact
$ curl -v localhost:8000/v2/health/live
* Trying 127.0.0.1:8000...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/live HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
<
* Connection #0 to host localhost left intact
$ curl -v localhost:8000/v2/models/cuda-error/ready
* Trying 127.0.0.1:8000...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/models/cuda-error/ready HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
<
* Connection #0 to host localhost left intact
Describe the solution you'd like If there are GPU models, the trtis health check API should check whether a CUDA device-side assertion error has been triggered and report 'not ready' if so.
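For illustration only (this is just a sketch, not trtis internals, and cuda_is_healthy is a hypothetical helper), a PyTorch-based probe can detect the poisoned context because, once a device-side assert fires, any subsequent CUDA call in the process raises:

import torch

def cuda_is_healthy(device: str = 'cuda:0') -> bool:
    # After a device-side assert, even a trivial allocation or a
    # synchronize on the affected context raises a RuntimeError.
    try:
        torch.zeros(1, device=device)
        torch.cuda.synchronize(device)
        return True
    except RuntimeError:
        return False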
Additional context This is a simple model that triggers a CUDA device-side assertion error:
import torch
import torch.nn as nn

class CudaError(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # gather with an out-of-range index fires a device-side assert
        return torch.gather(torch.tensor([0]).cuda(), dim=0,
                            index=torch.tensor([1]).cuda()).type(torch.int32)

torch.jit.save(torch.jit.script(CudaError()), 'model.pt')
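As a quick local check (just an illustration, outside of trtis), loading the saved module and calling it reproduces the sticky error; either the call itself or the following synchronize raises, and every later CUDA call in the process fails the same way:

import torch

model = torch.jit.load('model.pt')
try:
    model(torch.zeros(1))      # forward() ignores its input
    torch.cuda.synchronize()   # surface the asynchronous device assert
except RuntimeError as e:
    print('CUDA assert triggered:', e)
# From here on, any CUDA call in this process raises RuntimeError.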
As a workaround, you can unload the model and use the model's health check path (which returns unhealthy after the model is unloaded).
Assuming the model is called mymodel, this is how you can unload it:
from urllib import request as urllib_request

try:
    # inference logic ...
    ...
except RuntimeError as e:
    if 'CUDA error: device-side assert triggered' in str(e):
        # ask trtis to unload the poisoned model so its readiness check fails
        unload_request = urllib_request.Request(
            'http://localhost:5000/v2/repository/models/mymodel/unload',
            method='POST')
        with urllib_request.urlopen(unload_request) as resp:
            pass
    raise
The model's health check path is /v2/models/mymodel/ready. Calls to the model's health check will fail after the model is unloaded.
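If you want to poll that path from Python, here is a small sketch using only the standard library (the host and port simply mirror the unload snippet above); a non-200 answer after the unload can be treated as unhealthy:

from urllib import request as urllib_request
from urllib.error import HTTPError

def model_is_ready(url='http://localhost:5000/v2/models/mymodel/ready'):
    # Returns True only if the readiness endpoint answers 200.
    try:
        with urllib_request.urlopen(url) as resp:
            return resp.status == 200
    except HTTPError:
        # trtis answers with an error status once the model is unloaded.
        return False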
So do we still need this workaround? I found that when signal 11 is received, the health check is still OK 😂
For example: the Python backend process hangs with this signal 11 error, but Triton readiness is still OK, so requests keep coming in:
DownCropResizer is doing nothing!
{"pod_name": "sd15-triton-5cc495c8cc-zjhmx", "namespace": "production", "log_type": "access_log", "request_id": "3be8a710-3428-11ee-ba3c-00163e253f9a", "encoder_msg": "jasee", "event": "kestrel encode", "logger": "triton_logger", "level": "info", "timestamp": "2023-08-06T07:09:56.292837Z"}
Signal (11) received.
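One possible mitigation (only a sketch; the model name and tensor names below are placeholders that have to match your model's config) is to base the readiness probe on a real, tiny inference with a timeout instead of /v2/health/ready, since a hung backend process stalls the request:

import json
from urllib import request as urllib_request

def backend_is_responsive(timeout_s: float = 5.0) -> bool:
    # Placeholder model/tensor names -- adjust to your deployment.
    body = json.dumps({
        'inputs': [{'name': 'INPUT__0', 'shape': [1],
                    'datatype': 'FP32', 'data': [0.0]}]
    }).encode()
    req = urllib_request.Request(
        'http://localhost:8000/v2/models/mymodel/infer',
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST')
    try:
        with urllib_request.urlopen(req, timeout=timeout_s) as resp:
            return resp.status == 200
    except Exception:
        # Timeouts, connection errors and non-2xx all count as not ready.
        return False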
Do we have model readiness probes?
I also observed this issue. I can reproduce it by overloading Triton with more images/sec than it can handle. This provokes the segfault / "Signal (11) received." message, but the server still reports that it is ready. In my opinion this is a bug in Triton, especially when the strict readiness check is enabled!