Segmentation fault (core dumped) - Server version 2.46.0
Description
Currently running Triton on k8s with Triton server version 2.46.0, we are seeing segmentation faults that cause the server to restart. It happens rather infrequently, maybe once every 1-2 days on some subset of our pods running Triton. Our prior version was 2.44.0 and we never saw this issue before. We also started using redis caching for ensembles in this version.
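For reference, the server is launched roughly like the sketch below (the model repository path, redis host, and port here are placeholders; the real values are redacted in the log further down):

```
# Rough sketch of our launch command; placeholder values only
tritonserver \
  --model-repository=/models \
  --model-control-mode=poll \
  --cache-config=redis,host=<redis-host> \
  --cache-config=redis,port=6379
# Caching is opted into per model; for the ensemble this is (roughly)
# the following stanza in its config.pbtxt:
#   response_cache { enable: true }
```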
Stack trace
ERROR 2024-06-06T11:17:51.579703381Z Segmentation fault (core dumped) tritonserver --model-repository= --model-control-mode= --cache-config=redis,host= --cache-config=redis,port=
ERROR 2024-06-06T11:17:44.820090811Z {}
ERROR 2024-06-06T11:17:44.820086456Z 6# 0x00007FC829FFC850 in /lib/x86_64-linux-gnu/libc.so.6
ERROR 2024-06-06T11:17:44.820081702Z 5# 0x00007FC829F6AAC3 in /lib/x86_64-linux-gnu/libc.so.6
ERROR 2024-06-06T11:17:44.820076868Z 4# 0x00007FC82A1DB253 in /lib/x86_64-linux-gnu/libstdc++.so.6
ERROR 2024-06-06T11:17:44.820072150Z 3# 0x00005AF0728F4359 in tritonserver
ERROR 2024-06-06T11:17:44.820067363Z 2# 0x00005AF0728FE508 in tritonserver
ERROR 2024-06-06T11:17:44.820061732Z 1# 0x00007FC829F18520 in /lib/x86_64-linux-gnu/libc.so.6
ERROR 2024-06-06T11:17:44.820035490Z 0# 0x00005AF07289C04D in tritonserver
Not sure if it's related, but one of the containers had "corrupted double-linked list (not small)" in the logs.
Triton Information
| Option | Value |
| --- | --- |
| cache_enabled | 1 |
| exit_timeout | 30 |
| strict_readiness | 1 |
| min_supported_compute_capability | 6.0 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| pinned_memory_pool_byte_size | 268435456 |
| rate_limit | OFF |
| model_config_name | |
| strict_model_config | 0 |
| model_control_mode | MODE_POLL |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| server_version | 2.46.0 |
| server_id | triton |
Are you using the Triton container or did you build it yourself?
Triton container: nvcr.io/nvidia/tritonserver:24.05-py3
To Reproduce
Hard to reproduce since it is very infrequent and possibly random. The model is an ensemble consisting of a PyTorch backend model and a Python backend model.
Expected behavior
No core dump.
Another interesting thing we just noticed: when our redis instance went down for a few minutes and the pods lost connection to it, all of our Triton servers hit this same segmentation fault at the same time and all restarted. Not sure whether that pattern is also the reason we are seeing those core dumps intermittently, but either way, these segmentation faults seem like unexpected behavior.
@rahchuenmonroe we would appreciate it if you could experiment and give us more concrete steps to reproduce the issue.
[6861] created for tracking
> Our prior version was 2.44.0 and we never saw this issue before. We also started using redis caching for ensembles in this version.
In addition to @statiraju's comment about providing reproduction steps, it would be great if you could also try narrowing down the set of things that changed to help isolate the issue. For example, two experiments (rough launch commands are sketched after the list):
- 24.04 vs 24.05 both with no redis caching, see if 24.05 introduces failure
- 24.05 no redis caching vs. 24.05 with redis caching, see if redis caching introduces failure
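Something along these lines, as an illustration only; the model repository path, redis host/port, and use of docker (rather than your k8s deployment) are placeholders:

```
# Experiment 1: 24.04 vs 24.05, both without redis caching
docker run --gpus all --rm -v /path/to/models:/models \
  nvcr.io/nvidia/tritonserver:24.04-py3 \
  tritonserver --model-repository=/models --model-control-mode=poll

docker run --gpus all --rm -v /path/to/models:/models \
  nvcr.io/nvidia/tritonserver:24.05-py3 \
  tritonserver --model-repository=/models --model-control-mode=poll

# Experiment 2: 24.05 without vs. with redis caching (same image, add the cache flags)
docker run --gpus all --rm -v /path/to/models:/models \
  nvcr.io/nvidia/tritonserver:24.05-py3 \
  tritonserver --model-repository=/models --model-control-mode=poll \
  --cache-config=redis,host=<redis-host> --cache-config=redis,port=6379
```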
To clarify, we were using 24.03 prior and are now using 24.05 because of ensemble caching.
So far:
- 24.05, caching enabled: pods restart after core dumps.
- 24.05, caching disabled: no restarts.
- 24.03, caching disabled: no restarts.

Can't test 24.03 with caching enabled since ensemble caching is only in the latest version.
The only way I have been able to (somewhat accidentally) reproduce this was when the redis instance went down for a few minutes during an update, after which all pods core dumped. I'm not sure if that's something you folks can reproduce on your end? It's hard to say whether that is also what is happening intermittently, but it does seem a bit suspicious. From my understanding, if the redis caching request fails, the request should just be processed normally instead, so losing the connection to redis should not crash the server. A rough sketch of how I would try to trigger it is below.
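The model name, redis container name, and use of perf_analyzer here are just illustrative, not our exact setup:

```
# Start the server with redis caching enabled (as in the launch sketch above),
# then drive steady load against the cached ensemble, e.g. with perf_analyzer:
perf_analyzer -m my_ensemble --concurrency-range 1:4

# While requests are in flight, take the redis instance down for a few minutes,
# e.g. if it runs as a local container:
docker stop my-redis
# (in our case it was the managed redis instance going down during an update)

# Expected: cache lookups/inserts fail and requests fall back to normal execution.
# Observed: tritonserver segfaults and the pod restarts.
```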
Let me know if I can provide any more information on this, thanks for looking into this!
FYI, we were using 24.07 with no caching and still got this error a couple of times, although it was rare:
E0420 21:02:21.661436313 345 sync.cc:1005] ASSERTION FAILED: prior > 0
Signal (6) received.
malloc(): unsorted double linked list corrupted
Signal (6) received.
@rmccorm4 @statiraju did we find any resolution for this? Do we need to upgrade to the latest Triton version?