Segmentation fault (core dumped) - Server version 2.46.0
Description
Currently running Triton on k8s with Triton server version 2.46.0, we are seeing segmentation faults that cause the server to restart. It happens rather infrequently, maybe once every 1-2 days on some subset of our pods running Triton. Our prior version was 2.44.0 and we never saw this issue before. We also started using redis caching for ensembles in this version.
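For reference, the server is launched roughly like the sketch below (the model repository path, redis host, and port here are placeholders; the real values are redacted in the log further down):

```
# Rough sketch of our launch command; placeholder values only
tritonserver \
  --model-repository=/models \
  --model-control-mode=poll \
  --cache-config=redis,host=<redis-host> \
  --cache-config=redis,port=6379
# Caching is opted into per model; for the ensemble this is (roughly)
# the following stanza in its config.pbtxt:
#   response_cache { enable: true }
```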
Stack trace
ERROR 2024-06-06T11:17:51.579703381Z Segmentation fault (core dumped) tritonserver --model-repository= --model-control-mode= --cache-config=redis,host= --cache-config=redis,port=
ERROR 2024-06-06T11:17:44.820090811Z {}
ERROR 2024-06-06T11:17:44.820086456Z 6# 0x00007FC829FFC850 in /lib/x86_64-linux-gnu/libc.so.6
ERROR 2024-06-06T11:17:44.820081702Z 5# 0x00007FC829F6AAC3 in /lib/x86_64-linux-gnu/libc.so.6
ERROR 2024-06-06T11:17:44.820076868Z 4# 0x00007FC82A1DB253 in /lib/x86_64-linux-gnu/libstdc++.so.6
ERROR 2024-06-06T11:17:44.820072150Z 3# 0x00005AF0728F4359 in tritonserver
ERROR 2024-06-06T11:17:44.820067363Z 2# 0x00005AF0728FE508 in tritonserver
ERROR 2024-06-06T11:17:44.820061732Z 1# 0x00007FC829F18520 in /lib/x86_64-linux-gnu/libc.so.6
ERROR 2024-06-06T11:17:44.820035490Z 0# 0x00005AF07289C04D in tritonserver
Not sure if it's related, but one of the containers had "corrupted double-linked list (not small)" in the logs.
Triton Information
| Option | Value |
| --- | --- |
| cache_enabled | 1 |
| exit_timeout | 30 |
| strict_readiness | 1 |
| min_supported_compute_capability | 6.0 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| pinned_memory_pool_byte_size | 268435456 |
| rate_limit | OFF |
| model_config_name | |
| strict_model_config | 0 |
| model_control_mode | MODE_POLL |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| server_version | 2.46.0 |
| server_id | triton |
Are you using the Triton container or did you build it yourself?
Triton container: nvcr.io/nvidia/tritonserver:24.05-py3
To Reproduce
Hard to reproduce since it is very infrequent and possibly random. The model is an ensemble consisting of a PyTorch backend model and a Python backend model.
Expected behavior
No core dump.
Another interesting thing we just noticed: when our redis instance went down for a few minutes and the pods lost connection to it, all of our Triton servers hit this same segmentation fault at the same time and all restarted. Not sure whether that pattern is also the reason we are seeing those core dumps intermittently, but either way, these segmentation faults seem like unexpected behavior.
@rahchuenmonroe we would appreciate it if you could experiment and give us more concrete steps to reproduce the issue.
[6861] created for tracking
> Our prior version was 2.44.0 and we never saw this issue before. We also started using redis caching for ensembles in this version.
In addition to @statiraju's comment about providing reproduction steps, it would be great if you could also try narrowing down the set of things that changed to help isolate the issue. For example, two experiments (rough launch commands are sketched after the list):
- 24.04 vs 24.05 both with no redis caching, see if 24.05 introduces failure
- 24.05 no redis caching vs. 24.05 with redis caching, see if redis caching introduces failure
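Something along these lines, as an illustration only; the model repository path, redis host/port, and use of docker (rather than your k8s deployment) are placeholders:

```
# Experiment 1: 24.04 vs 24.05, both without redis caching
docker run --gpus all --rm -v /path/to/models:/models \
  nvcr.io/nvidia/tritonserver:24.04-py3 \
  tritonserver --model-repository=/models --model-control-mode=poll

docker run --gpus all --rm -v /path/to/models:/models \
  nvcr.io/nvidia/tritonserver:24.05-py3 \
  tritonserver --model-repository=/models --model-control-mode=poll

# Experiment 2: 24.05 without vs. with redis caching (same image, add the cache flags)
docker run --gpus all --rm -v /path/to/models:/models \
  nvcr.io/nvidia/tritonserver:24.05-py3 \
  tritonserver --model-repository=/models --model-control-mode=poll \
  --cache-config=redis,host=<redis-host> --cache-config=redis,port=6379
```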
To clarify, we were using 24.03 prior and are now using 24.05 because of ensemble caching.
So far:
- 24.05, caching enabled: pods restart after core dumps.
- 24.05, caching disabled: no restarts.
- 24.03, caching disabled: no restarts.

Can't test 24.03 with caching enabled since ensemble caching is only in the latest version.
The only way I have been able to (somewhat accidentally) reproduce this was when the redis instance went down for a few minutes during an update, after which all pods core dumped. I'm not sure if that's something you folks can reproduce on your end? It's hard to say whether that is also what is happening intermittently, but it does seem a bit suspicious. From my understanding, if the redis caching request fails, the request should just be processed normally instead, so losing the connection to redis should not crash the server. A rough sketch of how I would try to trigger it is below.
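The model name, redis container name, and use of perf_analyzer here are just illustrative, not our exact setup:

```
# Start the server with redis caching enabled (as in the launch sketch above),
# then drive steady load against the cached ensemble, e.g. with perf_analyzer:
perf_analyzer -m my_ensemble --concurrency-range 1:4

# While requests are in flight, take the redis instance down for a few minutes,
# e.g. if it runs as a local container:
docker stop my-redis
# (in our case it was the managed redis instance going down during an update)

# Expected: cache lookups/inserts fail and requests fall back to normal execution.
# Observed: tritonserver segfaults and the pod restarts.
```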
Let me know if I can provide any more information on this, thanks for looking into this!
FYI, we were using 24.07 with no caching and still got this error a couple of times, although it was rare:
E0420 21:02:21.661436313 345 sync.cc:1005] ASSERTION FAILED: prior > 0
Signal (6) received.
malloc(): unsorted double linked list corrupted
Signal (6) received.
@rmccorm4 @statiraju did we find any resolution for this? Do we need to upgrade to the latest Triton version?