gRPC Segfaults in Triton 24.06 due to Low Request Cancellation Timeout
Description We use gRPC to query Triton for Model Ready, Model Metadata, and Model Inference requests. When the Triton server runs for a sustained period of time, we get unexpected segfaults [Signal (11) received]. The segfault trace is attached to this issue, but the crash cannot be predicted and happens across our servers at irregular intervals.
Triton Information What version of Triton are you using? Version 24.05. I also built a CPU-only version with debug symbols and reproduced the same issue there.
Are you using the Triton container or did you build it yourself? I can reproduce the issue both in the Triton container from NGC and in my own custom build.
To Reproduce Steps to reproduce the behavior.
- Use xDS with gRPC client side load balancing for routing requests from client to multiple Triton servers
- Use a basic Golang client to query the model in Triton (like here); a hedged sketch of such a client follows these steps. [Uses gRPC v1.63.2]
- Use a Tensorflow DNN model with only numeric or string features
- Set up a Model Repo Agent as a sidecar within the same pod as Triton to copy models from S3 to the pod, then trigger a model load in Triton with a default config.pbtxt as an HTTP payload, as described here.
- Set up Triton 24.05 as a container within the same pod and use the startup command:
tritonserver --model-store=/model_repo \
--model-control-mode=explicit \
--exit-on-error=true \
--strict-readiness=true \
--allow-cpu-metrics=true
- Add Tensorflow TFDF to Triton to support GBT models:
RUN wget https://files.pythonhosted.org/packages/b8/1a/f1a21d24357b9f760e791c7b54804535421de1f1ee08a456d3a7f7ec7bbb/tensorflow_decision_forests-1.8.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl && \
unzip ./tensorflow_decision_forests-1.8.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl -d ./tensorflow_decision_forests && \
cp ./tensorflow_decision_forests/tensorflow_decision_forests/tensorflow/ops/inference/inference.so /home/inference.so
- Use the following environment variables:
- name: TF_ENABLE_ONEDNN_OPTS
value: "0"
- name: LD_PRELOAD
value: /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
- name: GRPC_VERBOSITY
value: INFO
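For reference, a minimal sketch of the kind of Golang client used in the steps above, assuming the Go stubs generated from Triton's grpc_service.proto are imported here as package inference; the import path, server address, model name, input name, and tensor shape are placeholders, not our production values:
package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    // Placeholder import path for the Go stubs generated from grpc_service.proto.
    inference "example.com/generated/inference"
)

func main() {
    // Plain connection to a Triton gRPC endpoint (address is illustrative).
    conn, err := grpc.Dial("triton.example.com:8001",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatalf("dial: %v", err)
    }
    defer conn.Close()

    client := inference.NewGRPCInferenceServiceClient(conn)

    // Per-request deadline; in our production setup this can be as low as 1-4 ms.
    ctx, cancel := context.WithTimeout(context.Background(), 4*time.Millisecond)
    defer cancel()

    req := &inference.ModelInferRequest{
        ModelName:    "my_model", // hypothetical model name
        ModelVersion: "1",
        Inputs: []*inference.ModelInferRequest_InferInputTensor{{
            Name:     "input_1", // hypothetical input tensor name
            Datatype: "FP32",
            Shape:    []int64{1, 4},
            Contents: &inference.InferTensorContents{
                Fp32Contents: []float32{0.1, 0.2, 0.3, 0.4},
            },
        }},
    }

    resp, err := client.ModelInfer(ctx, req)
    if err != nil {
        // With very short deadlines this is typically codes.DeadlineExceeded.
        log.Fatalf("ModelInfer: %v", err)
    }
    log.Printf("received %d output tensor(s)", len(resp.Outputs))
}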
Description of node
c6i.16xlarge instance from AWS
Error occurred
2024-06-24 14:36:06.709 {"stream":"stderr","logtag":"F","log":"Signal (11) received."}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 0# 0x0000564E2B34EA9D in tritonserver"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 1# 0x0000782BD912F520 in /lib/x86_64-linux-gnu/libc.so.6"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 2# 0x0000564E2B3B3706 in tritonserver"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 3# 0x0000564E2B3AD954 in tritonserver"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 4# 0x0000564E2B3AE85B in tritonserver"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 5# 0x0000564E2B3A40B9 in tritonserver"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 6# 0x0000782BD93F2253 in /lib/x86_64-linux-gnu/libstdc++.so.6"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 7# 0x0000782BD9181AC3 in /lib/x86_64-linux-gnu/libc.so.6"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 8# 0x0000782BD9213850 in /lib/x86_64-linux-gnu/libc.so.6"}
GDB Trace by building Debug Container as described here
Thread 11 "tritonserver" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff1e65000 (LWP 155)]
0x0000555555719724 in triton::server::grpc::InferHandlerState<grpc::ServerAsyncResponseWriter<inference::ModelInferResponse>, inference::ModelInferRequest, inference::ModelInferResponse>::Context::IsCancelled (this=0x0) at /workspace/src/grpc/infer_handler.h:669
669 /workspace/src/grpc/infer_handler.h: No such file or directory.
(gdb) bt
#0 0x0000555555719724 in triton::server::grpc::InferHandlerState<grpc::ServerAsyncResponseWriter<inference::ModelInferResponse>, inference::ModelInferRequest, inference::ModelInferResponse>::Context::IsCancelled (this=0x0) at /workspace/src/grpc/infer_handler.h:669
#1 0x0000555555715e48 in triton::server::grpc::InferHandlerState<grpc::ServerAsyncResponseWriter<inference::ModelInferResponse>, inference::ModelInferRequest, inference::ModelInferResponse>::IsGrpcContextCancelled (this=0x555564d52600) at /workspace/src/grpc/infer_handler.h:1034
#2 0x00005555557100c1 in triton::server::grpc::ModelInferHandler::Process (this=0x5555570c2800, state=0x555564d52600, rpc_ok=true) at /workspace/src/grpc/infer_handler.cc:696
#3 0x00005555556f638d in _ZZN6triton6server4grpc12InferHandlerIN9inference20GRPCInferenceService26WithAsyncMethod_ServerLiveINS4_27WithAsyncMethod_ServerReadyINS4_26WithAsyncMethod_ModelReadyINS4_30WithAsyncMethod_ServerMetadataINS4_29WithAsyncMethod_ModelMetadataINS4_26WithAsyncMethod_ModelInferINS4_32WithAsyncMethod_ModelStreamInferINS4_27WithAsyncMethod_ModelConfigINS4_31WithAsyncMethod_ModelStatisticsINS4_31WithAsyncMethod_RepositoryIndexINS4_35WithAsyncMethod_RepositoryModelLoadINS4_37WithAsyncMethod_RepositoryModelUnloadINS4_40WithAsyncMethod_SystemSharedMemoryStatusINS4_42WithAsyncMethod_SystemSharedMemoryRegisterINS4_44WithAsyncMethod_SystemSharedMemoryUnregisterINS4_38WithAsyncMethod_CudaSharedMemoryStatusINS4_40WithAsyncMethod_CudaSharedMemoryRegisterINS4_42WithAsyncMethod_CudaSharedMemoryUnregisterINS4_28WithAsyncMethod_TraceSettingINS4_27WithAsyncMethod_LogSettingsINS4_7ServiceEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN4grpc25ServerAsyncResponseWriterINS3_18ModelInferResponseEEENS3_17ModelInferRequestES1C_E5StartEvENKUlvE_clEv (__closure=0x555557081168) at /workspace/src/grpc/infer_handler.h:1310
#4 0x000055555570b2a3 in _ZSt13__invoke_implIvZN6triton6server4grpc12InferHandlerIN9inference20GRPCInferenceService26WithAsyncMethod_ServerLiveINS5_27WithAsyncMethod_ServerReadyINS5_26WithAsyncMethod_ModelReadyINS5_30WithAsyncMethod_ServerMetadataINS5_29WithAsyncMethod_ModelMetadataINS5_26WithAsyncMethod_ModelInferINS5_32WithAsyncMethod_ModelStreamInferINS5_27WithAsyncMethod_ModelConfigINS5_31WithAsyncMethod_ModelStatisticsINS5_31WithAsyncMethod_RepositoryIndexINS5_35WithAsyncMethod_RepositoryModelLoadINS5_37WithAsyncMethod_RepositoryModelUnloadINS5_40WithAsyncMethod_SystemSharedMemoryStatusINS5_42WithAsyncMethod_SystemSharedMemoryRegisterINS5_44WithAsyncMethod_SystemSharedMemoryUnregisterINS5_38WithAsyncMethod_CudaSharedMemoryStatusINS5_40WithAsyncMethod_CudaSharedMemoryRegisterINS5_42WithAsyncMethod_CudaSharedMemoryUnregisterINS5_28WithAsyncMethod_TraceSettingINS5_27WithAsyncMethod_LogSettingsINS5_7ServiceEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN4grpc25ServerAsyncResponseWriterINS4_18ModelInferResponseEEENS4_17ModelInferRequestES1D_E5StartEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_ (__f=...) at /usr/include/c++/11/bits/invoke.h:61
#5 0x000055555570b1ff in _ZSt8__invokeIZN6triton6server4grpc12InferHandlerIN9inference20GRPCInferenceService26WithAsyncMethod_ServerLiveINS5_27WithAsyncMethod_ServerReadyINS5_26WithAsyncMethod_ModelReadyINS5_30WithAsyncMethod_ServerMetadataINS5_29WithAsyncMethod_ModelMetadataINS5_26WithAsyncMethod_ModelInferINS5_32WithAsyncMethod_ModelStreamInferINS5_27WithAsyncMethod_ModelConfigINS5_31WithAsyncMethod_ModelStatisticsINS5_31WithAsyncMethod_RepositoryIndexINS5_35WithAsyncMethod_RepositoryModelLoadINS5_37WithAsyncMethod_RepositoryModelUnloadINS5_40WithAsyncMethod_SystemSharedMemoryStatusINS5_42WithAsyncMethod_SystemSharedMemoryRegisterINS5_44WithAsyncMethod_SystemSharedMemoryUnregisterINS5_38WithAsyncMethod_CudaSharedMemoryStatusINS5_40WithAsyncMethod_CudaSharedMemoryRegisterINS5_42WithAsyncMethod_CudaSharedMemoryUnregisterINS5_28WithAsyncMethod_TraceSettingINS5_27WithAsyncMethod_LogSettingsINS5_7ServiceEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN4grpc25ServerAsyncResponseWriterINS4_18ModelInferResponseEEENS4_17ModelInferRequestES1D_E5StartEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS1J_DpOS1K_ (__fn=...) at /usr/include/c++/11/bits/invoke.h:96
#6 0x000055555570b170 in _ZNSt6thread8_InvokerISt5tupleIJZN6triton6server4grpc12InferHandlerIN9inference20GRPCInferenceService26WithAsyncMethod_ServerLiveINS7_27WithAsyncMethod_ServerReadyINS7_26WithAsyncMethod_ModelReadyINS7_30WithAsyncMethod_ServerMetadataINS7_29WithAsyncMethod_ModelMetadataINS7_26WithAsyncMethod_ModelInferINS7_32WithAsyncMethod_ModelStreamInferINS7_27WithAsyncMethod_ModelConfigINS7_31WithAsyncMethod_ModelStatisticsINS7_31WithAsyncMethod_RepositoryIndexINS7_35WithAsyncMethod_RepositoryModelLoadINS7_37WithAsyncMethod_RepositoryModelUnloadINS7_40WithAsyncMethod_SystemSharedMemoryStatusINS7_42WithAsyncMethod_SystemSharedMemoryRegisterINS7_44WithAsyncMethod_SystemSharedMemoryUnregisterINS7_38WithAsyncMethod_CudaSharedMemoryStatusINS7_40WithAsyncMethod_CudaSharedMemoryRegisterINS7_42WithAsyncMethod_CudaSharedMemoryUnregisterINS7_28WithAsyncMethod_TraceSettingINS7_27WithAsyncMethod_LogSettingsINS7_7ServiceEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN4grpc25ServerAsyncResponseWriterINS6_18ModelInferResponseEEENS6_17ModelInferRequestES1F_E5StartEvEUlvE_EEE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE (this=0x555557081168) at /usr/include/c++/11/bits/std_thread.h:259
#7 0x000055555570b120 in _ZNSt6thread8_InvokerISt5tupleIJZN6triton6server4grpc12InferHandlerIN9inference20GRPCInferenceService26WithAsyncMethod_ServerLiveINS7_27WithAsyncMethod_ServerReadyINS7_26WithAsyncMethod_ModelReadyINS7_30WithAsyncMethod_ServerMetadataINS7_29WithAsyncMethod_ModelMetadataINS7_26WithAsyncMethod_ModelInferINS7_32WithAsyncMethod_ModelStreamInferINS7_27WithAsyncMethod_ModelConfigINS7_31WithAsyncMethod_ModelStatisticsINS7_31WithAsyncMethod_RepositoryIndexINS7_35WithAsyncMethod_RepositoryModelLoadINS7_37WithAsyncMethod_RepositoryModelUnloadINS7_40WithAsyncMethod_SystemSharedMemoryStatusINS7_42WithAsyncMethod_SystemSharedMemoryRegisterINS7_44WithAsyncMethod_SystemSharedMemoryUnregisterINS7_38WithAsyncMethod_CudaSharedMemoryStatusINS7_40WithAsyncMethod_CudaSharedMemoryRegisterINS7_42WithAsyncMethod_CudaSharedMemoryUnregisterINS7_28WithAsyncMethod_TraceSettingINS7_27WithAsyncMethod_LogSettingsINS7_7ServiceEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN4grpc25ServerAsyncResponseWriterINS6_18ModelInferResponseEEENS6_17ModelInferRequestES1F_E5StartEvEUlvE_EEEclEv (this=0x555557081168) at /usr/include/c++/11/bits/std_thread.h:266
#8 0x000055555570b0dc in _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN6triton6server4grpc12InferHandlerIN9inference20GRPCInferenceService26WithAsyncMethod_ServerLiveINS8_27WithAsyncMethod_ServerReadyINS8_26WithAsyncMethod_ModelReadyINS8_30WithAsyncMethod_ServerMetadataINS8_29WithAsyncMethod_ModelMetadataINS8_26WithAsyncMethod_ModelInferINS8_32WithAsyncMethod_ModelStreamInferINS8_27WithAsyncMethod_ModelConfigINS8_31WithAsyncMethod_ModelStatisticsINS8_31WithAsyncMethod_RepositoryIndexINS8_35WithAsyncMethod_RepositoryModelLoadINS8_37WithAsyncMethod_RepositoryModelUnloadINS8_40WithAsyncMethod_SystemSharedMemoryStatusINS8_42WithAsyncMethod_SystemSharedMemoryRegisterINS8_44WithAsyncMethod_SystemSharedMemoryUnregisterINS8_38WithAsyncMethod_CudaSharedMemoryStatusINS8_40WithAsyncMethod_CudaSharedMemoryRegisterINS8_42WithAsyncMethod_CudaSharedMemoryUnregisterINS8_28WithAsyncMethod_TraceSettingINS8_27WithAsyncMethod_LogSettingsINS8_7ServiceEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN4grpc25ServerAsyncResponseWriterINS7_18ModelInferResponseEEENS7_17ModelInferRequestES1G_E5StartEvEUlvE_EEEEE6_M_runEv (this=0x555557081160) at /usr/include/c++/11/bits/std_thread.h:211
#9 0x00007ffff6974253 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ffff6703ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#11 0x00007ffff6795850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
The model configuration can be seen here. The model is trained with Tensorflow 2.13 and is a saved_model.pb artifact.
Expected behavior No Segfaults or server crashes
As we can see, the issue starts in the gRPC InferHandlerState and goes deeper into the Triton code, which I am trying to study myself. I thought I would raise this issue here as it seems major, and I would like to get more eyes on it from the Triton community.
Please let me know if you need any more information from my end as well.
Thanks
On a quick study of the stack trace, this could also be related to a low context timeout (10 milliseconds), seen above as Context::IsCancelled. I will run some experiments to confirm the behavior.
I've confirmed that the issue was request cancellation. In production, we have varying timeouts per inference request. One particular set of requests had the timeout set in the range of 1-4 ms for end-to-end inference. This caused the segmentation fault, and increasing the timeout resolved the issue.
I was also reading up on this, and it seems the request cancellation feature is still under development and is currently only supported for the gRPC Python client, as seen here and here. We, by contrast, use the gRPC Golang client.
@tanmayv25 this may be of interest to you as I see you have already worked on this part of the code here.
cc: @Tabrizian @dyastremsky @rmccorm4 as well
Let me know if you need any more details here.
For now, we are working around this by wrapping the call in a Goroutine with a timeout and not setting a timeout on the inference request itself.
Thanks
Thanks @AshwinAmbal for digging into this and sharing results of your experimentation.
So, if the timeout value is very small then you were running into this segfault? You are not sending request cancellation explicitly from the client right? Would you mind sharing your model execution time and rest of the latency breakdown?
Can you update the title of this issue to reflect the current issue?
Hi @tanmayv25,
So, if the timeout value is very small then you were running into this segfault? You are not sending request cancellation explicitly from the client right?
Yes. Low context timeouts sent from the client over gRPC cause the segfault. We were triggering request cancellation by setting the context timeout as done here, except that at times our context timeouts can be as low as 1-4 ms, which causes the segfault.
Hence, to work around this issue, we have created Goroutines which send the inference request to Triton with a high (or no) context timeout, while the calling routine enforces the timeout we expect for the request. If that timeout is reached (1-4 ms), the main routine returns without waiting for the Goroutine to finish, while the Goroutine itself completes only after the inference response is received from Triton.
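For illustration, a minimal sketch of the call pattern that triggers this on our side, assuming the same hypothetical generated inference package as in the earlier sketch and the usual context/time imports; client and req stand for a prepared stub and request:
// Hedged sketch: a very short end-to-end deadline on the gRPC context used
// for ModelInfer. When the deadline expires, gRPC cancels the in-flight RPC.
func inferWithShortDeadline(client inference.GRPCInferenceServiceClient,
    req *inference.ModelInferRequest) (*inference.ModelInferResponse, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Millisecond) // 1-4 ms budget
    defer cancel()
    return client.ModelInfer(ctx, req)
}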
For example, pseudo code for the main routine is as follows:
func getPrediction(client grpcInferenceClient, req *msg.ModelInferRequest, timeout time.Duration) (*ResultType, error) {
    resChan := make(chan *ResultType, 1)
    go func() {
        // <===== High timeout or no timeout on the gRPC context
        res := client.ModelInfer(context.Background(), req)
        resChan <- res
    }()
    t := time.NewTimer(timeout) // <===== Goroutine timeout of 1 - 4 ms
    defer t.Stop()
    select {
    case r := <-resChan:
        // process result of model inference and return
        return r, nil
    case <-t.C:
        return nil, errors.New("triton inference timed out")
    }
}
Please also note that we are using Triton with CPU only for inference at this point.
Would you mind sharing your model execution time and rest of the latency breakdown?
The Average Inference Request Duration for the model is 1.04 ms as reported by Triton (nv_inference_request_duration_us / nv_inference_count).
The E2E Inference Request Duration reported by the client for this particular model [including network RTT] is as follows:
Avg: 1.81 ms
p50: 1.74 ms
p95: 2.77 ms
p99: 3.13 ms
Can you update the title of this issue to reflect the current issue?
I believe the issue is Request Cancellation Timeout being low. I will update the title accordingly.
Let me know if you need any more details.
Thanks
Hi @AshwinAmbal, can you reproduce this issue with the 24.06 release? I think this change from @oandreeva-nv may possibly help the issue you're observing: https://github.com/triton-inference-server/server/pull/7325.
@rmccorm4 thanks for the response. We'll have a look and publish our findings here.
Hi @rmccorm4, I reproduced the issue with the 24.06 release and the new TFDF library 1.9.1. Triton crashed with Signal (11) and Signal (6) when the timeout was 1 ms; it functioned well with bigger timeouts. These are the logs:
{"log":"Signal (6) received.","stream":"stderr","time":"2024-07-12T21:01:04.293670304Z"}
{"log":"Signal (11) received.","stream":"stderr","time":"2024-07-12T20:56:41.876752986Z"}
@AshwinAmbal, @Estevefact, if possible, could you please share the issue reproduction model or model generation script, config.pbtxt, and client? This will help us quickly reproduce and investigate the issue.
Thank you.
@pskiran1 I've attached the client code and config.pbtxt in the issue description already. I have also shared the latency numbers with @tanmayv25 above.
About the model, we believe the issue isn't model dependent and can be reproduced with any ML model hosted in Triton when requests use a low context cancellation timeout. Unfortunately, due to privacy reasons, we will not be able to share the trained model artifact at this time, and it may be a lengthy process to get approval from our end.
The only difference from a normal client is that we sometimes set the context cancellation timeout as low as 1 ms, and we notice the segfault happens when this is done.
Let me know if you need any more details to help reproduce this on your end. I've also attached the debug trace from GDB in the issue description for your perusal.
Hi @AshwinAmbal,
I have created the sample issue reproduction model and client (Python gRPC) using the information you provided. When I executed the client with a very low client_timeout, it worked fine as expected, and I was unable to reproduce the segfault. Please let us know if we are missing something here.
@pskiran1 I believe there is a difference between the Python gRPC client and Golang gRPC client. We use the Golang client with gRPC. Can you try reproducing the issue with the code (Golang) given by us?
It might also be worth running the Triton server on a remote host rather than localhost, as there is very little network latency when running it locally.
@AshwinAmbal, ideally, since the segfault is happening on the server side, we should be able to reproduce the issue using the Python client as well. However, I attempted to reproduce the error using a Go client too, but unfortunately I was unable to reproduce it. I am currently investigating how we can reproduce this issue and analyzing the backtrace logs that were shared. Please feel free to let us know if we are missing something. If you can provide a minimal issue reproduction, that would be greatly appreciated. Also, I am using a remote server and a client-side timeout of 4000 milliseconds.
FLAGS: {global_dnn 10.117.3.165:8001}
2024/07/22 16:00:03 Error processing InferRequest: rpc error: code = DeadlineExceeded desc = context deadline exceeded
exit status 1
Note: In the Go client, I used only float32 inputs.
CC: @tanmayv25
@pskiran1 I'll look into the code you shared and see if I can find something. But at first glance, your timeout is not small enough (4000 ms). Can you try setting it to a lower timeout between 1 ms and 4 ms and hit Triton with multiple similar requests at the same time?
@AshwinAmbal, sorry for the typo, I was trying with 4ms timeout and also multiple requests.
@pskiran1 can you set the timeout lower to 1 ms instead and test one last time before I dig into this?
Yes, @AshwinAmbal, I verified with 1ms. However, if we keep it 1ms, the request times out before reaching the server. On the same host, it reaches the server, but the segfault is not reproducible.
@pskiran1 Can you send requests in a loop with an increasing value of timeout, from 1 ms to 4 or 5 ms with an increment of 0.1 ms, just to see if the segfault occurs for a special case? Also, I would advise looking at any known gRPC issues that might describe this behavior.
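For instance, a rough Go sketch of such a sweep, reusing the hypothetical client and req from the earlier sketches and assuming context, log, sync, and time are imported:
// Sweep the per-request deadline from 1 ms to 5 ms in 0.1 ms steps, sending a
// small burst of concurrent requests at each step to probe for the crash.
func sweepTimeouts(client inference.GRPCInferenceServiceClient, req *inference.ModelInferRequest) {
    for d := 1 * time.Millisecond; d <= 5*time.Millisecond; d += 100 * time.Microsecond {
        var wg sync.WaitGroup
        for i := 0; i < 8; i++ { // burst size is illustrative
            wg.Add(1)
            go func() {
                defer wg.Done()
                ctx, cancel := context.WithTimeout(context.Background(), d)
                defer cancel()
                if _, err := client.ModelInfer(ctx, req); err != nil {
                    log.Printf("timeout=%v err=%v", d, err) // DeadlineExceeded is expected
                }
            }()
        }
        wg.Wait()
    }
}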
@pskiran1 @tanmayv25 Let me try to get someone from my team to work on this while I am away on holidays. Leaving a comment here so that we don't close this issue due to inactivity :)
We are also seeing the same issue. Is there any resolution for this?
@AshwinAmbal @tanmayv25 @pskiran1 Here are the conditions under which the issue was reproduced. Could you please check if the same error can be reproduced under these conditions on your side as well?
- triton version: nvcr.io/nvidia/tritonserver:24.08-py3
- Model: Using the example model from here
- Client: Modified the example from here; the inference timeout is set to 1 ms, at 300 requests per second, for 100 seconds (a rough sketch of this load pattern follows the server arguments below)
- the arguments for the tritonserver
gdb --args /opt/tritonserver/bin/tritonserver --model-store=/models \
--model-control-mode=explicit \
--log-verbose=0 \
--exit-on-error=false \
--metrics-config summary_latencies=true \
--metrics-config summary_quantiles="0.5:0.05,0.95:0.001,0.99:0.001" \
--strict-readiness=false \
--allow-cpu-metrics=true
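As referenced in the client bullet above, a rough Go sketch of that load pattern (about 300 requests per second for 100 seconds, each with a 1 ms deadline), again using the hypothetical client and req from the earlier sketches and assuming context and time are imported:
// Fire ~300 requests/second for 100 seconds, each with a 1 ms deadline.
func hammer(client inference.GRPCInferenceServiceClient, req *inference.ModelInferRequest) {
    ticker := time.NewTicker(time.Second / 300) // ~300 rps
    defer ticker.Stop()
    stop := time.After(100 * time.Second)
    for {
        select {
        case <-stop:
            return
        case <-ticker.C:
            go func() {
                ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
                defer cancel()
                _, _ = client.ModelInfer(ctx, req) // DeadlineExceeded errors are expected
            }()
        }
    }
}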
Results:
I1021 04:57:40.024634 1 model_lifecycle.cc:839] "successfully loaded 'simple'"
Signal (11) received.
0# 0x00005B0EDE4D983D in tritonserver
1# 0x000076E16C537520 in /lib/x86_64-linux-gnu/libc.so.6
2# pthread_mutex_lock in /lib/x86_64-linux-gnu/libc.so.6
3# 0x00005B0EDE53A56B in tritonserver
4# 0x00005B0EDE53034B in tritonserver
5# 0x000076E16E4B0253 in /lib/x86_64-linux-gnu/libstdc++.so.6
6# 0x000076E16C589AC3 in /lib/x86_64-linux-gnu/libc.so.6
7# clone in /lib/x86_64-linux-gnu/libc.so.6
Thread 39 "tritonserver" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fca2cc00000 (LWP 586)]
___pthread_mutex_lock (mutex=0x20) at ./nptl/pthread_mutex_lock.c:80
80 ./nptl/pthread_mutex_lock.c: No such file or directory.
(gdb) bt
#0 ___pthread_mutex_lock (mutex=0x20) at ./nptl/pthread_mutex_lock.c:80
#1 0x00005ed51d52256b in triton::server::grpc::ModelInferHandler::Process(triton::server::grpc::InferHandlerState<grpc::ServerAsyncResponseWriter<inference::ModelInferResponse>, inference::ModelInferRequest, inference::ModelInferResponse>*, bool) ()
#2 0x00005ed51d51834b in _ZZN6triton6server4grpc12InferHandlerIN9inference20GRPCInferenceService26WithAsyncMethod_ServerLiveINS4_27WithAsyncMethod_ServerReadyINS4_26WithAsyncMethod_ModelReadyINS4_30WithAsyncMethod_ServerMetadataINS4_29WithAsyncMethod_ModelMetadataINS4_26WithAsyncMethod_ModelInferINS4_32WithAsyncMethod_ModelStreamInferINS4_27WithAsyncMethod_ModelConfigINS4_31WithAsyncMethod_ModelStatisticsINS4_31WithAsyncMethod_RepositoryIndexINS4_35WithAsyncMethod_RepositoryModelLoadINS4_37WithAsyncMethod_RepositoryModelUnloadINS4_40WithAsyncMethod_SystemSharedMemoryStatusINS4_42WithAsyncMethod_SystemSharedMemoryRegisterINS4_44WithAsyncMethod_SystemSharedMemoryUnregisterINS4_38WithAsyncMethod_CudaSharedMemoryStatusINS4_40WithAsyncMethod_CudaSharedMemoryRegisterINS4_42WithAsyncMethod_CudaSharedMemoryUnregisterINS4_28WithAsyncMethod_TraceSettingINS4_27WithAsyncMethod_LogSettingsINS4_7ServiceEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN4grpc25ServerAsyncResponseWriterINS3_18ModelInferResponseEEENS3_17ModelInferRequestES1C_E5StartEvENKUlvE_clEv ()
#3 0x00007fca604b0253 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007fca5e589ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#5 0x00007fca5e61aa04 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
We are currently experiencing the same issue. In our case, once we start receiving cancel requests, our pods begin restarting. This creates a domino effect: latency increases, leading to more cancel requests, and eventually the entire cluster goes down.
To mitigate this, we implemented two solutions to reduce cancel requests from the client side:
- We added a timeout for the queue to reject requests before reaching the client (default_timeout_microseconds); a sketch of this setting follows below. However, this alone didn't fully resolve the issue, as it seems Triton still attempts to cancel some requests even though they have already been cancelled.
- We increased the client timeout to exceed the p99 latency, aiming to avoid cancel requests altogether.
These two configurations have solved the problem for now.
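For reference, a hedged sketch of the queue-timeout part of such a config.pbtxt, assuming the model uses dynamic batching; the field values are illustrative, not our production settings:
dynamic_batching {
  default_queue_policy {
    # Reject requests that have waited in the queue longer than this budget.
    timeout_action: REJECT
    default_timeout_microseconds: 5000  # illustrative value
  }
}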
triton version: nvcr.io/nvidia/tritonserver:24.08-py3
I thought one possible cause could be that the server sends a response and deletes the inference object, but just before the client receives the response, a timeout occurs, causing the client to send a cancellation request. As a result, on the server side, after responding and cleaning up the object, a call to the cancellation function leads to an exception.
Thanks @ludwings0330, we could reproduce this issue intermittently using the simple model and repeatedly running simple_grpc_async_infer_client.py.
We have prioritized this issue and are working on a fix.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `tritonserver --model-repository=./model_repository/ --load-model=simple --log-v'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 ___pthread_mutex_lock (mutex=0x20) at ./nptl/pthread_mutex_lock.c:80
80 ./nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x7f12dbfff000 (LWP 223))]
(gdb) bt
#0 ___pthread_mutex_lock (mutex=0x20) at ./nptl/pthread_mutex_lock.c:80
#1 0x000055f2c59cd7e7 in __gthread_mutex_lock (__mutex=0x20) at /usr/include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749
#2 0x000055f2c5a04090 in __gthread_recursive_mutex_lock (__mutex=0x20) at /usr/include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:811
#3 0x000055f2c5a04ee2 in std::recursive_mutex::lock (this=0x20) at /usr/include/c++/11/mutex:108
#4 0x000055f2c5a5426e in std::lock_guard<std::recursive_mutex>::lock_guard (this=0x7f12dbffb2e0, __m=...) at /usr/include/c++/11/bits/std_mutex.h:229
#5 0x000055f2c5a6ba25 in triton::server::grpc::InferHandlerState<grpc::ServerAsyncResponseWriter<inference::ModelInferResponse>, inference::ModelInferRequest, inference::ModelInferResponse>::Context::IsCancelled (this=0x0) at /workspace/src/grpc/infer_handler.h:677
#6 0x000055f2c5a6799c in triton::server::grpc::InferHandlerState<grpc::ServerAsyncResponseWriter<inference::ModelInferResponse>, inference::ModelInferRequest, inference::ModelInferResponse>::IsGrpcContextCancelled (this=0x7f12dc4764f0) at /workspace/src/grpc/infer_handler.h:1086
#7 0x000055f2c5a6147b in triton::server::grpc::ModelInferHandler::Process (this=0x55f2c8b350b0, state=0x7f12dc4764f0, rpc_ok=true)
at /workspace/src/grpc/infer_handler.cc:701
#8 0x000055f2c5a48131 in _ZZN6triton6server4grpc12InferHandlerIN9inference20GRPCInferenceService26WithAsyncMethod_ServerLiveINS4_27WithAsyncMethod_ServerReadyINS4_26WithAsyncMethod_ModelReadyINS4_30WithAsyncMethod_ServerMetadataINS4_29WithAsyncMethod_ModelMetadataINS4_26WithAsyncMethod_ModelInferINS4_32WithAsyncMethod_ModelStreamInferINS4_27WithAsyncMethod_ModelConfigINS4_31WithAsyncMethod_ModelStatisticsINS4_31WithAsyncMethod_RepositoryIndexINS4_35WithAsyncMethod_RepositoryModelLoadINS4_37WithAsyncMethod_RepositoryModelUnloadINS4_40WithAsyncMethod_SystemSharedMemoryStatusINS4_42WithAsyncMethod_SystemSharedMemoryRegisterINS4_44WithAsyncMethod_SystemSharedMemoryUnregisterINS4_38WithAsyncMethod_CudaSharedMemoryStatusINS4_40WithAsyncMethod_CudaSharedMemoryRegisterINS4_42WithAsyncMethod_CudaSharedMemoryUnregisterINS4_28WithAsyncMethod_TraceSettingINS4_27WithAsyncMethod_LogSettingsINS4_7ServiceEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN4grpc25ServerAsyncResponseWriterINS3_18ModelInferResponseEEENS3_17ModelInferRequestES1C_E5StartEvENKUlvE_clEv (__closure=0x55f2c8b304c8)
at /workspace/src/grpc/infer_handler.h:1379
We have encountered the same issue. Could somebody let me know if it is going to be fixed and, if so, which version of Triton server is expected to include the fix?
+1 same issue.
@yinggeh Could you please update this issue? I see you had a fix; was it released with 24.12, or will it be in 25.01?
@AshwinAmbal The bug has been fixed in release 24.12. Please deploy with the latest Triton container and let me know if there is still an issue. Thanks for the patience. cc @Estevefact @sushrutikhar @ludwings0330 @topuzm15 @ilja2209 @DZADSL72-00558