numaflow-python icon indicating copy to clipboard operation
numaflow-python copied to clipboard

Graceful shutdown of gRPC servers when there are exceptions in the User Code

Open kohlisid opened this issue 1 year ago • 7 comments

on returning exception from executor layer in the sync servers in some of the cases we were updating the gRPC context with errors with parallel running thread even after we shutdown the gRPC server. This updated the context , overriding the actual reason for restart. So we stopped returning errors from executor layer as part of this PR - https://github.com/numaproj/numaflow-python/pull/218

kohlisid avatar Oct 30 '24 19:10 kohlisid

Hey! I’m currently exploring gRPC as part of a project involving llama.cpp, and this issue looks like a great way to understand it better. If no one is working on it, may I take it?

sapkota-aayush avatar Jun 27 '25 23:06 sapkota-aayush

Hey! I’m currently exploring gRPC as part of a project involving llama.cpp, and this issue looks like a great way to understand it better. If no one is working on it, may I take it?

thank you, shall i assign this to you?

vigith avatar Jul 09 '25 23:07 vigith

Hey! I’m currently exploring gRPC as part of a project involving llama.cpp, and this issue looks like a great way to understand it better. If no one is working on it, may I take it?

thank you, shall i assign this to you?

Yes, please.

sapkota-aayush avatar Jul 09 '25 23:07 sapkota-aayush

Thanks @sapkota-aayush for this up! Like always please feel free to discuss and we can take it forward :)

kohlisid avatar Jul 10 '25 00:07 kohlisid

Hi @vigith and @kohlisid! I've been studying the race condition issue and I think i have understood the problem now. When multiple threads crash simultaneously, only the first error gets reported to the client while others are lost. I can see that PR #218 fixed async servers but sync servers still have the race condition in the exit_on_error() function. Where should I start? Should I focus on implementing graceful shutdown for sync servers first?

sapkota-aayush avatar Jul 12 '25 21:07 sapkota-aayush

Hey @sapkota-aayush Let's start with sync server flow first! And ensure consistency between both Then we can take it further

kohlisid avatar Jul 14 '25 00:07 kohlisid

Hey @sapkota-aayush Let's start with sync server flow first! And ensure consistency between both Then we can take it further

@kohlisid Perfect! I'll start with sync server flow first and ensure consistency. My plan: Fix SyncMapServicer with thread-safe shutdown control Apply same pattern to all other sync servers for consistency Test race condition scenarios I'll start with SyncMapServicer and then propagate the fix to maintain consistency across all sync servers before any additional enhancements. Does this sound good?

sapkota-aayush avatar Jul 15 '25 00:07 sapkota-aayush