numaflow-python
numaflow-python copied to clipboard
Graceful shutdown of gRPC servers when there are exceptions in the User Code
on returning exception from executor layer in the sync servers in some of the cases we were updating the gRPC context with errors with parallel running thread even after we shutdown the gRPC server. This updated the context , overriding the actual reason for restart. So we stopped returning errors from executor layer as part of this PR - https://github.com/numaproj/numaflow-python/pull/218
Hey! I’m currently exploring gRPC as part of a project involving llama.cpp, and this issue looks like a great way to understand it better. If no one is working on it, may I take it?
Hey! I’m currently exploring gRPC as part of a project involving llama.cpp, and this issue looks like a great way to understand it better. If no one is working on it, may I take it?
thank you, shall i assign this to you?
Hey! I’m currently exploring gRPC as part of a project involving llama.cpp, and this issue looks like a great way to understand it better. If no one is working on it, may I take it?
thank you, shall i assign this to you?
Yes, please.
Thanks @sapkota-aayush for this up! Like always please feel free to discuss and we can take it forward :)
Hi @vigith and @kohlisid! I've been studying the race condition issue and I think i have understood the problem now. When multiple threads crash simultaneously, only the first error gets reported to the client while others are lost. I can see that PR #218 fixed async servers but sync servers still have the race condition in the exit_on_error() function. Where should I start? Should I focus on implementing graceful shutdown for sync servers first?
Hey @sapkota-aayush Let's start with sync server flow first! And ensure consistency between both Then we can take it further
Hey @sapkota-aayush Let's start with sync server flow first! And ensure consistency between both Then we can take it further
@kohlisid Perfect! I'll start with sync server flow first and ensure consistency. My plan: Fix SyncMapServicer with thread-safe shutdown control Apply same pattern to all other sync servers for consistency Test race condition scenarios I'll start with SyncMapServicer and then propagate the fix to maintain consistency across all sync servers before any additional enhancements. Does this sound good?