
Warn if RPC server handlers are not exiting fast enough during stopping time

Open xemul opened this issue 2 years ago • 6 comments

When rpc::server stops, it aborts accept, stops connections, and waits for all of them to resolve their stopping futures. One of the stopping conditions is that the _reply_gate is closed. Inside this gate the server sends responses to messages, but the gate's enter-exit span also covers the message handler callback until it resolves.

While all the kicked-and-waited stuff is internal to RPC, handlers are external, and seastar has no way to ensure they finish in a timely manner; it can only assume that. The PR is supposed to facilitate debugging this assumption by counting handler callback entrances and exits (i.e. future resolutions) and warning if server stop doesn't finish within a minute, printing the number of handlers that have not yet exited.
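The counting idea can be sketched outside of Seastar roughly as follows. This is a minimal standalone illustration, not the PR's code: `handler_tracker`, `enter()`, `leave()` and `wait_idle()` are hypothetical names, and the mutex and condition variable are used only so the sketch compiles without Seastar (a real per-shard implementation would not need them).

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical stand-in for the server-side accounting; not Seastar's API.
class handler_tracker {
    std::mutex _mtx;
    std::condition_variable _cv;
    std::size_t _running = 0;   // handlers entered but not yet exited

public:
    // Call when a handler callback is entered.
    void enter() {
        std::lock_guard<std::mutex> lk(_mtx);
        ++_running;
    }

    // Call when the handler's future resolves.
    void leave() {
        std::lock_guard<std::mutex> lk(_mtx);
        assert(_running > 0);
        if (--_running == 0) {
            _cv.notify_all();
        }
    }

    // Wait up to `timeout` for all handlers to exit.
    // Returns the number of handlers still running (0 means a clean stop).
    std::size_t wait_idle(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(_mtx);
        _cv.wait_for(lk, timeout, [this] { return _running == 0; });
        return _running;
    }
};
```

A stop path built on this would call `wait_idle()` with the chosen timeout (one minute in this PR) and log a warning with the returned count if it is non-zero.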

A stuck server.stop() with zero running handlers is still an internal issue to resolve.

xemul avatar Apr 20 '23 14:04 xemul

@scylladb/seastar-maint , review ping

xemul avatar May 15 '23 16:05 xemul

@scylladb/scylla-maint , review ping

xemul avatar May 23 '23 08:05 xemul

upd:

  • print registered handlers with a non-zero use_gate counter to give a clue to which verbs are stuck

xemul avatar May 29 '23 05:05 xemul

upd:

  • make printed stuck ids decipherable by putting commas between numbers :facepalm:
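For illustration, printing the stuck verbs could look something like the sketch below. It assumes the per-handler counters are available as a map from verb id to in-flight count; `format_stuck_verbs` and the map layout are hypothetical, not the PR's actual data structures.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: `in_flight` maps a verb id to the number of
// handler invocations for that verb that have not exited yet.
// Returns the ids with a non-zero count, separated by commas.
std::string format_stuck_verbs(const std::map<std::uint64_t, std::size_t>& in_flight) {
    std::ostringstream out;
    bool first = true;
    for (const auto& entry : in_flight) {
        if (entry.second == 0) {
            continue;   // this verb has no running handlers
        }
        if (!first) {
            out << ",";
        }
        out << entry.first;
        first = false;
    }
    return out.str();
}
```

Without the separator, ids 7 and 12 would print as "712", which is the ambiguity the comma fix addresses.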

xemul avatar May 29 '23 12:05 xemul

@scylladb/seastar-maint , merge ping

xemul avatar Jun 05 '23 06:06 xemul

I don't know if 1 minute is a universally good metric. In debug mode things can take longer. Running scylladb's parallel aggregation can take more.

avikivity avatar Jun 05 '23 09:06 avikivity