
generic_server::server::shutdown() isn't idempotent wrt. reentrancy

Open · lersek opened this issue 1 year ago

(Filing this from the thread started at https://github.com/scylladb/scylladb/pull/20212#discussion_r1732854202.)

@avikivity wrote,

A first shutdown() call will start tearing things down, wait for wheels to turn, and return when done. A second shutdown that is initiated before the first completes will return immediately, misinforming its caller. This can trigger if the user requests a shutdown (I think it's possible via nodetool disablebinary) and immediately afterwards the server is shut down. server::stop() will flash through server::shutdown() and then the shutdown process will hit use-after-free on its connections.

@bhalevy wrote,

the second shutdown() indeed has nothing to do. and then stop() would wait on the _gate.close() future set to _all_connections_stopped by the first shutdown() call.

@avikivity wrote,

We have shared_future for this.

Tasks:

  • try to trigger the race (deterministically? ... test case?)
  • attempt writing a fix with shared_future

Not sure how far this goes back (a bunch of git-blame commands could likely help); master is at 5c0f6d461321 as of this writing. PR #20212, while it may perturb things slightly, is supposed to neither fix nor cause this symptom.

lersek avatar Aug 27 '24 15:08 lersek

@avikivity @bhalevy @tchaikov

(1) I think that we should revert the last three commits from PR #20212, namely (in this order):

  1. 49bff3b1ab68 generic_server: make server::stop() idempotent
  2. 1138347e7eb3 generic_server: coroutinize server::shutdown()
  3. 2216275ebd6d generic_server: make server::shutdown() idempotent

For two reasons:

  • the coroutinization patch breaks an interface contract,
  • the "idempotent" patches don't help with reentrancy, and I intend to solve that separately, and more broadly, with shared_future, under this ticket.

(2) Here's why IMO the coroutinization patch breaks the interface. generic_server.hh says, in class server:

    // Makes sure listening sockets no longer generate new connections and aborts the
    // connected sockets, so that new requests are not served and existing requests don't
    // send responses back.
    //
    // It does _not_ wait for any internal activity started by the established connections
    // to finish. It's the .stop() method that does it
    future<> shutdown();
    future<> stop();

The point is that shutdown() must not block the fiber that calls it. We added a bunch of co_awaits in commit 1138347e7eb3, and that breaks the interface. While those co_awaits will certainly not block the heavy-weight (POSIX) reactor thread of the shard, they will block the fiber that calls shutdown().

This is why we should revert the coroutinization. The responsibility of shutdown() is just to create a graph of futures -- which the reactor will gradually process in the background. Any waiting for that graph of futures belongs in stop().
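
As a minimal sketch of the distinction (hypothetical free functions; "connection" stands for generic_server::connection and "conns" for server::_connections_list, so this is not the actual code):

    // Variant A (coroutinized): the calling fiber suspends here until every
    // connection has finished shutting down.
    future<> shutdown_waits(std::list<connection>& conns) {
        co_await parallel_for_each(conns, [] (connection& c) {
            return c.shutdown();
        });
    }

    // Variant B (the pre-coroutinization contract): only chain the futures and
    // return; the reactor resolves them in the background, and stop() is the
    // place that eventually waits on them.
    future<> shutdown_returns(std::list<connection>& conns, future<>& all_connections_stopped) {
        all_connections_stopped = parallel_for_each(conns, [] (connection& c) {
            return c.shutdown();
        });
        return make_ready_future<>();
    }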

(3) The complete graph of futures, for shutting down the server, should look like this:

                              when_all
                             /        \
                     [depends on]   [depends on]
                           /            \
             _gate is empty              _listeners_stopped
                   |
               [depends on]
                   |
  parallel_for_each(_connections_list)
       /           |            \
[depends on]  [depends on]   [depends on]
     /             |              \
c.shutdown()  c.shutdown()         c.shutdown()

I have several notes about this graph; below.

(4) _listeners_stopped is actually a "maximally unbalanced" binary tree of when_all futures, set up by the sequence of listen() calls. The graph omits this "ladder" for brevity; what's important is that _listeners_stopped is always the root of the ladder.
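
For reference, the ladder is built roughly like this (a simplified sketch of listen(); the exact signatures are assumptions based on this discussion, not the real code):

    // Each listen() call chains the new listener's accept loop onto the
    // previously accumulated future, producing the unbalanced "ladder":
    //     _listeners_stopped = when_all(<old _listeners_stopped>, do_accepts(...))
    void server::listen_sketch(socket_address addr, listen_options lo) {
        _listeners.emplace_back(seastar::listen(addr, lo));
        _listeners_stopped = when_all(
            std::move(_listeners_stopped),
            do_accepts(_listeners.size() - 1)
        ).discard_result();
    }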

(5) The _all_connections_stopped future (which currently saves the return value of _gate.close()) will no longer need to be maintained as an explicit field; it can just exist as a future chained with .then() to the parallel_for_each. The idea is that the gate will never drain before all connections shut down, so moving the "_gate is empty" node right under the topmost when_all is useless.

(6) In the generic_server::connection base class, the generic_server::connection::shutdown() method is actually synchronous, therefore, parallel_for_each() decays to a normal (sequential) iteration in practice. (Their order is unspecified, but they will not overlap in effect.) However, with a derived connection class, multiple shutdowns might indeed be in flight at the same time.

(7) Replacing the _all_connections_stopped field (which currently stands for the gate having been drained), we need to introduce a shared_future field. The great thing about shared_future is that it can be in an invalid ("disengaged") state, which can be checked with the valid() method. And that disengaged state is what the default shared_future constructor creates.

Therefore, shutdown() is supposed to perform the following steps:

  • Check if the shared future is valid; if it is, return its get_future() immediately.
  • Otherwise:
    • close the gate
    • abort accept on all listeners
    • create the above-noted graph of futures
    • wrap it into the shared_future field (with a new shared_future object replacing the original disengaged object)
    • return get_future() from that object

None of these steps ever wait, so (a) reentrancy should be solved by them, (b) they unbreak the interface contract.
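
Here's a minimal sketch of what that could look like (the _shutdown_done field name is an invention for this sketch; error propagation and the existing bookkeeping are omitted):

    // Assumed new field, default-constructed, i.e. initially !valid():
    //     seastar::shared_future<> _shutdown_done;

    future<> server::shutdown() {
        if (_shutdown_done.valid()) {
            // A shutdown is already in flight (or has finished); every caller
            // gets a future that resolves when that same shutdown completes.
            return _shutdown_done.get_future();
        }
        // Close the gate and abort accept on all listeners; none of this waits.
        auto gate_closed = _gate.close();
        for (auto&& listener : _listeners) {
            listener.abort_accept();
        }
        // Build the graph of futures; the reactor resolves it in the background.
        auto connections_stopped = parallel_for_each(_connections_list, [] (connection& c) {
            return c.shutdown();
        }).then([gate_closed = std::move(gate_closed)] () mutable {
            // The gate can only drain after all the connections have stopped.
            return std::move(gate_closed);
        });
        auto graph = when_all(std::move(connections_stopped),
                              std::move(_listeners_stopped)).discard_result();
        // Wrap the graph in the shared_future so that later callers can join it.
        _shutdown_done = shared_future<>(std::move(graph));
        return _shutdown_done.get_future();
    }

    future<> server::stop() {
        // stop() is the only place that actually waits for the whole graph.
        co_await shutdown();
    }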

(8) The stop() method just co_awaits shutdown() -- that is, it waits until the entire graph resolves.

(9) Holding the gate in do_accepts() seems superfluous, as do_accepts() must resolve anyway as a pre-requisite for _listeners_stopped to resolve.

Furthermore, _gate.is_closed() in do_accepts() should be possible to replace with "is the shared future now valid?" as the shutdown signal.
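
Concretely, the change to the check would be roughly this (the stop_iteration framing is an assumption about the surrounding accept loop):

    // Before, inside the accept loop of do_accepts():
    if (_gate.is_closed()) {
        return make_ready_future<stop_iteration>(stop_iteration::yes);
    }

    // After: the shared_future doubles as the shutdown signal.
    if (_shutdown_done.valid()) {
        return make_ready_future<stop_iteration>(stop_iteration::yes);
    }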

(10) Replacing that check in turn raises the question of why _gate should exist at all. _gate's only remaining purpose is that there could be a time window where shutdown() may have resolved on every connection, but some connections may not have been destructed yet. I can't prove that this time window does not exist, though, so for safety's sake I guess I'll keep _gate.

(11) For simplicity's sake, the topmost node need not be a when_all; it can be a simple then too. However, when_all better reflects that the listeners (the do_accepts() repeats) and the connections shut down independently of each other.

Before I start coding this up: are you all OK with me undoing the coroutinization of shutdown()? (I think it's a must, because it now breaks the interface contract.)

lersek avatar Sep 02 '24 16:09 lersek

@avikivity @bhalevy @tchaikov

(1) I think that we should revert the last three commits from PR #20212, namely (in this order):

You mean the ones that were already nominated for backport?

avikivity avatar Sep 02 '24 17:09 avikivity

Before I start coding this up: are you all OK with me undoing the coroutinization of shutdown()? (I think it's a must, because it now breaks the interface contract.)

I don't understand how coroutinization breaks anything. Maybe you changed the semantics during the transform (usually a bad idea), but by itself coroutinization shouldn't have mattered.

avikivity avatar Sep 02 '24 17:09 avikivity

@avikivity @bhalevy @tchaikov (1) I think that we should revert the last three commits from PR #20212, namely (in this order):

You mean the ones that were already nominated for backport?

Yes, exactly. Those three commits that you had (independently) indicated shouldn't be backported. For this reason I marked #20349 and #20355 as drafts for now.

lersek avatar Sep 02 '24 17:09 lersek

how coroutinization breaks anything

Before commit 1138347e7eb3, server::shutdown() would not wait; in other words, it would not depend on any futures being resolved. Afterwards, server::shutdown() waits for all the connection shutdowns to complete (= until the "parallel for each" resolves), and also for _listeners_stopped to resolve. Functionally this is not a bug, but the API spec emphasizes that shutdown() never waits for these futures to resolve.

lersek avatar Sep 02 '24 17:09 lersek

Ping. I'd still like to get advice on this.

lersek avatar Sep 06 '24 21:09 lersek

how coroutinization breaks anything

Before commit 1138347, server::shutdown() would not wait; in other words, it would not depend on any futures being resolved. Afterwards, server::shutdown() waits for all the connection shutdowns to complete (= until the "parallel for each" resolves), and also for _listeners_stopped to resolve. Functionally this is not a bug, but the API spec emphasizes that shutdown() never waits for these futures to resolve.

@lersek I'm not sure where it's documented and how important it is. If the scope of the api is internal to scylla then we should just audit the callers. Now, what are we waiting for exactly?

Although the difference between shutdown() and stop() is not documented anywhere, typically shutdown() entails only a "signal" to shut down, while stop() (which may call shutdown()) waits for all async futures to resolve.

Sometimes sending a shutdown "signal" is async, as it needs to be invoked on a different shard or for other reasons, but waiting for it doesn't mean that we wait until the resource actually shuts down, but rather until it has processed the shutdown signal (irreversibly).

If waiting on the subordinate async shutdown calls is a problem, then we can keep the returned future and "join" it with the future returned by gate.close() using when_all, and let that be the future stop() would wait on, but I doubt it's a real issue...

bhalevy avatar Sep 07 '24 04:09 bhalevy

@bhalevy

@lersek I'm not sure where it's documented

Please see bullet (2) in my comment above. The specification in question is in generic_server.hh, class server, in the comment on shutdown():

    // It does _not_ wait for any internal activity started by the established connections
    // to finish. It's the .stop() method that does it

and how important it is. If the scope of the api is internal to scylla then we should just audit the callers.

There is one call site; cql_transport::controller::do_stop_server(), in transport/controller.cc. It uses sharded::invoke_on_all() to submit a shutdown() call to the cql_server service instance of each shard.

The future returned by sharded::invoke_on_all() is co_await-ed directly.

Subsequently, sharded::stop() is used to stop() and destroy each instance.
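
For reference, the call site is roughly this shape (a simplified sketch based on the description above, not a verbatim copy of transport/controller.cc; _server standing for the sharded<cql_server> is assumed):

    // Simplified sketch of cql_transport::controller::do_stop_server().
    future<> controller::do_stop_server() {
        // Submit shutdown() to the cql_server instance on every shard and wait
        // until all of those calls have completed.
        co_await _server.invoke_on_all([] (cql_server& s) {
            return s.shutdown();
        });
        // Then stop and destroy the per-shard instances.
        co_await _server.stop();
    }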

sharded::invoke_on_all is supposed to return a "future that becomes ready once all calls have completed". It is in turn based on parallel_for_each() and smp::submit_to().

Thus, while I think the coroutinization of generic_server::server::shutdown() breaks the API spec in generic_server.hh, I agree it should not be observable in practice. The change should not prevent sharded::invoke_on_all() from submitting the shutdown call on all shards concurrently, and the future returned by invoke_on_all() is awaited immediately, in cql_transport::controller::do_stop_server().

Sometimes sending a shutdown "signal" is async, as it needs to be invoked on a different shard or for other reasons, but waiting for it doesn't mean that we wait until the resource actually shuts down, but rather until it has processed the shutdown signal (irreversibly).

I think this is the core question. After the coroutinization of generic_server::server::shutdown(), that function will not return until all connections complete shutting down. Does this fact conflict with the language "does _not_ wait for any internal activity started by the established connections to finish"?

lersek avatar Sep 09 '24 10:09 lersek

Anyway, I'll try to write some code and submit some specifics to discuss at least. Thanks for the comments!

lersek avatar Sep 09 '24 10:09 lersek

A first shutdown() call will start tearing things down, wait for wheels to turn, and return when done. A second shutdown that is initiated before the first completes will return immediately, misinforming its caller. This can trigger if the user requests a shutdown (I think it's possible via nodetool disablebinary) and immediately afterwards the server is shut down. server::stop() will flash through server::shutdown() and then the shutdown process will hit use-after-free on its connections.

That doesn't happen. The nodetool disablebinary calls transport::controller::request_stop_server(), which locks the _ops_sem. If immediately afterwards the server is shut down, it calls transport::controller::stop_server(), which waits for the _ops_sem in the first place (then stops the server if it's not yet stopped), so it cannot flash through that nodetool request.
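
(The serialization pattern being described is roughly the following; this is only a sketch of the idea, with invented member names, not the actual controller code.)

    // _ops_sem: a semaphore with a single unit, so controller operations are
    // serialized; whoever obtains the unit first completes before the next starts.
    future<> controller::request_stop_server() {         // nodetool disablebinary path
        auto units = co_await get_units(_ops_sem, 1);
        co_await do_stop_server();
        _stopped = true;                                  // invented flag, for the sketch
    }

    future<> controller::stop_server() {                  // server shutdown path
        auto units = co_await get_units(_ops_sem, 1);     // waits for any in-flight stop
        if (!_stopped) {
            co_await do_stop_server();
            _stopped = true;
        }
    }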

xemul avatar Sep 09 '24 16:09 xemul

Should I close this, and #20498?

lersek avatar Sep 09 '24 16:09 lersek

This issue is about generic_server, while my comment was about cql::transport only. If this issue means that cql::transport is broken and should be fixed via generic_server, then yes, let's close it and #20498, but if there's some other intent then let's discuss it

xemul avatar Sep 09 '24 17:09 xemul

The only two classes derived from generic_server::server, to my understanding, are redis_server and cql_server; and I've not heard of redis_server coming up in any context yet. So yes, AFAICT, this was motivated by cql_server.

lersek avatar Sep 10 '24 05:09 lersek

I think this is the core question. After the coroutinization of generic_server::server::shutdown(), that function will not return until all connections complete shutting down. Does this fact conflict with the language "does _not_ wait for any internal activity started by the established connections to finish"?

It doesn't look like a problem in practice, so I'd leave it this way and adjust the documentation to say that it waits for connections to shut down, and that any residual async work left behind is waited for in stop().

bhalevy avatar Sep 10 '24 06:09 bhalevy

Thanks. I intend to close this and #20498 tomorrow. @avikivity is that OK with you?

lersek avatar Sep 10 '24 07:09 lersek