RPC: failed send_buffer() hangs the connection
If the call to connection::send_buffer is artificially aborted with an exception, the send_helper doesn't wake up. E.g. this patch
--- a/src/rpc/rpc.cc
+++ b/src/rpc/rpc.cc
@@ -2,6 +2,7 @@
 #include <seastar/core/align.hh>
 #include <seastar/core/seastar.hh>
 #include <seastar/core/print.hh>
+#include <seastar/core/sleep.hh>
 #include <seastar/core/future-util.hh>
 #include <boost/range/adaptor/map.hpp>
@@ -159,6 +160,13 @@ namespace rpc {
                 d.buf.size -= 8;
             }
         }
+        if (QueueType == connection::outgoing_queue_type::request) {
+            return seastar::sleep(std::chrono::seconds(1)).then([d = std::move(d)] {
+                fmt::print("fail sending\n");
+                return make_exception_future<>(std::runtime_error("fail"));
+            });
+        }
+
         d.buf = compress(std::move(d.buf));
         auto f = send_buffer(std::move(d.buf)).then([this] {
             _stats.sent_messages++;
applied to the pre-c437920c23aac97d30fdba1e1c6554f15d98eb5a code triggers this. On master it hangs even without the sleep; the old code needs the sleep to let client::client step into read_response_frame_compressed and sleep there.

Found while debugging #1145
It is not a connection hang but an individual RPC call that hangs, no? And if the RPC is done with a timeout, the timeout will be triggered, right? Also, if the RPC call is cancelable it can be canceled. So the problem is that if a non-cancelable two-way RPC without a timeout fails to send data, it will wait forever for a reply. Correct?
> It is not a connection hang but an individual RPC call that hangs, no?

If you call send() again it will hit the if (_error) and resolve with exception, but send()'s future is not the only thing that's waited upon; send_helper also waits on wait_for_reply. Will it hang as well? I haven't checked yet, but I see no if (_error) checks in that path, so it probably will.
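To make this concrete, here is a minimal sketch of the hazard, not seastar's rpc code: one future fails immediately (standing in for a failed send_buffer()) while the other is never made ready (standing in for wait_for_reply() when the reply can never arrive), so anything that waits on both never resolves.

#include <seastar/core/app-template.hh>
#include <seastar/core/do_with.hh>
#include <seastar/core/future.hh>
#include <seastar/core/when_all.hh>
#include <stdexcept>

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        return seastar::do_with(seastar::promise<>(), [] (seastar::promise<>& reply) {
            // Stand-in for a failed send: resolves immediately with an exception.
            auto send = seastar::make_exception_future<>(std::runtime_error("send failed"));
            // when_all() waits for *both* futures; nobody ever sets `reply`,
            // so the combined future never resolves and the program hangs here,
            // which is the shape of the send_helper hang described above.
            return seastar::when_all(std::move(send), reply.get_future()).discard_result();
        });
    });
}

Nothing here is rpc-specific; it just shows why resolving send() with an exception is not enough while wait_for_reply() has no corresponding error path.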
> And if the RPC is done with a timeout, the timeout will be triggered, right?

The timeout will be triggered, and it would abort the single _outstanding, yes.

> So the problem is that if a non-cancelable two-way RPC without a timeout fails to send data, it will wait forever for a reply. Correct?

Correct, but that's not all. Again, I haven't checked deeper yet, but my concern is that client::client's main do_while loop doesn't resolve and the .then_wrapped() continuation (that clears _outstanding and wakes up send_helpers) doesn't happen.
> > It is not a connection hang but an individual RPC call that hangs, no?
>
> If you call send() again it will hit the if (_error) and resolve with exception,

Ah, I see now that if we fail to send for any reason we mark the connection as failed. We need to wake up the receiver side as well for it to see that the connection should be closed.

> but send()'s future is not the only thing that's waited upon; send_helper also waits on wait_for_reply. Will it hang as well? I haven't checked yet, but I see no if (_error) checks in that path, so it probably will.

wait_for_reply() waits for a reply, a timeout, or a cancelation.
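Until a proper fix lands, a caller can at least bound the wait from the outside. Below is a minimal sketch under assumptions: call_remote() is a hypothetical stand-in for a two-way rpc call whose reply never arrives; seastar::with_timeout makes the returned future resolve with timed_out_error instead of hanging, which is the same effect a per-call rpc timeout or cancellable gives closer to the source.

#include <seastar/core/app-template.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/timed_out_error.hh>
#include <seastar/core/with_timeout.hh>
#include <chrono>
#include <iostream>

// Hypothetical stand-in for a two-way rpc call whose reply never arrives.
static seastar::future<> call_remote() {
    return seastar::sleep(std::chrono::hours(24));
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        using namespace std::chrono_literals;
        auto deadline = seastar::lowres_clock::now() + 5s;
        // Bound the otherwise unbounded wait: after 5s this resolves with
        // seastar::timed_out_error instead of waiting for a reply forever.
        return seastar::with_timeout(deadline, call_remote())
            .handle_exception_type([] (seastar::timed_out_error&) {
                std::cout << "rpc call timed out instead of hanging\n";
            });
    });
}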
> > And if the RPC is done with a timeout, the timeout will be triggered, right?
>
> The timeout will be triggered, and it would abort the single _outstanding, yes.
>
> > So the problem is that if a non-cancelable two-way RPC without a timeout fails to send data, it will wait forever for a reply. Correct?
>
> Correct, but that's not all. Again, I haven't checked deeper yet, but my concern is that client::client's main do_while loop doesn't resolve and the .then_wrapped() continuation (that clears _outstanding and wakes up send_helpers) doesn't happen.

do_until checks _error, but it needs to be woken up if it sleeps on a receive.
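The "wake it up" point generalizes: setting an error flag is not enough while the other fiber is blocked in a receive. A minimal sketch of that pattern, not the actual rpc fix: seastar::abort_source and sleep_abortable stand in for the receive loop and the wake-up done by the failing send path.

#include <seastar/core/abort_source.hh>
#include <seastar/core/app-template.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/when_all.hh>
#include <chrono>
#include <iostream>

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        auto as = seastar::make_lw_shared<seastar::abort_source>();
        // Stand-in for the receive side: blocked until it is explicitly woken.
        auto receiver = seastar::sleep_abortable(std::chrono::hours(24), *as)
            .handle_exception_type([] (seastar::sleep_aborted&) {
                std::cout << "receiver woken, can now see the error and close\n";
            });
        // Stand-in for the failing send side: besides recording the error,
        // it actively wakes the receiver.
        auto sender = seastar::sleep(std::chrono::seconds(1)).then([as] {
            std::cout << "send failed, waking receiver\n";
            as->request_abort();
        });
        return seastar::when_all(std::move(receiver), std::move(sender))
            .discard_result()
            .finally([as] {});   // keep the abort_source alive until both finish
    });
}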
> do_until checks _error, but it needs to be woken up if it sleeps on a receive.
:+1:
Not only this. The batch-flush is also broken in this sense:
output_stream<CharType>::poll_flush() noexcept {
    ...
    // FIXME: future is discarded
    (void)f.then([this] {
        return _fd.flush();
    }).then_wrapped([this] (future<> f) {
        try {
            f.get();
        } catch (...) {
            _ex = std::current_exception();   <<<<< HERE
        }
        // if flush() was called while flushing flush once more
        poll_flush();
    });
}
If an exception pops up at $HERE, no RPC code gets notified; the send path resolves without an exception and the recv part continues waiting for the response frame.
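The same failure mode can be reproduced in isolation. The following is not seastar's output_stream code, just a minimal sketch of the pattern quoted above: a background flush whose future is discarded parks its exception in a member variable, the "send" side has long since resolved successfully, and nothing waiting for a reply ever learns about the error.

#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>
#include <memory>
#include <stdexcept>

struct batch_flusher {
    std::exception_ptr _ex;   // mirrors output_stream::_ex

    void poll_flush() {
        // The future is discarded, so by the time the flush fails the caller's
        // send path has already resolved successfully.
        (void)seastar::sleep(std::chrono::milliseconds(10)).then([] {
            return seastar::make_exception_future<>(std::runtime_error("flush failed"));
        }).handle_exception([this] (std::exception_ptr ep) {
            _ex = std::move(ep);   // parked here; nobody is notified
        });
    }
};

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        auto fl = std::make_unique<batch_flusher>();
        fl->poll_flush();
        std::cout << "send path resolved without an error\n";
        // Stand-in for the receive side waiting for a reply that cannot come,
        // because the flush error above was never propagated to it.
        return seastar::sleep(std::chrono::seconds(1)).then([fl = std::move(fl)] {
            if (fl->_ex) {
                std::cout << "the flush error was sitting in _ex all along\n";
            }
        });
    });
}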
@xemul any updates?
If we cannot fix the issue soon, let's add a warning to the log so we know if this happens in a deployment. We have seen an rpc response not being received for unknown reasons: https://github.com/scylladb/scylladb/issues/11354#issuecomment-1235161009.
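A sketch of what such a warning could look like. The logger name and the send_with_warning wrapper are hypothetical, not part of seastar; an actual patch would put the message directly in the rpc send path.

#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/util/log.hh>
#include <stdexcept>

static seastar::logger rpc_diag_log("rpc-diag");   // hypothetical logger name

// Hypothetical wrapper: report a failed send loudly instead of only recording
// it in internal connection state, then re-propagate the error.
static seastar::future<> send_with_warning(seastar::future<> send_fut) {
    return send_fut.handle_exception([] (std::exception_ptr ep) {
        try {
            std::rethrow_exception(ep);
        } catch (const std::exception& e) {
            rpc_diag_log.warn("rpc send failed, outstanding calls may hang: {}", e.what());
        } catch (...) {
            rpc_diag_log.warn("rpc send failed, outstanding calls may hang");
        }
        return seastar::make_exception_future<>(std::move(ep));
    });
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        // Demonstrate with a stand-in send future that fails immediately.
        return send_with_warning(seastar::make_exception_future<>(std::runtime_error("broken pipe")))
            .handle_exception([] (std::exception_ptr) { /* already reported above */ });
    });
}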
@asias, this issue turned out to be tied to #1150. I have a set of patches that fixes both, but the whole thing turned out to be more delicate than I expected and needs more care.
What about the "add warning logs first so we know this happens in practice" approach? It is very hard to debug such problems in production; knowing that the hang has happened is already very useful.
I think I'll send a fix for this particular issue without a test and without the #1150 fix.
Thanks. Can you share the link to the fix here to make it easier to track? ;-)