RPC: failed send_buffer() hangs the connection
If the call to connection::send_buffer is artificially aborted with an exception, the send_helper doesn't wake up. E.g. this patch
--- a/src/rpc/rpc.cc
+++ b/src/rpc/rpc.cc
@@ -2,6 +2,7 @@
 #include <seastar/core/align.hh>
 #include <seastar/core/seastar.hh>
 #include <seastar/core/print.hh>
+#include <seastar/core/sleep.hh>
 #include <seastar/core/future-util.hh>
 #include <boost/range/adaptor/map.hpp>
@@ -159,6 +160,13 @@ namespace rpc {
                 d.buf.size -= 8;
             }
         }
+        if (QueueType == connection::outgoing_queue_type::request) {
+            return seastar::sleep(std::chrono::seconds(1)).then([d = std::move(d)] {
+                fmt::print("fail sending\n");
+                return make_exception_future<>(std::runtime_error("fail"));
+            });
+        }
+
         d.buf = compress(std::move(d.buf));
         auto f = send_buffer(std::move(d.buf)).then([this] {
             _stats.sent_messages++;
applied to the pre-c437920c23aac97d30fdba1e1c6554f15d98eb5a code triggers this. On master it hangs even without the sleep; the old code needs the sleep to let client::client step into read_response_frame_compressed and sleep there.

Found while debugging #1145
It is not a connection hang but an individual RPC call that hangs, no? And if the RPC is done with a timeout, the timeout will be triggered, right? Also, if the RPC call is cancelable it can be canceled. So the problem is that if a non-cancelable two-way RPC without a timeout fails to send data, it will wait forever for a reply. Correct?
> It is not a connection hang but an individual RPC call that hangs, no?

If you call send() again it will hit the if (_error) and resolve with exception, but send()'s future is not the only thing that's waited upon; send_helper also waits on wait_for_reply. Will it hang as well? I haven't checked yet, but I see no if (_error) checks in that path, so it probably will.
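To make this concrete, here is a minimal sketch of the hazard, not seastar's rpc code: one future fails immediately (standing in for a failed send_buffer()) while the other is never made ready (standing in for wait_for_reply() when the reply can never arrive), so anything that waits on both never resolves.

#include <seastar/core/app-template.hh>
#include <seastar/core/do_with.hh>
#include <seastar/core/future.hh>
#include <seastar/core/when_all.hh>
#include <stdexcept>

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        return seastar::do_with(seastar::promise<>(), [] (seastar::promise<>& reply) {
            // Stand-in for a failed send: resolves immediately with an exception.
            auto send = seastar::make_exception_future<>(std::runtime_error("send failed"));
            // when_all() waits for *both* futures; nobody ever sets `reply`,
            // so the combined future never resolves and the program hangs here,
            // which is the shape of the send_helper hang described above.
            return seastar::when_all(std::move(send), reply.get_future()).discard_result();
        });
    });
}

Nothing here is rpc-specific; it just shows why resolving send() with an exception is not enough while wait_for_reply() has no corresponding error path.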
> And if the RPC is done with a timeout, the timeout will be triggered, right?

The timeout will be triggered, and it would abort the single _outstanding, yes.

> So the problem is that if a non-cancelable two-way RPC without a timeout fails to send data, it will wait forever for a reply. Correct?

Correct, but that's not all. Again, I haven't checked deeper yet, but my concern is that client::client's main do_while loop doesn't resolve and the .then_wrapped() continuation (that clears _outstanding and wakes up send_helpers) doesn't happen.
> > It is not a connection hang but an individual RPC call that hangs, no?
>
> If you call send() again it will hit the if (_error) and resolve with exception,

Ah, I see now that if we fail to send for any reason we mark the connection as failed. We need to wake up the receiver side as well for it to see that the connection should be closed.

> but send()'s future is not the only thing that's waited upon; send_helper also waits on wait_for_reply. Will it hang as well? I haven't checked yet, but I see no if (_error) checks in that path, so it probably will.

wait_for_reply() waits for a reply, a timeout, or a cancelation.
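Until a proper fix lands, a caller can at least bound the wait from the outside. Below is a minimal sketch under assumptions: call_remote() is a hypothetical stand-in for a two-way rpc call whose reply never arrives; seastar::with_timeout makes the returned future resolve with timed_out_error instead of hanging, which is the same effect a per-call rpc timeout or cancellable gives closer to the source.

#include <seastar/core/app-template.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/timed_out_error.hh>
#include <seastar/core/with_timeout.hh>
#include <chrono>
#include <iostream>

// Hypothetical stand-in for a two-way rpc call whose reply never arrives.
static seastar::future<> call_remote() {
    return seastar::sleep(std::chrono::hours(24));
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        using namespace std::chrono_literals;
        auto deadline = seastar::lowres_clock::now() + 5s;
        // Bound the otherwise unbounded wait: after 5s this resolves with
        // seastar::timed_out_error instead of waiting for a reply forever.
        return seastar::with_timeout(deadline, call_remote())
            .handle_exception_type([] (seastar::timed_out_error&) {
                std::cout << "rpc call timed out instead of hanging\n";
            });
    });
}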
> > And if the RPC is done with a timeout, the timeout will be triggered, right?
>
> The timeout will be triggered, and it would abort the single _outstanding, yes.
>
> > So the problem is that if a non-cancelable two-way RPC without a timeout fails to send data, it will wait forever for a reply. Correct?
>
> Correct, but that's not all. Again, I haven't checked deeper yet, but my concern is that client::client's main do_while loop doesn't resolve and the .then_wrapped() continuation (that clears _outstanding and wakes up send_helpers) doesn't happen.

do_until checks _error, but it needs to be woken up if it sleeps on a receive.
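The "wake it up" point generalizes: setting an error flag is not enough while the other fiber is blocked in a receive. A minimal sketch of that pattern, not the actual rpc fix: seastar::abort_source and sleep_abortable stand in for the receive loop and the wake-up done by the failing send path.

#include <seastar/core/abort_source.hh>
#include <seastar/core/app-template.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/when_all.hh>
#include <chrono>
#include <iostream>

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        auto as = seastar::make_lw_shared<seastar::abort_source>();
        // Stand-in for the receive side: blocked until it is explicitly woken.
        auto receiver = seastar::sleep_abortable(std::chrono::hours(24), *as)
            .handle_exception_type([] (seastar::sleep_aborted&) {
                std::cout << "receiver woken, can now see the error and close\n";
            });
        // Stand-in for the failing send side: besides recording the error,
        // it actively wakes the receiver.
        auto sender = seastar::sleep(std::chrono::seconds(1)).then([as] {
            std::cout << "send failed, waking receiver\n";
            as->request_abort();
        });
        return seastar::when_all(std::move(receiver), std::move(sender))
            .discard_result()
            .finally([as] {});   // keep the abort_source alive until both finish
    });
}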
> do_until checks _error, but it needs to be woken up if it sleeps on a receive.
:+1:
Not only this. The batch-flush is also broken in this sense:
output_stream<CharType>::poll_flush() noexcept {
    ...
    // FIXME: future is discarded
    (void)f.then([this] {
        return _fd.flush();
    }).then_wrapped([this] (future<> f) {
        try {
            f.get();
        } catch (...) {
            _ex = std::current_exception();   <<<<< HERE
        }
        // if flush() was called while flushing flush once more
        poll_flush();
    });
}
If an exception pops up at $HERE, no RPC code gets notified; the send path resolves without an exception and the recv part continues waiting for the response frame.
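The same failure mode can be reproduced in isolation. The following is not seastar's output_stream code, just a minimal sketch of the pattern quoted above: a background flush whose future is discarded parks its exception in a member variable, the "send" side has long since resolved successfully, and nothing waiting for a reply ever learns about the error.

#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>
#include <memory>
#include <stdexcept>

struct batch_flusher {
    std::exception_ptr _ex;   // mirrors output_stream::_ex

    void poll_flush() {
        // The future is discarded, so by the time the flush fails the caller's
        // send path has already resolved successfully.
        (void)seastar::sleep(std::chrono::milliseconds(10)).then([] {
            return seastar::make_exception_future<>(std::runtime_error("flush failed"));
        }).handle_exception([this] (std::exception_ptr ep) {
            _ex = std::move(ep);   // parked here; nobody is notified
        });
    }
};

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        auto fl = std::make_unique<batch_flusher>();
        fl->poll_flush();
        std::cout << "send path resolved without an error\n";
        // Stand-in for the receive side waiting for a reply that cannot come,
        // because the flush error above was never propagated to it.
        return seastar::sleep(std::chrono::seconds(1)).then([fl = std::move(fl)] {
            if (fl->_ex) {
                std::cout << "the flush error was sitting in _ex all along\n";
            }
        });
    });
}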
@xemul any updates?
If we cannot fix the issue soon, let's add a warning to the log so we know if this happens in a deployment. We have seen an rpc response not being received for unknown reasons: https://github.com/scylladb/scylladb/issues/11354#issuecomment-1235161009.
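A sketch of what such a warning could look like. The logger name and the send_with_warning wrapper are hypothetical, not part of seastar; an actual patch would put the message directly in the rpc send path.

#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/util/log.hh>
#include <stdexcept>

static seastar::logger rpc_diag_log("rpc-diag");   // hypothetical logger name

// Hypothetical wrapper: report a failed send loudly instead of only recording
// it in internal connection state, then re-propagate the error.
static seastar::future<> send_with_warning(seastar::future<> send_fut) {
    return send_fut.handle_exception([] (std::exception_ptr ep) {
        try {
            std::rethrow_exception(ep);
        } catch (const std::exception& e) {
            rpc_diag_log.warn("rpc send failed, outstanding calls may hang: {}", e.what());
        } catch (...) {
            rpc_diag_log.warn("rpc send failed, outstanding calls may hang");
        }
        return seastar::make_exception_future<>(std::move(ep));
    });
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        // Demonstrate with a stand-in send future that fails immediately.
        return send_with_warning(seastar::make_exception_future<>(std::runtime_error("broken pipe")))
            .handle_exception([] (std::exception_ptr) { /* already reported above */ });
    });
}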
@asias, this issue turned out to be tied to #1150. I have a set of patches that fixes both, but the whole thing turned out to be more delicate than I expected and needs more care.
What about the "add warning logs first so we know this happens in practice" approach? It is very hard to debug such problems in production; knowing that the hang has happened is already very useful.
I think I'll send a fix for this particular issue without a test and without the #1150 fix.
Thanks. Can you share the link to the fix here to make it easier to track? ;-)