rpc/transport: Release resource units right after dispatching send
Cover letter
Currently we have a timer that throws in case of timeouts but does not actually clean up the underlying resource units. So if the dispatch of send blocks due to a remote issue (e.g. in flush), the resources are held up until that returns or fails. In the linked issue, we believe the sequence of events is as follows.
- raft `_op_lock` is grabbed by `replicate_entries_stm` and propagated to `transport::send` (append_entries)
- Test issues a shutdown (blocks on consensus stop, request in progress)
- send blocked for 2+ mins holding the `_op_lock` units (to another node in a questionable state)
- Test timed out after 60 secs
- send returned, request failed and the shutdown got unblocked.
Looking closely, we can actually release the units early, since the dispatch of send already guarantees ordering for us by using a semaphore.
Fixes #5324
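For illustration, here is a minimal Seastar-style sketch of the idea (the names `transport_sketch`, `dispatch_send`, and `send_opts` are assumptions, not the actual redpanda interfaces): because the semaphore's FIFO waiters fix the ordering of sends at dispatch time, the caller's resource units can be returned immediately instead of being held across a blocked flush or an unresponsive remote.

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

struct send_opts {
    // units pinned by the caller, e.g. raft's _op_lock units
    seastar::semaphore_units<> resource_units;
};

class transport_sketch {
    // FIFO semaphore: waiters are woken in order, so send ordering is
    // determined at the point the write is queued here
    seastar::semaphore _seq_sem{1};

public:
    template <typename WriteFn>
    seastar::future<> dispatch_send(send_opts opts, WriteFn write) {
        // queue the write under the semaphore; ordering is now fixed
        auto sent = seastar::with_semaphore(_seq_sem, 1, std::move(write));
        // release the caller's units right away: a remote that blocks in
        // flush can no longer pin them until the RPC returns or times out
        opts.resource_units.return_all();
        return sent;
    }
};
```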
Backport Required
- [ ] not a bug fix
- [ ] papercut/not impactful enough to backport
- [ ] v22.2.x
- [ ] v22.1.x
- [ ] v21.11.x
UX changes
- None
Release notes
- none
Features
- None.
Improvements
- Recover more quickly from remote node failures by releasing RPC resource units as soon as a send is dispatched, instead of holding them until the request completes or times out.
I'm testing it locally with induced sleeps in the flush code path plus a load generator in the background. Then I'm SIGINT-ing brokers in a rolling fashion, and so far the brokers shut down ok. I'll keep running this test for a few hours, analyse the logs and report here. The only issue with these timeouts is that they create a lot of noise in other RPCs, causing unnecessary leadership transfers and voting rounds (and a ton of uncaught-exception traces in the logs).
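A rough illustration of the kind of induced sleep described above (a hypothetical helper, not code from this PR): artificially slow down the flush path so a send to a "questionable" remote blocks far longer than the test timeouts, then roll SIGINTs across the brokers and check that shutdown is not held up by the stuck send.

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/iostream.hh>
#include <seastar/core/sleep.hh>

#include <chrono>

// simulate a remote in a questionable state: make flush take minutes so the
// dispatching fiber blocks while the caller's resource units are in play
seastar::future<> flush_with_induced_delay(seastar::output_stream<char>& out) {
    return seastar::sleep(std::chrono::minutes(2)).then([&out] {
        return out.flush();
    });
}
```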
/ci-repeat 5
The issue was that the failure injector (ISOLATE) was not being cleaned up properly on teardown. I updated the patch and ci-repeat-5 looks ok; the only failure in the run was VerifiableProducer not stopping in time, but that is not related to the core fix here. I'll look into it, but the patch is ready for review.
@dotnwat can you please take a look at this? It looks good to me.
/ci-repeat 5
/ci-repeat 10
I'm still running some experiments on the CRC mismatch test failure. I made some progress but I need to run a few more; please ignore noise from this PR, I'll re-request reviews when ready.
Looks good, but I think we should consider preserving the original concurrency semantics in the changes to `transport::do_send(sequence_t seq, netbuf b, rpc::client_opts opts)`, not because what you have doesn't look correct, but because the rpc bits are so fragile and subtle that it's hard to analyze correctness w.r.t. some implicit ordering requirements. wdyt?
@dotnwat Took a deeper look at the CRC mismatch issue, so here is the summary:
- It is not related to coroutinization of do_send, reproduces without that commit.
- It is not related to the failure injection in the test, reproduces without it.
- It is not related to the current patch at all, reproduces on the tip of dev.
- It is related to the write throughput in the producer, I reduced it and the errors are gone. I've dug into the `VerifiableProducer` that wraps a regular kafka client, nothing jumps out. Does not repro for me locally at all (even in 1000s of runs). Seems like a client side issue (I hope) but I don't have any theory.
That said, I removed the coroutinization commit just to be extra safe and fixed the test by reducing the throughput, mind taking another look?
/ci-repeat 10
> It is not related to the current patch at all, reproduces on the tip of dev. It is related to the write throughput in the producer, I reduced it and the errors are gone. I've dug into the VerifiableProducer that wraps a regular kafka client, nothing jumps out. Does not repro for me locally at all (even in 1000s of runs). Seems like a client side issue (I hope) but I don't have any theory.
@bharathv is there a ticket tracking the failure and how to reproduce it?
Tracking in https://github.com/redpanda-data/redpanda/issues/6631.
I updated the cover letter with backport info, since I think I've seen it here: https://buildkite.com/redpanda/redpanda/builds/16044#01839d62-8fd9-420d-ad3c-2e3f7edb4af9 - please adjust if that's not the case.
In some later PR it might be useful to sneak in a comment on the batched output stream noting that the rpc layer depends on the ordering imposed by the write interface. For example, it should never wake up multiple waiters on the semaphore, because those may lose their original FIFO ordering.
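For illustration, the suggested comment might read something like this (the class and method signature below are stand-ins, not the actual redpanda batched output stream interface):

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/scattered_message.hh>

// stand-in for the real batched output stream
class batched_output_stream_sketch {
public:
    // NOTE: the rpc transport relies on the ordering imposed by this write
    // interface. Writes must reach the underlying stream in submission order,
    // and the internal semaphore must never wake up multiple waiters at once,
    // since that could reorder them and break the FIFO assumption the rpc
    // layer depends on.
    seastar::future<bool> write(seastar::scattered_message<char> msg);
};
```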
That makes sense.
/backport v22.2.x
/backport 22.2.x
@bharathv should this not be backported? IIUC it isn't fixing a specific bug
The original theory was that this fixes the shutdown hang reported in #5324, and Ben recently reported an occurrence on the v22.2.x branch, hence the backport. Do you think it is risky for a patch release?