
rpc/transport: Release resource units right after dispatching send

Open bharathv opened this issue 3 years ago • 5 comments

Cover letter

Currently we have a timer that throws in case of timeouts but does not actually clean up the underlying resource units. So if the dispatch of send blocks due to a remote issue (e.g. in flush), the resources are held up until that send returns or fails. In the linked issue, we believe the sequence of events is as follows.

  • The raft _op_lock is grabbed by replicate_entries_stm and propagated to transport#send (append_entries).
  • The test issues a shutdown (blocks on consensus stop while the request is in progress).
  • send blocks for 2+ minutes holding the op_lock resources (the request is to another node in a questionable state).
  • The test times out after 60 seconds.
  • send returns, the request fails, and the shutdown unblocks.

Looking closely, we can actually release the units early, since the dispatch of send already guarantees ordering for us via a semaphore.
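
For illustration, here is a minimal Seastar-flavored sketch of the idea (hypothetical type and member names, not the actual transport::do_send; it also assumes Seastar's semaphore_units::return_all()): the sequencing semaphore fixes the write order at dispatch time, so the caller-supplied resource units can be returned as soon as the buffer is handed to the stream, rather than after the remote flush completes.

```cpp
// Minimal sketch only -- hypothetical names, not the actual redpanda code.
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

#include <utility>

namespace sketch {

struct netbuf {}; // stand-in for the serialized request
struct client_opts {
    // units pinning caller-side resources (e.g. the raft op_lock) for this request
    seastar::semaphore_units<> resource_units;
};

class transport {
public:
    seastar::future<> do_send(netbuf b, client_opts opts) {
        // the sequencing semaphore admits one dispatch at a time, in FIFO
        // order, so the write order is fixed once we get past this point
        auto seq_units = co_await seastar::get_units(_out_seq, 1);

        // hand the buffer to the connection's output stream
        auto write_done = write_to_stream(std::move(b));

        // ordering is now guaranteed by the semaphore, so the caller's
        // resource units can be returned immediately instead of being held
        // until the (possibly very slow) remote flush returns or fails
        opts.resource_units.return_all();

        // only the send itself waits on the remote side
        co_await std::move(write_done);
    }

private:
    seastar::future<> write_to_stream(netbuf) {
        return seastar::make_ready_future<>(); // stand-in for write + flush
    }

    seastar::semaphore _out_seq{1};
};

} // namespace sketch
```

The point being illustrated: only the connection-level sequencing units stay alive across the potentially slow remote flush, while anything the caller pinned (e.g. the raft op_lock units) is released as soon as ordering is guaranteed.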

Fixes #5324

Backport Required

  • [ ] not a bug fix
  • [ ] papercut/not impactful enough to backport
  • [ ] v22.2.x
  • [ ] v22.1.x
  • [ ] v21.11.x

UX changes

  • None

Release notes

  • None

Features

  • None.

Improvements

  • Recovers from remote failures more quickly by releasing resource units right after dispatch instead of holding them while a request is blocked.

bharathv avatar Jul 29 '22 15:07 bharathv

I'm testing it locally with induced sleeps in the flush code path plus a load_generator in the background. Then I'm SIGINT-ing brokers in a rolling fashion, and so far the brokers shut down OK. I'll keep running this test for a few hours, analyse the logs and report here. The only issue with these timeouts is that they create a lot of noise in other RPCs, causing unnecessary leadership transfers and voting rounds (and a ton of uncaught-exception traces in the logs).

bharathv avatar Jul 29 '22 17:07 bharathv

/ci-repeat 5

bharathv avatar Aug 04 '22 01:08 bharathv

/ci-repeat 5

bharathv avatar Aug 04 '22 01:08 bharathv

The issue was with not cleaning up the failure injector (ISOLATE) properly on teardown. I updated the patch and ci-repeat-5 looks OK; the only failure in the run was VerifiableProducer not stopping in time, but that is not related to the core fix here. I'll look into it, but the patch is ready for review.

bharathv avatar Aug 04 '22 04:08 bharathv

@dotnwat can you please take a look at this? It looks good to me.

mmaslankaprv avatar Aug 22 '22 16:08 mmaslankaprv

/ci-repeat 5

bharathv avatar Sep 28 '22 06:09 bharathv

/ci-repeat 10

bharathv avatar Oct 01 '22 06:10 bharathv

I'm still running some experiments on the CRC mismatch test failure. I made some progress but need to run a few more, so please ignore noise from this PR; I'll re-request reviews when ready.

bharathv avatar Oct 03 '22 17:10 bharathv

Looks good, but I think we should consider preserving the original concurrency semantics in the changes to transport::do_send(sequence_t seq, netbuf b, rpc::client_opts opts), not because what you have doesn't look correct, but because the RPC bits are so fragile and subtle that it's hard to analyze correctness w.r.t. some implicit ordering requirements. wdyt?

@dotnwat Took a deeper look at the CRC mismatch issue; here is the summary:

  • It is not related to the coroutinization of do_send; it reproduces without that commit.
  • It is not related to the failure injection in the test; it reproduces without it.
  • It is not related to the current patch at all; it reproduces on the tip of dev.
  • It is related to the write throughput in the producer; I reduced it and the errors went away. I've dug into the VerifiableProducer that wraps a regular Kafka client, and nothing jumps out. It does not repro for me locally at all (even in 1000s of runs). It seems like a client-side issue (I hope), but I don't have a theory yet.

That said, I removed the coroutinization commit just to be extra safe and fixed the test by reducing the throughput. Mind taking another look?

bharathv avatar Oct 04 '22 03:10 bharathv

/ci-repeat 10

bharathv avatar Oct 04 '22 03:10 bharathv

It is not related to the current patch at all; it reproduces on the tip of dev. It is related to the write throughput in the producer; I reduced it and the errors went away. I've dug into the VerifiableProducer that wraps a regular Kafka client, and nothing jumps out. It does not repro for me locally at all (even in 1000s of runs). It seems like a client-side issue (I hope), but I don't have a theory yet.

@bharathv is there a ticket tracking the failure and how to reproduce it?

dotnwat avatar Oct 05 '22 01:10 dotnwat

@bharathv is there a ticket tracking the failure and how to reproduce it?

Tracking in https://github.com/redpanda-data/redpanda/issues/6631.

bharathv avatar Oct 05 '22 06:10 bharathv

I updated the cover letter with backport info, since I think I've seen this failure here: https://buildkite.com/redpanda/redpanda/builds/16044#01839d62-8fd9-420d-ad3c-2e3f7edb4af9 - please adjust if that's not the case.

BenPope avatar Oct 05 '22 11:10 BenPope

In some later PR it might be useful to sneak in a comment on the batch output stream noting that the RPC layer depends on the ordering imposed by the write interface; for example, it should never wake up multiple waiters on the semaphore, because those may lose their original FIFO ordering.

That makes sense.
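
For readers following along, here is a minimal illustrative sketch of that invariant (a hypothetical class, not redpanda's batch output stream): writes are serialized behind a single-unit Seastar semaphore, and because Seastar semaphores admit waiters in FIFO order, one per available unit, queued writes reach the stream in arrival order; signalling enough units to wake several waiters at once could let a later write overtake an earlier one.

```cpp
// Sketch only -- hypothetical class illustrating the ordering dependency.
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/iostream.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/temporary_buffer.hh>

#include <utility>

class ordered_output {
public:
    explicit ordered_output(seastar::output_stream<char> out)
      : _out(std::move(out)) {}

    seastar::future<> write(seastar::temporary_buffer<char> buf) {
        // single unit: waiters are admitted one at a time, in FIFO order,
        // so concurrent callers cannot interleave their writes
        auto units = co_await seastar::get_units(_write_lock, 1);
        co_await _out.write(std::move(buf));
        co_await _out.flush();
        // the unit is returned here, waking exactly the next waiter in line;
        // waking multiple waiters at once would break the FIFO guarantee
    }

    seastar::future<> close() { return _out.close(); }

private:
    seastar::output_stream<char> _out;
    seastar::semaphore _write_lock{1};
};
```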

bharathv avatar Oct 06 '22 05:10 bharathv

/backport v22.2.x

bharathv avatar Oct 06 '22 05:10 bharathv

/backport 22.2.x

bharathv avatar Oct 24 '22 18:10 bharathv

@bharathv should this not be backported? IIUC it isn't fixing a specific bug

dotnwat avatar Oct 26 '22 00:10 dotnwat

@bharathv should this not be backported? IIUC it isn't fixing a specific bug

The original theory was that this fixes the shutdown hang reported in #5324, and Ben recently reported an occurrence on the 22.2.x branch, hence the backport. Do you think it is too risky for a patch release?

bharathv avatar Oct 26 '22 00:10 bharathv