
rpc/transport: Release resource units right after dispatching send

Open bharathv opened this issue 3 years ago • 5 comments

Cover letter

Currently we have a timer that throws in case of timeouts but does not actually clean up the underlying resource units. So if the dispatch of send blocks due to a remote issue (e.g. in flush), the resources are held up until that send returns or fails. In the linked issue, we believe the sequence of events is as follows.

  • The raft _op_lock is grabbed by replicate_entries_stm and propagated to transport#send (append_entries).
  • The test issues a shutdown (blocks on consensus stop while the request is in progress).
  • send blocks for 2+ minutes holding the op_lock resources (the request is to another node in a questionable state).
  • The test times out after 60 seconds.
  • send returns, the request fails, and the shutdown unblocks.

Looking closely, we can actually release the units early, since the dispatch of send already guarantees ordering for us via a semaphore.
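
For illustration, here is a minimal Seastar-flavored sketch of the idea (hypothetical type and member names, not the actual transport::do_send; it also assumes Seastar's semaphore_units::return_all()): the sequencing semaphore fixes the write order at dispatch time, so the caller-supplied resource units can be returned as soon as the buffer is handed to the stream, rather than after the remote flush completes.

```cpp
// Minimal sketch only -- hypothetical names, not the actual redpanda code.
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

#include <utility>

namespace sketch {

struct netbuf {}; // stand-in for the serialized request
struct client_opts {
    // units pinning caller-side resources (e.g. the raft op_lock) for this request
    seastar::semaphore_units<> resource_units;
};

class transport {
public:
    seastar::future<> do_send(netbuf b, client_opts opts) {
        // the sequencing semaphore admits one dispatch at a time, in FIFO
        // order, so the write order is fixed once we get past this point
        auto seq_units = co_await seastar::get_units(_out_seq, 1);

        // hand the buffer to the connection's output stream
        auto write_done = write_to_stream(std::move(b));

        // ordering is now guaranteed by the semaphore, so the caller's
        // resource units can be returned immediately instead of being held
        // until the (possibly very slow) remote flush returns or fails
        opts.resource_units.return_all();

        // only the send itself waits on the remote side
        co_await std::move(write_done);
    }

private:
    seastar::future<> write_to_stream(netbuf) {
        return seastar::make_ready_future<>(); // stand-in for write + flush
    }

    seastar::semaphore _out_seq{1};
};

} // namespace sketch
```

The point being illustrated: only the connection-level sequencing units stay alive across the potentially slow remote flush, while anything the caller pinned (e.g. the raft op_lock units) is released as soon as ordering is guaranteed.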

Fixes #5324

Backport Required

  • [ ] not a bug fix
  • [ ] papercut/not impactful enough to backport
  • [ ] v22.2.x
  • [ ] v22.1.x
  • [ ] v21.11.x

UX changes

  • None

Release notes

  • None

Features

  • None.

Improvements

  • Recovers from remote failures more quickly by releasing resource units right after dispatch instead of holding them while a request is blocked.

bharathv avatar Jul 29 '22 15:07 bharathv

I'm testing it locally with induced sleeps in the flush code path plus a load_generator in the background. Then I'm SIGINT-ing brokers in a rolling fashion, and so far the brokers shut down OK. I'll keep running this test for a few hours, analyse the logs and report here. The only issue with these timeouts is that they create a lot of noise in other RPCs, causing unnecessary leadership transfers and voting rounds (and a ton of uncaught-exception traces in the logs).

bharathv avatar Jul 29 '22 17:07 bharathv

/ci-repeat 5

bharathv avatar Aug 04 '22 01:08 bharathv

/ci-repeat 5

bharathv avatar Aug 04 '22 01:08 bharathv

The issue was with not cleaning up the failure injector (ISOLATE) properly on teardown. I updated the patch and ci-repeat-5 looks OK; the only failure in the run was VerifiableProducer not stopping in time, but that is not related to the core fix here. I'll look into it, but the patch is ready for review.

bharathv avatar Aug 04 '22 04:08 bharathv

@dotnwat can you please take a look at this? It looks good to me.

mmaslankaprv avatar Aug 22 '22 16:08 mmaslankaprv

/ci-repeat 5

bharathv avatar Sep 28 '22 06:09 bharathv

/ci-repeat 10

bharathv avatar Oct 01 '22 06:10 bharathv

I'm still running some experiments on the CRC mismatch test failure. I made some progress but need to run a few more, so please ignore noise from this PR; I'll re-request reviews when ready.

bharathv avatar Oct 03 '22 17:10 bharathv

Looks good, but I think we should consider preserving the original concurrency semantics in the changes to transport::do_send(sequence_t seq, netbuf b, rpc::client_opts opts), not because what you have doesn't look correct, but because the RPC bits are so fragile and subtle that it's hard to analyze correctness w.r.t. some implicit ordering requirements. wdyt?

@dotnwat Took a deeper look at the CRC mismatch issue; here is the summary:

  • It is not related to the coroutinization of do_send; it reproduces without that commit.
  • It is not related to the failure injection in the test; it reproduces without it.
  • It is not related to the current patch at all; it reproduces on the tip of dev.
  • It is related to the write throughput in the producer; I reduced it and the errors went away. I've dug into the VerifiableProducer that wraps a regular Kafka client, and nothing jumps out. It does not repro for me locally at all (even in 1000s of runs). It seems like a client-side issue (I hope), but I don't have a theory yet.

That said, I removed the coroutinization commit just to be extra safe and fixed the test by reducing the throughput. Mind taking another look?

bharathv avatar Oct 04 '22 03:10 bharathv

/ci-repeat 10

bharathv avatar Oct 04 '22 03:10 bharathv

It is not related to the current patch at all; it reproduces on the tip of dev. It is related to the write throughput in the producer; I reduced it and the errors went away. I've dug into the VerifiableProducer that wraps a regular Kafka client, and nothing jumps out. It does not repro for me locally at all (even in 1000s of runs). It seems like a client-side issue (I hope), but I don't have a theory yet.

@bharathv is there a ticket tracking the failure and how to reproduce it?

dotnwat avatar Oct 05 '22 01:10 dotnwat

@bharathv is there a ticket tracking the failure and how to reproduce it?

Tracking in https://github.com/redpanda-data/redpanda/issues/6631.

bharathv avatar Oct 05 '22 06:10 bharathv

I updated the cover letter with backport info, since I think I've seen this failure here: https://buildkite.com/redpanda/redpanda/builds/16044#01839d62-8fd9-420d-ad3c-2e3f7edb4af9 - please adjust if that's not the case.

BenPope avatar Oct 05 '22 11:10 BenPope

In some later PR it might be useful to sneak in a comment on the batch output stream noting that the RPC layer depends on the ordering imposed by the write interface; for example, it should never wake up multiple waiters on the semaphore, because those may lose their original FIFO ordering.

That makes sense.
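
For readers following along, here is a minimal illustrative sketch of that invariant (a hypothetical class, not redpanda's batch output stream): writes are serialized behind a single-unit Seastar semaphore, and because Seastar semaphores admit waiters in FIFO order, one per available unit, queued writes reach the stream in arrival order; signalling enough units to wake several waiters at once could let a later write overtake an earlier one.

```cpp
// Sketch only -- hypothetical class illustrating the ordering dependency.
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/iostream.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/temporary_buffer.hh>

#include <utility>

class ordered_output {
public:
    explicit ordered_output(seastar::output_stream<char> out)
      : _out(std::move(out)) {}

    seastar::future<> write(seastar::temporary_buffer<char> buf) {
        // single unit: waiters are admitted one at a time, in FIFO order,
        // so concurrent callers cannot interleave their writes
        auto units = co_await seastar::get_units(_write_lock, 1);
        co_await _out.write(std::move(buf));
        co_await _out.flush();
        // the unit is returned here, waking exactly the next waiter in line;
        // waking multiple waiters at once would break the FIFO guarantee
    }

    seastar::future<> close() { return _out.close(); }

private:
    seastar::output_stream<char> _out;
    seastar::semaphore _write_lock{1};
};
```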

bharathv avatar Oct 06 '22 05:10 bharathv

/backport v22.2.x

bharathv avatar Oct 06 '22 05:10 bharathv

/backport 22.2.x

bharathv avatar Oct 24 '22 18:10 bharathv

@bharathv should this not be backported? IIUC it isn't fixing a specific bug

dotnwat avatar Oct 26 '22 00:10 dotnwat

@bharathv should this not be backported? IIUC it isn't fixing a specific bug

The original theory was that this fixes the shutdown hang reported in #5324, and Ben recently reported an occurrence on the 22.2.x branch, hence the backport. Do you think it is too risky for a patch release?

bharathv avatar Oct 26 '22 00:10 bharathv