redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

cloud_storage: adjust condition to remove archiver

Open abhijat opened this issue 3 years ago • 0 comments
trafficstars

An archiver is removed in reconcile loop if it is considered to be stopped. The stop condition checks if the upload loop is stopped. with read replica there are two status variables which denote a stopped loop - one related to normal upload and other related to read replica.

This change adjusts the condition such that any of these two variables being true marks the archiver as stopped, as the two conditions start out being false and they turn to true in mutually exclusive code paths.

The archiver must be removed in order for a stopped upload loop to restart later. This can happen in a condition where a node lost leadership, archiver is stopped and it immediately regained leadership. It is then necessary for the archiver to have been removed correctly earlier for it to start again.

fixes #5928

Backport Required

  • [ ] not a bug fix
  • [ ] papercut/not impactful enough to backport
  • [ ] v22.2.x
  • [ ] v22.1.x
  • [ ] v21.11.x

UX changes

None

Release notes

None

abhijat avatar Aug 10 '22 09:08 abhijat

at least one test failure looks concerning:



test_id:    rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True
--
  | status:     FAIL
  | run time:   1 minute 41.920 seconds
  |  
  |  
  | AssertionError("produced and consumed messages differ, produced length: 14139, consumed length: 14132, first mismatch: produced: b'14772', consumed: b'14863' (offset: 15243)")
  | Traceback (most recent call last):
  | File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
  | data = self.run_test()
  | File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
  | return self.test_context.function(self.test)
  | File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
  | return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  | File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
  | r = f(self, *args, **kwargs)
  | File "/root/tests/rptest/tests/e2e_topic_recovery_test.py", line 292, in test_restore_with_aborted_tx
  | assert (not first_mismatch), (
  | AssertionError: produced and consumed messages differ, produced length: 14139, consumed length: 14132, first mismatch: produced: b'14772', consumed: b'14863' (offset: 15243)

will wait for debug build to finish and then trigger another run, to see if it may be related to this change.

EDIT: did not fail in debug, doing a repeatx5 to check what fails.

abhijat avatar Aug 10 '22 15:08 abhijat

most test failures are variants of

failures=False failures=True

eg https://buildkite.com/redpanda/redpanda/builds/13929#0182885d-f1ef-4b0a-bb56-c7715aa0f404

couple of others which are mentioned on the PRs

test_id: rptest.tests.consumer_offsets_migration_test.ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=False.cpus=1  

abhijat avatar Aug 10 '22 17:08 abhijat

/backport 22.2.x

piyushredpanda avatar Aug 10 '22 22:08 piyushredpanda