redpanda
redpanda copied to clipboard
cloud_storage: adjust condition to remove archiver
An archiver is removed in reconcile loop if it is considered to be stopped. The stop condition checks if the upload loop is stopped. with read replica there are two status variables which denote a stopped loop - one related to normal upload and other related to read replica.
This change adjusts the condition such that any of these two variables being true marks the archiver as stopped, as the two conditions start out being false and they turn to true in mutually exclusive code paths.
The archiver must be removed in order for a stopped upload loop to restart later. This can happen in a condition where a node lost leadership, archiver is stopped and it immediately regained leadership. It is then necessary for the archiver to have been removed correctly earlier for it to start again.
fixes #5928
Backport Required
- [ ] not a bug fix
- [ ] papercut/not impactful enough to backport
- [ ] v22.2.x
- [ ] v22.1.x
- [ ] v21.11.x
UX changes
None
Release notes
None
at least one test failure looks concerning:
test_id: rptest.tests.e2e_topic_recovery_test.EndToEndTopicRecovery.test_restore_with_aborted_tx.recovery_overrides=.retention.bytes.1024.redpanda.remote.write.True.redpanda.remote.read.True
--
| status: FAIL
| run time: 1 minute 41.920 seconds
|
|
| AssertionError("produced and consumed messages differ, produced length: 14139, consumed length: 14132, first mismatch: produced: b'14772', consumed: b'14863' (offset: 15243)")
| Traceback (most recent call last):
| File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
| data = self.run_test()
| File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
| return self.test_context.function(self.test)
| File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
| return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
| File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
| r = f(self, *args, **kwargs)
| File "/root/tests/rptest/tests/e2e_topic_recovery_test.py", line 292, in test_restore_with_aborted_tx
| assert (not first_mismatch), (
| AssertionError: produced and consumed messages differ, produced length: 14139, consumed length: 14132, first mismatch: produced: b'14772', consumed: b'14863' (offset: 15243)
will wait for debug build to finish and then trigger another run, to see if it may be related to this change.
EDIT: did not fail in debug, doing a repeatx5 to check what fails.
most test failures are variants of
failures=False failures=True
eg https://buildkite.com/redpanda/redpanda/builds/13929#0182885d-f1ef-4b0a-bb56-c7715aa0f404
couple of others which are mentioned on the PRs
test_id: rptest.tests.consumer_offsets_migration_test.ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=False.cpus=1
/backport 22.2.x