redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

CI Failure (BadLogLines) in `RandomNodeOperationsTest.test_node_operations`

Open andijcr opened this issue 1 year ago • 53 comments

https://buildkite.com/redpanda/vtools/builds/10805

This issue is an umbrella for 3 instances of bad log lines:

  1. Can't apply override to kafka start offset -> Fixed in #15192
  2. Failed to make reupload candidate to match the run, candidate: {source segment offsets: {term:3, base_offset:15, committed_offset:483, dirty_offset:483}, exposed_name: {15-3-v1.log}, starting_offset: 15, file_offset: 0, content_length: 5108, final_offset: 483, final_file_offset: 5108, term: 3, source names: {/var/lib/redpanda/data/kafka/fuzzy-operator-2047-ygyvas/0_6696/15-3-v1.log}}, run: {is_compacted: true, size_bytes: 5224, base_offset: 15, committed_offset: 483, base_timestamp: {timestamp: 1703239361564}, max_timestamp: {timestamp: 1703239430338}, delta_offset: 15, ntp_revision: 64, archiver_term: 4, segment_term: 3, delta_offset_end: 25, sname_format: {v3}, metadata_size_hint: 0} Discussed in https://redpandadata.slack.com/archives/C05S5SBT4CU/p1704303384062129 Builds:
    • https://buildkite.com/redpanda/vtools/builds/11262
  3. ERROR 2023-12-14 11:38:06,999 [shard 0:main] cluster - topic_table.cc:184 - Unexpected record at offset 5399 to drop non-existent lifecycle marker {kafka/fuzzy-operator-5709-agcwue} 3685 Builds:
    • https://buildkite.com/redpanda/vtools/builds/11151#018c6758-5e5b-4eb6-bb5b-4009aa784930
  4. ERROR 2023-12-18 11:13:10,469 [shard 0:au ] archival - [fiber1~3|0|19974ms] - purger.cc:168 - Permanent failures encountered while purging partition {kafka/fuzzy-operator-532-qebtjv/2}. Giving up ... Builds:
    • https://ci-artifacts.dev.vectorized.cloud/vtools/11202/018c7bf3-cedc-4f4b-976c-ab7fb496f37b/vbuild/ducktape/results/2023-12-18--001/report.html
Module: rptest.tests.random_node_operations_test
Class: RandomNodeOperationsTest
Method: test_node_operations
Arguments: {
    "num_to_upgrade": 3,
    "enable_failures": false,
    "with_tiered_storage": true
}
test_id:    RandomNodeOperationsTest.test_node_operations
status:     FAIL
run time:   2061.472 seconds

<BadLogLines nodes=ip-172-31-33-161(2) example="ERROR 2023-11-25 23:15:40,652 [shard 0] cluster - ntp: {kafka/fuzzy-operator-2796-ncuizc/5} - archival_metadata_stm.cc:1229 - Can't apply override to kafka start offset 29, currently 29">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 142, in wrapped
    redpanda.raise_on_bad_logs(
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1321, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-33-161(2) example="ERROR 2023-11-25 23:15:40,652 [shard 0] cluster - ntp: {kafka/fuzzy-operator-2796-ncuizc/5} - archival_metadata_stm.cc:1229 - Can't apply override to kafka start offset 29, currently 29">

JIRA Link: CORE-1601

andijcr avatar Nov 27 '23 17:11 andijcr

not sure if this should be reported as error

andijcr avatar Nov 27 '23 17:11 andijcr

This was already fixed with: https://github.com/redpanda-data/redpanda/pull/15192

mmaslankaprv avatar Dec 04 '23 13:12 mmaslankaprv

*https://buildkite.com/redpanda/vtools/builds/11151 *https://buildkite.com/redpanda/vtools/builds/11151

vbotbuildovich avatar Dec 15 '23 01:12 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11202

vbotbuildovich avatar Dec 19 '23 00:12 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11214

vbotbuildovich avatar Dec 20 '23 00:12 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11243

vbotbuildovich avatar Dec 22 '23 00:12 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11262

vbotbuildovich avatar Dec 23 '23 00:12 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11262

at least the last failure mode is different:

<BadLogLines nodes=ip-172-31-13-176(2),ip-172-31-5-190(7),ip-172-31-2-19(29) example="ERROR 2023-12-22 10:09:03,417 [shard 1:au  ] archival - [fiber11 kafka/fuzzy-operator-2047-ygyvas/0] - ntp_archiver_service.cc:2832 - Failed to make reupload candidate to match the run, candidate: {source segment offsets: {term:3, base_offset:15, committed_offset:483, dirty_offset:483}, exposed_name: {15-3-v1.log}, starting_offset: 15, file_offset: 0, content_length: 5108, final_offset: 483, final_file_offset: 5108, term: 3, source names: {/var/lib/redpanda/data/kafka/fuzzy-operator-2047-ygyvas/0_6696/15-3-v1.log}}, run: {is_compacted: true, size_bytes: 5224, base_offset: 15, committed_offset: 483, base_timestamp: {timestamp: 1703239361564}, max_timestamp: {timestamp: 1703239430338}, delta_offset: 15, ntp_revision: 64, archiver_term: 4, segment_term: 3, delta_offset_end: 25, sname_format: {v3}, metadata_size_hint: 0}">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 142, in wrapped
    redpanda.raise_on_bad_logs(
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1325, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-13-176(2),ip-172-31-5-190(7),ip-172-31-2-19(29) example="ERROR 2023-12-22 10:09:03,417 [shard 1:au  ] archival - [fiber11 kafka/fuzzy-operator-2047-ygyvas/0] - ntp_archiver_service.cc:2832 - Failed to make reupload candidate to match the run, candidate: {source segment offsets: {term:3, base_offset:15, committed_offset:483, dirty_offset:483}, exposed_name: {15-3-v1.log}, starting_offset: 15, file_offset: 0, content_length: 5108, final_offset: 483, final_file_offset: 5108, term: 3, source names: {/var/lib/redpanda/data/kafka/fuzzy-operator-2047-ygyvas/0_6696/15-3-v1.log}}, run: {is_compacted: true, size_bytes: 5224, base_offset: 15, committed_offset: 483, base_timestamp: {timestamp: 1703239361564}, max_timestamp: {timestamp: 1703239430338}, delta_offset: 15, ntp_revision: 64, archiver_term: 4, segment_term: 3, delta_offset_end: 25, sname_format: {v3}, metadata_size_hint: 0}">

andijcr avatar Dec 28 '23 09:12 andijcr

*https://buildkite.com/redpanda/vtools/builds/11301 *https://buildkite.com/redpanda/vtools/builds/11329 *https://buildkite.com/redpanda/vtools/builds/11351

vbotbuildovich avatar Jan 09 '24 14:01 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11301 *https://buildkite.com/redpanda/vtools/builds/11367

vbotbuildovich avatar Jan 09 '24 14:01 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11490

vbotbuildovich avatar Jan 18 '24 00:01 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11502 *https://buildkite.com/redpanda/vtools/builds/11516

vbotbuildovich avatar Jan 20 '24 00:01 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11527

vbotbuildovich avatar Jan 21 '24 00:01 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11585

vbotbuildovich avatar Jan 27 '24 01:01 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11616

vbotbuildovich avatar Jan 27 '24 01:01 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11630

vbotbuildovich avatar Jan 29 '24 00:01 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11630

vbotbuildovich avatar Jan 29 '24 00:01 vbotbuildovich

~~https://buildkite.com/redpanda/vtools/builds/11651~~ <BadLogLines nodes=rp-node-9-needlessly-choice-egret(1) example="ERROR 2024-01-30 09:22:56,880 [shard 2:au ] cloud_storage - [fiber76~122|0|29632ms] - remote.cc:1243 - Failed to delete keys: Semaphore broken: self_config_barrier"> this specific instance of failure will be cleaned up here https://github.com/redpanda-data/redpanda/pull/16385

vbotbuildovich avatar Jan 31 '24 00:01 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11668

TimeoutError('Redpanda service rp-node-14-early-subtle-jennet failed to start within 360 sec')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 99, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/random_node_operations_test.py", line 428, in test_node_operations
    executor.execute_operation(op)
  File "/home/ubuntu/redpanda/tests/rptest/utils/node_operations.py", line 430, in execute_operation
    self.add(operation.node)
  File "/home/ubuntu/redpanda/tests/rptest/utils/node_operations.py", line 388, in add
    self.redpanda.start_node(node,
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 2487, in start_node
    self.start_service(node, start_rp)
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 2563, in start_service
    start()
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 2478, in start_rp
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Redpanda service rp-node-14-early-subtle-jennet failed to start within 360 sec

vbotbuildovich avatar Feb 02 '24 00:02 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11704

vbotbuildovich avatar Feb 04 '24 00:02 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11704

<BadLogLines nodes=rp-node-11-perfectly-handy-oryx(1) example="ERROR 2024-02-03 09:04:19,088 [shard 1:au  ] archival - [fiber1~8|0|19915ms] - purger.cc:168 - Permanent failures encountered while purging partition {kafka/fuzzy-operator-6695-lqmlrr/7}. Giving up ...">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 168, in wrapped
    redpanda.raise_on_bad_logs(
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1413, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=rp-node-11-perfectly-handy-oryx(1) example="ERROR 2024-02-03 09:04:19,088 [shard 1:au  ] archival - [fiber1~8|0|19915ms] - purger.cc:168 - Permanent failures encountered while purging partition {kafka/fuzzy-operator-6695-lqmlrr/7}. Giving up ...">

a look at the log points out to the manifest for the partition not being in cloud storage, so maybe an edge case of a manifest not yet uploaded or a race in deleting it https://github.com/redpanda-data/redpanda/blob/72b7011faf7a1cb29765db2a2b4343667a03cc0a/src/v/archival/purger.cc#L239-L264

andijcr avatar Feb 05 '24 13:02 andijcr

*https://buildkite.com/redpanda/vtools/builds/11756

vbotbuildovich avatar Feb 08 '24 00:02 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11766

vbotbuildovich avatar Feb 13 '24 20:02 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11824

vbotbuildovich avatar Feb 13 '24 20:02 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11845 *https://buildkite.com/redpanda/vtools/builds/11858

vbotbuildovich avatar Feb 15 '24 00:02 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11870

vbotbuildovich avatar Feb 16 '24 00:02 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11895 *https://buildkite.com/redpanda/vtools/builds/11902

vbotbuildovich avatar Feb 19 '24 00:02 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11929

vbotbuildovich avatar Feb 21 '24 00:02 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/12024

vbotbuildovich avatar Feb 29 '24 00:02 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/12025

vbotbuildovich avatar Feb 29 '24 00:02 vbotbuildovich