Shutdown hang in `ConsumerOffsetsMigrationTest` `test_migrating_consume_offsets` `failures=True.cpus=3`
- FAIL test: ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=True.cpus=3 (1/72 runs)
- failure at 2022-07-04T08:05:02.564Z: TimeoutError('Redpanda node docker-rp-2 failed to stop in 60 seconds')
- in job https://buildkite.com/redpanda/redpanda/builds/12033#0181c7df-a02c-4d55-a971-399a1a1e571b
A previous issue, https://github.com/redpanda-data/redpanda/issues/4670, reported a "failed to stop" in this test that was related to the failure injection, but the logs in this failure look like a genuine shutdown hang in redpanda -- we can see the node receive the signal to stop, yet remain alive for some time. Compaction is in progress at the time it is signaled to stop.
I did some initial log analysis; it looks like an issue in the cleanup of consensus state.
Sequence of events:
- The shutdown request came in at 2022-07-04 07:02:24, but one log was not shutting down: we can see it staying active with housekeeping activity well beyond the 60 s test timeout, finally stopping at 2022-07-04 07:04:41.
41076:INFO 2022-07-04 07:02:24,633 [shard 0] redpanda::main - application.cc:247 - Stopping...
.....
41160:TRACE 2022-07-04 07:02:25,256 [shard 0] storage - disk_log_impl.cc:122 - waiting for {kafka_internal/group/6} compaction to finish
.....
41245:TRACE 2022-07-04 07:02:29,840 [shard 1] storage-gc - disk_log_impl.cc:542 - [{kafka/topic-khoyjdamnd/0}] houskeeping with configuration from manager: {evicition_time:{timestamp: 1656313349840}, max_bytes:18446744073709551615, should_sanitize:false}
41247:DEBUG 2022-07-04 07:02:29,841 [shard 1] storage-gc - disk_log_impl.cc:290 - [{kafka/topic-khoyjdamnd/0}] time retention timestamp: {timestamp: 1656313349840}, first segment max timestamp: {timestamp: 1656918141945}
41248:DEBUG 2022-07-04 07:02:29,841 [shard 1] storage-gc - disk_log_impl.cc:244 - [{kafka/topic-khoyjdamnd/0}] gc[time_based_retention] requested to remove segments up to -9223372036854775808 offset
41251:TRACE 2022-07-04 07:02:40,168 [shard 1] storage-gc - disk_log_impl.cc:542 - [{kafka/topic-khoyjdamnd/0}] houskeeping with configuration from manager: {evicition_time:{timestamp: 1656313360168}, max_bytes:18446744073709551615, should_sanitize:false}
....
41258:TRACE 2022-07-04 07:02:51,998 [shard 1] storage-gc - disk_log_impl.cc:542 - [{kafka/topic-khoyjdamnd/0}] houskeeping with configuration from manager: {evicition_time:{timestamp: 1656313371998}, max_bytes:18446744073709551615, should_sanitize:false}
41259:TRACE 2022-07-04 07:02:51,998 [shard 1] storage-gc - disk_log_impl.cc:571 - [{kafka/topic-khoyjdamnd/0}] applying 'deletion' log cleanup policy with config: {evicition_time:{timestamp: 1656313371998}, max_bytes:18446744073709551615, should_sanitize:false}
.....
41318:DEBUG 2022-07-04 07:04:34,821 [shard 1] storage-gc - disk_log_impl.cc:244 - [{kafka/topic-khoyjdamnd/0}] gc[time_based_retention] requested to remove segments up to -9223372036854775808 offset
41336:DEBUG 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:114 - closing log {offsets:{start_offset:0, committed_offset:11949, committed_offset_term:1, dirty_offset:11949, dirty_offset_term:1, last_term_start_offset:0}, max_collectible_offset: {-9223372036854775808}, closed:0, config:{ntp: {kafka/topic-khoyjdamnd/0}, base_dir: /var/lib/redpanda/data, overrides: {compaction_strategy: {nullopt}, cleanup_policy_bitflags: {delete}, segment_size: {nullopt}, retention_bytes: {}, retention_time_ms: {}, recovery_enabled: false}, revision: 32, initial_revision: 32}, logs:{size: 1, [{offset_tracker:{term:1, base_offset:0, committed_offset:11949, dirty_offset:11949}, compacted_segment=0, finished_self_compaction=0, reader={/var/lib/redpanda/data/kafka/topic-khoyjdamnd/0_32/0-1-v1.log, (199838 bytes)}, writer={no_of_chunks:64, closed:0, fallocation_offset:33554432, committed_offset:199838, bytes_flush_pending:0}, cache={cache_size=400}, compaction_index:nullopt, closed=0, tombstone=0, index={file:/var/lib/redpanda/data/kafka/topic-khoyjdamnd/0_32/0-1-v1.base_index, offsets:{0}, index:{header_bitflags:0, base_offset:{0}, max_offset:{11949}, base_timestamp:{timestamp: 1656918114419}, max_timestamp:{timestamp: 1656918141945}, index(6,6,6)}, step:32768, needs_persistence:1}}]}}
41337:TRACE 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:122 - waiting for {kafka/topic-khoyjdamnd/0} compaction to finish
41338:TRACE 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:124 - stopping {kafka/topic-khoyjdamnd/0} readers cache
It's just one ntp:
grep "closing log" redpanda.log | cut -d '{' -f1
DEBUG 2022-07-04 07:02:25,256 [shard 0] storage - disk_log_impl.cc:114 - closing log
DEBUG 2022-07-04 07:02:25,257 [shard 2] storage - disk_log_impl.cc:114 - closing log
DEBUG 2022-07-04 07:02:25,258 [shard 0] storage - disk_log_impl.cc:114 - closing log
...........
DEBUG 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:114 - closing log ntp ==> kafka/topic-khoyjdamnd/0
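The storage logs above show close() on this partition only proceeding once the in-flight housekeeping/compaction pass had finished. A minimal, generic sketch of that wait pattern (plain standard C++, illustrative names only; this is not Redpanda's actual storage code):

// Sketch: close() cannot complete until the in-flight housekeeping pass
// finishes, even though a stop was already requested.
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

struct sketch_log {
    std::atomic<bool> stop_requested{false};
    std::thread housekeeping_thread;

    void start_housekeeping() {
        housekeeping_thread = std::thread([] {
            // One housekeeping pass. In this sketch nothing checks
            // stop_requested mid-pass, so a long pass delays close().
            std::this_thread::sleep_for(std::chrono::seconds(3));
            std::cout << "housekeeping pass finished\n";
        });
    }

    void close() {
        stop_requested = true;
        std::cout << "waiting for compaction to finish\n";
        housekeeping_thread.join(); // blocks until the pass completes
        std::cout << "closing log\n";
    }
};

int main() {
    sketch_log log;
    log.start_housekeeping();
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    log.close(); // observed delay is roughly the remaining housekeeping time
}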
Consensus took a lot longer to shut down (by then the leader had already moved to a different node):
INFO 2022-07-04 07:02:25,255 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:226 - Stopping
DEBUG 2022-07-04 07:02:25,256 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] replicate_entries_stm.cc:357 - Replication of entries with last offset: 11949 aborted - shutting down
INFO 2022-07-04 07:02:26,811 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:526 - Node {id: {1}, revision: {32}} recovery cancelled (rpc::errc::client_request_timeout)
TRACE 2022-07-04 07:04:41,217 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2432 - Leader commit index updated 11949
WARN 2022-07-04 07:04:41,219 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] replicate_batcher.cc:74 - replicate flush failed - seastar::gate_closed_exception (gate closed)
WARN 2022-07-04 07:04:41,219 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] replicate_batcher.cc:74 - replicate flush failed - seastar::gate_closed_exception (gate closed)
INFO 2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:174 - [heartbeats_majority] Stepping down as leader in term 1, dirty offset 11949
DEBUG 2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2480 - triggering leadership notification with term: 1, new leader: {nullopt}
The timestamps suggest the delayed shutdown is most likely in one of the following two stops (the event manager shutdown looks trivial to me):
ss::future<> consensus::stop() {
    vlog(_ctxlog.info, "Stopping");
    shutdown_input();
    for (auto& idx : _fstats) {
        idx.second.follower_state_change.broken();
    }
    return _event_manager.stop()
      .then([this] { return _append_requests_buffer.stop(); })  <===
      .then([this] { return _batcher.stop(); })                 <===
      .then([this] { return _bg.close(); })
      .then([this] {
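To make the concern with the two marked stops concrete, here is a hedged sketch (plain standard C++ with std::future, illustrative names only; not the actual Seastar/raft code): each stop() in the sequential chain waits for that component's in-flight work, so the overall shutdown is gated by whichever component cannot drain promptly.

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

// Pretend component whose stop() must wait for an in-flight flush.
struct sketch_batcher {
    std::future<void> in_flight_flush;

    void start_flush(std::chrono::seconds flush_time) {
        in_flight_flush = std::async(std::launch::async, [flush_time] {
            std::this_thread::sleep_for(flush_time);
        });
    }

    void stop() {
        if (in_flight_flush.valid()) {
            in_flight_flush.wait(); // shutdown blocks here until the flush ends
        }
    }
};

int main() {
    sketch_batcher append_requests_buffer;
    sketch_batcher batcher;
    append_requests_buffer.start_flush(std::chrono::seconds(1));
    batcher.start_flush(std::chrono::seconds(2));

    auto begin = std::chrono::steady_clock::now();
    // Sequential stop chain, mirroring the .then(...) ordering above.
    append_requests_buffer.stop();
    batcher.stop();
    auto took = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - begin);
    std::cout << "shutdown took " << took.count() << " ms\n";
}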
The following log line and its timestamp suggest that things were probably stuck in a flush, some interplay between the append_entries_buffer and consensus shutdown, though the exact mechanism is not obvious to me.
INFO 2022-07-04 07:04:41,218 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] gate-closed, waiting to finish background requests
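One way such a flush could stall is sketched below, under the same caveats (plain standard C++, illustrative names only, not the actual implementation): the flush waits on a completion signal that only the normal request path delivers, but that path has already been disabled by shutdown_input(), so stop() sits in the flush until some later, unrelated event finally signals it.

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

int main() {
    std::promise<void> flush_done;           // set by the normal request path
    std::future<void> waiter = flush_done.get_future();

    const bool input_shut_down = true;       // shutdown_input() already ran
    if (!input_shut_down) {
        flush_done.set_value();              // would normally complete promptly
    }

    // A late, unrelated event (standing in for the leader step-down seen at
    // 07:04:41) is the only thing left that completes the flush.
    std::thread late_event([&flush_done] {
        std::this_thread::sleep_for(std::chrono::seconds(3));
        flush_done.set_value();
    });

    auto begin = std::chrono::steady_clock::now();
    waiter.wait();                           // stop() is effectively stuck here
    auto stuck_for = std::chrono::duration_cast<std::chrono::seconds>(
      std::chrono::steady_clock::now() - begin);
    std::cout << "flush unblocked after " << stuck_for.count() << " s\n";
    late_event.join();
}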
Finally it realizes it is no longer the leader and steps down. Something seems off here; this needs a closer look.
TRACE 2022-07-04 07:04:41,217 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2432 - Leader commit index updated 11949
INFO 2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:174 - [heartbeats_majority] Stepping down as leader in term 1, dirty offset 11949
DEBUG 2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2480 - triggering leadership notification with term: 1, new leader: {nullopt}
Seen again here:
- FAIL test: ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=False.cpus=1 (1/24 runs)
- failure at 2022-08-17T10:53:32.899Z: TimeoutError('Redpanda node docker-rp-5 failed to stop in 60 seconds')
- in job https://buildkite.com/redpanda/redpanda/builds/14253#0182aaf0-b6fb-4499-a1df-8288ad33b894
Stack trace
====================================================================================================
test_id: rptest.tests.consumer_offsets_migration_test.ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=False.cpus=1
status: FAIL
run time: 3 minutes 40.696 seconds
TimeoutError('Redpanda node docker-rp-5 failed to stop in 60 seconds')
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/consumer_offsets_migration_test.py", line 164, in test_migrating_consume_offsets
self.redpanda.restart_nodes(n, stop_timeout=60)
File "/root/tests/rptest/services/redpanda.py", line 1544, in restart_nodes
self.stop_node(node, timeout=stop_timeout)
File "/root/tests/rptest/services/redpanda.py", line 1347, in stop_node
wait_until(
File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 58, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Redpanda node docker-rp-5 failed to stop in 60 seconds
Another instance of this: https://ci-artifacts.dev.vectorized.cloud/redpanda/0182cbda-ffdd-4339-90a7-ef336ed22f9c/vbuild/ducktape/results/2022-08-23--001/ConsumerOffsetsMigrationTest/test_migrating_consume_offsets/failures=False.cpus=1/47/
A potential fix for this is ready. Can someone review and merge, please? Thanks.
Looks like this happens on v22.2.x as well: https://buildkite.com/redpanda/redpanda/builds/16044#01839d62-8fd9-420d-ad3c-2e3f7edb4af9