Shutdown hang in `ConsumerOffsetsMigrationTest` `test_migrating_consume_offsets` `failures=True.cpus=3`
- FAIL test: ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=True.cpus=3 (1/72 runs)
- failure at 2022-07-04T08:05:02.564Z: TimeoutError('Redpanda node docker-rp-2 failed to stop in 60 seconds')
- in job https://buildkite.com/redpanda/redpanda/builds/12033#0181c7df-a02c-4d55-a971-399a1a1e571b
A previous issue, https://github.com/redpanda-data/redpanda/issues/4670, reported a "failed to stop" in this test that was related to the failure injection, but the logs in this failure look like a genuine shutdown hang in redpanda -- we can see the node receive the signal to stop, yet remain alive for some time. Compaction is in progress at the time it is signaled to stop.
I did some initial log analysis; it looks like an issue in the cleanup of consensus state.
Sequence of events:
- The shutdown request came in at 2022-07-04 07:02:24, but one log was not shutting down: we can see it staying active with housekeeping activity well beyond the 60 s test timeout, finally stopping at 2022-07-04 07:04:41.
41076:INFO 2022-07-04 07:02:24,633 [shard 0] redpanda::main - application.cc:247 - Stopping...
.....
41160:TRACE 2022-07-04 07:02:25,256 [shard 0] storage - disk_log_impl.cc:122 - waiting for {kafka_internal/group/6} compaction to finish
.....
41245:TRACE 2022-07-04 07:02:29,840 [shard 1] storage-gc - disk_log_impl.cc:542 - [{kafka/topic-khoyjdamnd/0}] houskeeping with configuration from manager: {evicition_time:{timestamp: 1656313349840}, max_bytes:18446744073709551615, should_sanitize:false}
41247:DEBUG 2022-07-04 07:02:29,841 [shard 1] storage-gc - disk_log_impl.cc:290 - [{kafka/topic-khoyjdamnd/0}] time retention timestamp: {timestamp: 1656313349840}, first segment max timestamp: {timestamp: 1656918141945}
41248:DEBUG 2022-07-04 07:02:29,841 [shard 1] storage-gc - disk_log_impl.cc:244 - [{kafka/topic-khoyjdamnd/0}] gc[time_based_retention] requested to remove segments up to -9223372036854775808 offset
41251:TRACE 2022-07-04 07:02:40,168 [shard 1] storage-gc - disk_log_impl.cc:542 - [{kafka/topic-khoyjdamnd/0}] houskeeping with configuration from manager: {evicition_time:{timestamp: 1656313360168}, max_bytes:18446744073709551615, should_sanitize:false}
....
41258:TRACE 2022-07-04 07:02:51,998 [shard 1] storage-gc - disk_log_impl.cc:542 - [{kafka/topic-khoyjdamnd/0}] houskeeping with configuration from manager: {evicition_time:{timestamp: 1656313371998}, max_bytes:18446744073709551615, should_sanitize:false}
41259:TRACE 2022-07-04 07:02:51,998 [shard 1] storage-gc - disk_log_impl.cc:571 - [{kafka/topic-khoyjdamnd/0}] applying 'deletion' log cleanup policy with config: {evicition_time:{timestamp: 1656313371998}, max_bytes:18446744073709551615, should_sanitize:false}
.....
41318:DEBUG 2022-07-04 07:04:34,821 [shard 1] storage-gc - disk_log_impl.cc:244 - [{kafka/topic-khoyjdamnd/0}] gc[time_based_retention] requested to remove segments up to -9223372036854775808 offset
41336:DEBUG 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:114 - closing log {offsets:{start_offset:0, committed_offset:11949, committed_offset_term:1, dirty_offset:11949, dirty_offset_term:1, last_term_start_offset:0}, max_collectible_offset: {-9223372036854775808}, closed:0, config:{ntp: {kafka/topic-khoyjdamnd/0}, base_dir: /var/lib/redpanda/data, overrides: {compaction_strategy: {nullopt}, cleanup_policy_bitflags: {delete}, segment_size: {nullopt}, retention_bytes: {}, retention_time_ms: {}, recovery_enabled: false}, revision: 32, initial_revision: 32}, logs:{size: 1, [{offset_tracker:{term:1, base_offset:0, committed_offset:11949, dirty_offset:11949}, compacted_segment=0, finished_self_compaction=0, reader={/var/lib/redpanda/data/kafka/topic-khoyjdamnd/0_32/0-1-v1.log, (199838 bytes)}, writer={no_of_chunks:64, closed:0, fallocation_offset:33554432, committed_offset:199838, bytes_flush_pending:0}, cache={cache_size=400}, compaction_index:nullopt, closed=0, tombstone=0, index={file:/var/lib/redpanda/data/kafka/topic-khoyjdamnd/0_32/0-1-v1.base_index, offsets:{0}, index:{header_bitflags:0, base_offset:{0}, max_offset:{11949}, base_timestamp:{timestamp: 1656918114419}, max_timestamp:{timestamp: 1656918141945}, index(6,6,6)}, step:32768, needs_persistence:1}}]}}
41337:TRACE 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:122 - waiting for {kafka/topic-khoyjdamnd/0} compaction to finish
41338:TRACE 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:124 - stopping {kafka/topic-khoyjdamnd/0} readers cache
It's just one ntp:
grep "closing log" redpanda.log | cut -d '{' -f1
DEBUG 2022-07-04 07:02:25,256 [shard 0] storage - disk_log_impl.cc:114 - closing log
DEBUG 2022-07-04 07:02:25,257 [shard 2] storage - disk_log_impl.cc:114 - closing log
DEBUG 2022-07-04 07:02:25,258 [shard 0] storage - disk_log_impl.cc:114 - closing log
...........
DEBUG 2022-07-04 07:04:41,226 [shard 1] storage - disk_log_impl.cc:114 - closing log ntp ==> kafka/topic-khoyjdamnd/0
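The storage logs above show close() on this partition only proceeding once the in-flight housekeeping/compaction pass had finished. A minimal, generic sketch of that wait pattern (plain standard C++, illustrative names only; this is not Redpanda's actual storage code):

// Sketch: close() cannot complete until the in-flight housekeeping pass
// finishes, even though a stop was already requested.
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

struct sketch_log {
    std::atomic<bool> stop_requested{false};
    std::thread housekeeping_thread;

    void start_housekeeping() {
        housekeeping_thread = std::thread([] {
            // One housekeeping pass. In this sketch nothing checks
            // stop_requested mid-pass, so a long pass delays close().
            std::this_thread::sleep_for(std::chrono::seconds(3));
            std::cout << "housekeeping pass finished\n";
        });
    }

    void close() {
        stop_requested = true;
        std::cout << "waiting for compaction to finish\n";
        housekeeping_thread.join(); // blocks until the pass completes
        std::cout << "closing log\n";
    }
};

int main() {
    sketch_log log;
    log.start_housekeeping();
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    log.close(); // observed delay is roughly the remaining housekeeping time
}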
Consensus took a lot longer to shut down (by then the leader had already moved to a different node):
INFO 2022-07-04 07:02:25,255 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:226 - Stopping
DEBUG 2022-07-04 07:02:25,256 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] replicate_entries_stm.cc:357 - Replication of entries with last offset: 11949 aborted - shutting down
INFO 2022-07-04 07:02:26,811 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:526 - Node {id: {1}, revision: {32}} recovery cancelled (rpc::errc::client_request_timeout)
TRACE 2022-07-04 07:04:41,217 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2432 - Leader commit index updated 11949
WARN 2022-07-04 07:04:41,219 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] replicate_batcher.cc:74 - replicate flush failed - seastar::gate_closed_exception (gate closed)
WARN 2022-07-04 07:04:41,219 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] replicate_batcher.cc:74 - replicate flush failed - seastar::gate_closed_exception (gate closed)
INFO 2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:174 - [heartbeats_majority] Stepping down as leader in term 1, dirty offset 11949
DEBUG 2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2480 - triggering leadership notification with term: 1, new leader: {nullopt}
The timestamps suggest the delayed shutdown is most likely in one of the following two stops (the event manager shutdown looks trivial to me):
ss::future<> consensus::stop() {
    vlog(_ctxlog.info, "Stopping");
    shutdown_input();
    for (auto& idx : _fstats) {
        idx.second.follower_state_change.broken();
    }
    return _event_manager.stop()
      .then([this] { return _append_requests_buffer.stop(); })  <===
      .then([this] { return _batcher.stop(); })                 <===
      .then([this] { return _bg.close(); })
      .then([this] {
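To make the concern with the two marked stops concrete, here is a hedged sketch (plain standard C++ with std::future, illustrative names only; not the actual Seastar/raft code): each stop() in the sequential chain waits for that component's in-flight work, so the overall shutdown is gated by whichever component cannot drain promptly.

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

// Pretend component whose stop() must wait for an in-flight flush.
struct sketch_batcher {
    std::future<void> in_flight_flush;

    void start_flush(std::chrono::seconds flush_time) {
        in_flight_flush = std::async(std::launch::async, [flush_time] {
            std::this_thread::sleep_for(flush_time);
        });
    }

    void stop() {
        if (in_flight_flush.valid()) {
            in_flight_flush.wait(); // shutdown blocks here until the flush ends
        }
    }
};

int main() {
    sketch_batcher append_requests_buffer;
    sketch_batcher batcher;
    append_requests_buffer.start_flush(std::chrono::seconds(1));
    batcher.start_flush(std::chrono::seconds(2));

    auto begin = std::chrono::steady_clock::now();
    // Sequential stop chain, mirroring the .then(...) ordering above.
    append_requests_buffer.stop();
    batcher.stop();
    auto took = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - begin);
    std::cout << "shutdown took " << took.count() << " ms\n";
}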
The following log line and its timestamp suggest that things were probably stuck in a flush, some interplay between the append_entries_buffer and consensus shutdown, though the exact mechanism is not obvious to me.
INFO 2022-07-04 07:04:41,218 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] gate-closed, waiting to finish background requests
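One way such a flush could stall is sketched below, under the same caveats (plain standard C++, illustrative names only, not the actual implementation): the flush waits on a completion signal that only the normal request path delivers, but that path has already been disabled by shutdown_input(), so stop() sits in the flush until some later, unrelated event finally signals it.

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

int main() {
    std::promise<void> flush_done;           // set by the normal request path
    std::future<void> waiter = flush_done.get_future();

    const bool input_shut_down = true;       // shutdown_input() already ran
    if (!input_shut_down) {
        flush_done.set_value();              // would normally complete promptly
    }

    // A late, unrelated event (standing in for the leader step-down seen at
    // 07:04:41) is the only thing left that completes the flush.
    std::thread late_event([&flush_done] {
        std::this_thread::sleep_for(std::chrono::seconds(3));
        flush_done.set_value();
    });

    auto begin = std::chrono::steady_clock::now();
    waiter.wait();                           // stop() is effectively stuck here
    auto stuck_for = std::chrono::duration_cast<std::chrono::seconds>(
      std::chrono::steady_clock::now() - begin);
    std::cout << "flush unblocked after " << stuck_for.count() << " s\n";
    late_event.join();
}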
Finally it realizes it is no longer the leader and steps down. Something seems off here; this needs a closer look.
TRACE 2022-07-04 07:04:41,217 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2432 - Leader commit index updated 11949
INFO 2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:174 - [heartbeats_majority] Stepping down as leader in term 1, dirty offset 11949
DEBUG 2022-07-04 07:04:41,222 [shard 1] raft - [group_id:1, {kafka/topic-khoyjdamnd/0}] consensus.cc:2480 - triggering leadership notification with term: 1, new leader: {nullopt}
Seen again here:
- FAIL test: ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=False.cpus=1 (1/24 runs)
- failure at 2022-08-17T10:53:32.899Z: TimeoutError('Redpanda node docker-rp-5 failed to stop in 60 seconds')
- in job https://buildkite.com/redpanda/redpanda/builds/14253#0182aaf0-b6fb-4499-a1df-8288ad33b894
Stack trace
====================================================================================================
test_id: rptest.tests.consumer_offsets_migration_test.ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=False.cpus=1
status: FAIL
run time: 3 minutes 40.696 seconds
TimeoutError('Redpanda node docker-rp-5 failed to stop in 60 seconds')
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/consumer_offsets_migration_test.py", line 164, in test_migrating_consume_offsets
self.redpanda.restart_nodes(n, stop_timeout=60)
File "/root/tests/rptest/services/redpanda.py", line 1544, in restart_nodes
self.stop_node(node, timeout=stop_timeout)
File "/root/tests/rptest/services/redpanda.py", line 1347, in stop_node
wait_until(
File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 58, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Redpanda node docker-rp-5 failed to stop in 60 seconds
Another instance of this: https://ci-artifacts.dev.vectorized.cloud/redpanda/0182cbda-ffdd-4339-90a7-ef336ed22f9c/vbuild/ducktape/results/2022-08-23--001/ConsumerOffsetsMigrationTest/test_migrating_consume_offsets/failures=False.cpus=1/47/
A potential fix for this is ready. Can someone review and merge, please? Thanks.
Looks like this happens on v22.2.x as well: https://buildkite.com/redpanda/redpanda/builds/16044#01839d62-8fd9-420d-ad3c-2e3f7edb4af9