incubator-pegasus

Potential replica may not be closed when an app is dropped

Open ZhongChaoqiang opened this issue 3 years ago • 1 comments

Bug Report

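When a table is dropped while some of its partitions are in the PS_POTENTIAL_SECONDARY state, there is a chance that the replica server is left with replicas in the potential state that can never be closed.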

Version: 2.0.0

Below is the replica information queried via remote_command after the app (appid=3) was dropped:

D2021-09-17 06:40:44.606 (1631832044606185642 4017) replica.rep_long5.04010000000000b5: replica_stub.cpp:1707:on_gc(): start to garbage collection, replica_count = 4
D2021-09-17 06:40:44.606 (1631832044606199774 4017) replica.rep_long5.04010000000000b5: replica_stub.cpp:1746:on_gc(): gc_shared: gc condition for 3.97@xxxxxxxxx:54801, status = replication::partition_status::PS_POTENTIAL_SECONDARY, garbage_max_decree = 37803, last_durable_decree= 37804, plog_max_commit_on_disk = 37803
D2021-09-17 06:40:44.606 (1631832044606206186 4017) replica.rep_long5.04010000000000b5: replica_stub.cpp:1746:on_gc(): gc_shared: gc condition for 3.13@xxxxxxxxx:54801, status = replication::partition_status::PS_POTENTIAL_SECONDARY, garbage_max_decree = 37894, last_durable_decree= 37895, plog_max_commit_on_disk = 37894
D2021-09-17 06:40:44.606 (1631832044606229313 4017) replica.rep_long5.04010000000000b5: replica_stub.cpp:1746:on_gc(): gc_shared: gc condition for 3.49@xxxxxxxxx:54801, status = replication::partition_status::PS_POTENTIAL_SECONDARY, garbage_max_decree = 37838, last_durable_decree= 37839, plog_max_commit_on_disk = 37838
D2021-09-17 06:40:44.606 (1631832044606232808 4017) replica.rep_long5.04010000000000b5: replica_stub.cpp:1746:on_gc(): gc_shared: gc condition for 3.61@xxxxxxxxx:54801, status = replication::partition_status::PS_POTENTIAL_SECONDARY, garbage_max_decree = 37882, last_durable_decree= 37883, plog_max_commit_on_disk = 37882
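
Reading the four gc_shared lines above, garbage_max_decree looks like the smaller of last_durable_decree and plog_max_commit_on_disk for each lingering replica. This is only inferred from these logged values and has not been checked against replica_stub.cpp. The sketch below encodes that reading; `gc_condition`, `logged_conditions` and `matches_min_rule` are hypothetical names used for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record holding the fields printed by one gc_shared log line.
struct gc_condition {
    int partition_index;
    int64_t garbage_max_decree;
    int64_t last_durable_decree;
    int64_t plog_max_commit_on_disk;
};

// Values copied from the four log lines for app 3 (partitions 97, 13, 49, 61).
inline std::vector<gc_condition> logged_conditions() {
    return {
        {97, 37803, 37804, 37803},
        {13, 37894, 37895, 37894},
        {49, 37838, 37839, 37838},
        {61, 37882, 37883, 37882},
    };
}

// Assumed relationship, inferred from the logs only:
// garbage_max_decree == min(last_durable_decree, plog_max_commit_on_disk).
inline bool matches_min_rule(const gc_condition& c) {
    return c.garbage_max_decree ==
           std::min(c.last_durable_decree, c.plog_max_commit_on_disk);
}
```

If this reading is right, each of these replicas still takes part in the shared-log GC decision even though app 3 has already been dropped.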

ZhongChaoqiang, Sep 18 '21 01:09

Preliminary analysis: the problem is triggered if the table is dropped after a replica in the potential state has finished learning but before its status has switched to secondary:

D2021-09-16 20:21:42.890 (1631794902890873823 3fba) replica.replica10.0404000a00000ca5: replica_learn.cpp:1430:on_learn_completion_notification_reply(): 3.13@xxxxxxxxx:54801: on_learn_completion_notification_reply[0000000c00000002]: learnee = xxxxxxxxx:54801, learn_duration = 2358 ms, response_err = ERR_OK

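After the app is dropped, the replica server syncs its replica information with meta. Because on_node_query_reply_scatter2 does not remove replicas in the potential state, these replicas stay on the server indefinitely. This may in turn prevent slog from ever being garbage collected.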
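
The behavior described here can be sketched with a toy model. This is not the code in replica_stub.cpp: `replica_info`, `sync_with_meta` and the reduced `partition_status` enum are hypothetical stand-ins. The model also assumes that replicas in any other state are removed once meta no longer knows their app.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Reduced, hypothetical stand-in for replication::partition_status.
enum class partition_status {
    PS_INACTIVE,
    PS_PRIMARY,
    PS_SECONDARY,
    PS_POTENTIAL_SECONDARY
};

// Hypothetical summary of one replica held by a replica server.
struct replica_info {
    int app_id;
    int partition_index;
    partition_status status;
};

// Toy model of the cleanup done when syncing with meta: replicas whose app
// no longer exists on meta are removed, except those in the potential state,
// which are skipped (the behavior reported in this issue).
inline void sync_with_meta(std::vector<replica_info>& local,
                           const std::set<int>& apps_on_meta) {
    local.erase(
        std::remove_if(local.begin(), local.end(),
                       [&](const replica_info& r) {
                           const bool unknown_to_meta =
                               apps_on_meta.count(r.app_id) == 0;
                           const bool skipped =
                               r.status == partition_status::PS_POTENTIAL_SECONDARY;
                           return unknown_to_meta && !skipped;
                       }),
        local.end());
}
```

In this model, dropping app 3 and then syncing removes its secondary replicas but leaves every PS_POTENTIAL_SECONDARY replica of app 3 in place, which matches the four leftover replicas seen in the GC log.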

ZhongChaoqiang, Sep 18 '21 02:09