Zhong Chaoqiang
Preliminary analysis: this problem is triggered if the table is dropped after a replica in the potential state has finished learning, but before its state has switched to secondary: `D2021-09-16 20:21:42.890 (1631794902890873823 3fba) replica.replica10.0404000a00000ca5: replica_learn.cpp:1430:on_learn_completion_notification_reply(): 3.13@xxxxxxxxx:54801: on_learn_completion_notification_reply[0000000c00000002]: learnee = xxxxxxxxx:54801, learn_duration = 2358 ms, response_err = ERR_OK` After the app is dropped, when the replica server syncs information from the meta server, `on_node_query_reply_scatter2` does not remove replicas in the potential state, so these replicas stay around indefinitely. This may prevent the slog from ever being GCed.
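To make the leak concrete, here is a minimal sketch of the cleanup logic described above. All names (`partition_status`, `find_stale_replicas`, `include_potential`) are illustrative stand-ins, not Pegasus' actual config-sync code:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Simplified model of replica cleanup during config sync with the meta server.
enum class partition_status { PRIMARY, SECONDARY, POTENTIAL_SECONDARY };
using gpid = std::pair<int, int>; // (app_id, partition_index)

// local_replicas: replicas currently held by the replica server.
// valid_gpids: partitions the meta server still knows about.
// Returns the gpids that should be closed. The bug: the original cleanup
// skipped POTENTIAL_SECONDARY replicas, so after a table drop they leaked
// and kept the slog from being GCed.
std::vector<gpid> find_stale_replicas(
    const std::map<gpid, partition_status>& local_replicas,
    const std::set<gpid>& valid_gpids,
    bool include_potential /* the behavior the fix would enable */) {
    std::vector<gpid> stale;
    for (const auto& [id, status] : local_replicas) {
        if (valid_gpids.count(id))
            continue; // still owned by a live table
        if (status == partition_status::POTENTIAL_SECONDARY && !include_potential)
            continue; // old behavior: potential replicas are never removed
        stale.push_back(id);
    }
    return stale;
}
```

With `include_potential = false` (the old behavior), a potential replica of a dropped app is never reported as stale; with `include_potential = true` it is cleaned up like any other orphan.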
Here is the analysis process:

1. `last_durable_decree` is the last durably persisted point, read from the RocksDB data directory when the replica is opened, while `init_durable_decree` is read from the `.init-info` file. The `.init-info` file is normally generated during learn. Going further through the logs, we found the replica did perform a learn before the problem occurred, with the following log: `D2020-12-03 21:27:09.499 (1607002029499473085 7754) replica.rep_long3.0404000d000021bd: replica_learn.cpp:1075:on_copy_remote_state_completed(): [email protected]:34801: on_copy_remote_state_completed[000001a600000002]: learnee = 10.32.82.16:34801, learn_duration = 5 ms, apply checkpoint/log done, err = ERR_OK, last_prepared_decree = (16369 => 16371), last_committed_decree...
@zhangyifan27 Thanks for your reply! Sorry for not seeing your message earlier!

This issue is quite old. We hit it in our production cluster, and the exact operations that triggered it were not recorded at the time, but there should have been multiple restarts. Because we had set the cleanup interval for err replicas (`gc_disk_error_replica_interval_seconds`) too short, we did lose data back then (the primary's machine was taken offline, and the secondary node moved its data into the err directory when opening the replica), so users could not read the data.

There is also one key log: when opening the replica, the following was printed.

```
E2020-12-03 21:27:57.929 (1607002077929018092 3b14) replica.replica0.04010000000000d9: replication_app_base.cpp:347:open_internal(): [email protected]:34801: replica data is not complete coz last_durable_decree(16368) < init_durable_decree(16371)
E2020-12-03 21:27:57.929 (1607002077929048291 3b14) replica.replica0.04010000000000d9: replication_app_base.cpp:353:open_internal(): [email protected]:34801:...
```
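The failing condition in that log can be distilled as follows. This is a hypothetical simplification (`check_replica_data` and `open_result` are made-up names), not the real `open_internal()` implementation:

```cpp
#include <cstdint>
#include <string>

// last_durable_decree: read from the RocksDB data directory at open time.
// init_durable_decree: read from the .init-info file written during learn.
// If durable data lags behind the init point, the replica is treated as
// incomplete; combined with a very short
// gc_disk_error_replica_interval_seconds, such a replica can be moved to the
// err directory and GCed, which is how the data loss above happened.
struct open_result {
    bool ok;
    std::string reason;
};

open_result check_replica_data(int64_t last_durable_decree,
                               int64_t init_durable_decree) {
    if (last_durable_decree < init_durable_decree) {
        return {false,
                "replica data is not complete coz last_durable_decree(" +
                    std::to_string(last_durable_decree) +
                    ") < init_durable_decree(" +
                    std::to_string(init_durable_decree) + ")"};
    }
    return {true, ""};
}
```

With the values from the log, `check_replica_data(16368, 16371)` fails, matching the `replica data is not complete` error printed at open.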
@neverchanje @levy5307 Can you help to review? Thanks!
@Shuo-Jia @levy5307 Thanks for your review. In our scenario, counting data precisely is a frequent operation after we bulkload SST files. And the table has a large amount of data, so it...
> > @Shuo-Jia @levy5307
> >
> > Thanks for your review.
> >
> > In our scenario, counting data precisely is a frequent operation after we bulkload SST files. And the table has a...
> > > > @Shuo-Jia @levy5307
> > > >
> > > > Thanks for your review.
> > > >
> > > > In our scenario, counting data precisely is a frequent operation after we bulkload...
> > * If throttling by QPS, one request may contain a different amount of data, so it's hard to control the pressure.
> > * If throttling by size, it's difficult for...
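For the size-based side of that trade-off, a common approach is a token bucket charged in bytes, so a large request consumes proportionally more budget than a small one. This is a minimal sketch with illustrative names (`byte_token_bucket`, `try_admit`), not Pegasus' actual throttling code:

```cpp
#include <algorithm>
#include <cstdint>

// Token bucket metered in bytes: refills at bytes_per_sec, caps at
// burst_bytes. A request is admitted only if enough byte-tokens remain,
// so pressure tracks payload size rather than request count.
class byte_token_bucket {
public:
    byte_token_bucket(int64_t bytes_per_sec, int64_t burst_bytes)
        : rate_(bytes_per_sec), capacity_(burst_bytes), tokens_(burst_bytes) {}

    // Advance the bucket by elapsed_ms, then try to admit a request of
    // request_bytes. Returns true if admitted, false if it should be
    // delayed or rejected.
    bool try_admit(int64_t request_bytes, int64_t elapsed_ms) {
        tokens_ = std::min(capacity_, tokens_ + rate_ * elapsed_ms / 1000);
        if (request_bytes > tokens_)
            return false;
        tokens_ -= request_bytes;
        return true;
    }

private:
    int64_t rate_;     // refill rate, bytes per second
    int64_t capacity_; // maximum burst, bytes
    int64_t tokens_;   // currently available bytes
};
```

The remaining difficulty the comment points at is the other direction: choosing a sensible byte rate requires knowing the backend's sustainable throughput, which is harder to reason about than a request count.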
@neverchanje @hycdong Can you help to review the code? Thanks!
> Could you add a test showing that this works the way you expect?

@zhangyifan27 OK, I will try to add a unit test. But this is a little complicated. Can...