Balance data problem: the task gets stuck in the wait-for-catch-up-data stage.
Introduction
When I run balance data on a huge dataset (LDBC SF100, almost 300 GB of data), I hit the following: so that the follower can catch up with the leader's data, waitingForCatchUpData is called on storaged.
Because the storaged leader needs to copy a snapshot to the follower, this takes a long time.
So the problem is that the retry count or wait time for waitingForCatchUpData is not enough, and the balance task always fails on a huge dataset. (I think we could set a larger partition number when creating the space, but that does not make sense as a fix.)
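To illustrate why a fixed retry budget fails here, below is a minimal sketch of a polling loop with a configurable retry count and interval. It is not NebulaGraph's actual code: `CatchUpPolicy`, `CatchUpResult`, and `waitForCatchUp` are hypothetical names, and the `isCaughtUp` callback stands in for whatever check waitingForCatchUpData really performs.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical policy: how many times to poll and how long to wait in between.
struct CatchUpPolicy {
    int maxRetries;
    std::chrono::milliseconds interval;
};

enum class CatchUpResult { kCaughtUp, kTimedOut };

// Polls isCaughtUp() up to policy.maxRetries times, sleeping policy.interval
// after each unsuccessful check. If the snapshot copy needs longer than
// maxRetries * interval in total, the caller sees kTimedOut and the task fails.
inline CatchUpResult waitForCatchUp(const std::function<bool()>& isCaughtUp,
                                    const CatchUpPolicy& policy) {
    for (int attempt = 0; attempt < policy.maxRetries; ++attempt) {
        if (isCaughtUp()) {
            return CatchUpResult::kCaughtUp;
        }
        std::this_thread::sleep_for(policy.interval);
    }
    return CatchUpResult::kTimedOut;
}
```

With a follower that only reports success on the fifth check, a policy of 3 retries times out while a policy of 10 retries succeeds. This suggests the budget would need to scale with the snapshot size, or be configurable, rather than stay fixed.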
Rough solution: maybe we can change the storage RPC timeout. But at the same time, I found a problem with ThriftClientManager in the storage client.
(ThriftClientManager caches its clients, so if we hot-update FLAGS_storage_client_timeout_ms, the new value does not take effect on the cached clients, because the RocketClientChannel's timeout was already set before the change.)
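The cache issue can be sketched with a small stand-alone model. Everything here is an assumption for illustration: `FakeChannel` stands in for the real channel, `CachingClientManager` for ThriftClientManager, and `gTimeoutMs` for a hot-updatable flag such as FLAGS_storage_client_timeout_ms. The `refreshOnHit` switch shows one possible fix (re-applying the current timeout on every cache hit), not necessarily the one the project should adopt.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Stand-in for a flag that can be changed at runtime (assumption).
static int32_t gTimeoutMs = 60000;

// Hypothetical channel: remembers the timeout it was given.
class FakeChannel {
public:
    explicit FakeChannel(int32_t timeoutMs) : timeoutMs_(timeoutMs) {}
    void setTimeout(int32_t ms) { timeoutMs_ = ms; }
    int32_t timeout() const { return timeoutMs_; }

private:
    int32_t timeoutMs_;
};

// Hypothetical manager that caches one channel per host.
class CachingClientManager {
public:
    explicit CachingClientManager(bool refreshOnHit) : refreshOnHit_(refreshOnHit) {}

    std::shared_ptr<FakeChannel> client(const std::string& host) {
        auto it = clients_.find(host);
        if (it != clients_.end()) {
            // Without this refresh, a cached channel keeps the timeout it was
            // created with, even after the flag has changed.
            if (refreshOnHit_) {
                it->second->setTimeout(gTimeoutMs);
            }
            return it->second;
        }
        // Cache miss: the timeout is read from the flag only at creation time.
        auto channel = std::make_shared<FakeChannel>(gTimeoutMs);
        clients_.emplace(host, channel);
        return channel;
    }

private:
    bool refreshOnHit_;
    std::map<std::string, std::shared_ptr<FakeChannel>> clients_;
};
```

If a client for some host is created while the flag is 60000 and the flag is then raised to 600000, a manager built with `refreshOnHit = false` still hands back a channel with the old 60000 ms timeout, while one built with `refreshOnHit = true` returns 600000 ms.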
Two problems:
- Balance data gets stuck catching up data when the space holds a huge amount of data.
- The storage client timeout cannot be changed through hot config (because of the cache mechanism).
cc @Sophie-Xie