incubator-uniffle
Introduce rejection mechanism when coordinator server is starting
Background
When a coordinator's configuration is changed and the coordinator is restarted, it starts accepting client getAssignment requests immediately. However, it serves those requests based on only the partially registered shuffle servers, so some jobs may be assigned fewer shuffle servers than they require and run more slowly.
I think the coordinator should wait for more than one shuffle-server heartbeat interval before serving clients. While it is out of service, client requests will fall back to the standby coordinator.
Besides, I think this rejection mechanism could be enabled via the coordinator configuration.
PTAL @jerqi
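To make the idea concrete, here is a minimal sketch of the proposed check, assuming a simple time-based guard inside the coordinator; all names here are hypothetical and not part of the existing Uniffle code:

```java
// Minimal sketch of the proposed startup rejection (hypothetical names, not the
// actual Uniffle API): while the coordinator has been up for less than one
// shuffle-server heartbeat interval, getAssignment requests are refused so the
// client falls back to the other coordinator.
public final class StartupGuard {
  private final long startTimeMs = System.currentTimeMillis();
  private final boolean rejectionEnabled;  // toggled by coordinator conf
  private final long silentPeriodMs;       // >= one heartbeat interval, e.g. 10s

  public StartupGuard(boolean rejectionEnabled, long silentPeriodMs) {
    this.rejectionEnabled = rejectionEnabled;
    this.silentPeriodMs = silentPeriodMs;
  }

  /** True while the coordinator should still refuse getAssignment requests. */
  public boolean inStartupSilentPeriod() {
    return rejectionEnabled
        && System.currentTimeMillis() - startTimeMs < silentPeriodMs;
  }
}
```

In the assignment handler, the coordinator would return a non-success status while inStartupSilentPeriod() is true, and the client would retry against the other coordinator.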
How do we judge whether to get enough shuffle servers?
Wait until one shuffle-server heartbeat interval has passed (default 10s).
Maybe one heartbeat interval is not enough; in some special cases no servers may register within it. How does the YARN ResourceManager handle this problem? I suggest that we pend the requests instead of rejecting them when the coordinator starts.
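For comparison, a rough sketch of what pending the requests (rather than rejecting them) could look like; the names are hypothetical and only illustrate the trade-off:

```java
import java.util.function.BooleanSupplier;

// Rough sketch of the "pend instead of reject" alternative (hypothetical names):
// the handler holds the request open until the startup silent period ends, then
// serves it, instead of returning an error for the client to retry elsewhere.
public final class PendingAssignmentHandler {
  private final BooleanSupplier inSilentPeriod; // e.g. a time-based startup check

  public PendingAssignmentHandler(BooleanSupplier inSilentPeriod) {
    this.inSilentPeriod = inSilentPeriod;
  }

  public void handleGetAssignment(Runnable computeAssignment) throws InterruptedException {
    while (inSilentPeriod.getAsBoolean()) {
      Thread.sleep(500); // the client RPC stays open; this is what can slow apps down
    }
    computeAssignment.run();
  }
}
```

The caller's RPC stays open while the loop waits, which is the latency cost discussed below.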
Got your thought.
How does the YARN ResourceManager handle this problem?
With HA ResourceManagers there is no such problem, because requests fail over to the standby RM via ZooKeeper. Let's consider a single ResourceManager, or the Hadoop NameNode, instead. As far as I know, the NameNode enters safe mode on startup and only exits once enough block reports from DataNodes have been received. Refer to: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
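Following that analogy, a coordinator safe mode could also exit based on how many shuffle servers have re-registered rather than on elapsed time alone; a rough sketch, with all names and the 90% threshold purely hypothetical:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: exit safe mode once "enough" shuffle servers are back,
// similar to how the NameNode waits for enough block reports.
public final class SafeModeExitCriterion {
  private final Set<String> registeredServers = ConcurrentHashMap.newKeySet();
  private final int expectedServers; // e.g. the last known cluster size

  public SafeModeExitCriterion(int expectedServers) {
    this.expectedServers = expectedServers;
  }

  public void onServerHeartbeat(String serverId) {
    registeredServers.add(serverId);
  }

  /** Exit safe mode once, say, 90% of the expected shuffle servers have re-registered. */
  public boolean canExitSafeMode() {
    return registeredServers.size() >= expectedServers * 0.9;
  }
}
```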
I suggest that we pend the requests instead of rejecting them when the coordinator starts.
Pending will slow down the apps. I think we should make the request fall back to another coordinator. Maybe waiting one heartbeat interval at startup is a good trade-off; it can serve as the indicator for the coordinator to exit safe mode.
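The fallback path could look roughly like this on the client side; this is a hedged sketch with hypothetical names, not the actual Uniffle client:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Sketch of client-side fallback: when one coordinator rejects the request
// because it is still in its startup silent period, the client simply tries
// the next coordinator in the quorum instead of waiting.
public final class CoordinatorFallbackClient {
  public static <T> Optional<T> requestWithFallback(
      List<String> coordinatorAddresses, Function<String, Optional<T>> call) {
    for (String address : coordinatorAddresses) {
      Optional<T> result = call.apply(address); // empty if this coordinator rejected us
      if (result.isPresent()) {
        return result;
      }
    }
    return Optional.empty(); // every coordinator rejected or was unreachable
  }
}
```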
That means we shouldn't restart the two coordinators within a short window. It's a little difficult for a K8s controller to pick a proper interval between restarting them.
Is each coordinator process currently maintained in its own Pod, or in a shared StatefulSet? If each has its own Pod, this is not a problem. If they share a StatefulSet, maybe the operator should kill the Pods managed by the StatefulSet one by one.
@wangao1236 Could you help give some background?
The coordinators use two Deployments.
I think this design is not friendly to automated deployment. That's my doubt.
The coordinator Deployments can be started under the operator's control, so this isn't a problem. Right?
It's OK for our K8s operator, but if other users rely on a different automated deployment mechanism, it may cause problems.
Got it. Maybe we could disable this mechanism by default and add some docs to describe it in more detail.
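For illustration, the opt-in toggle might be exposed through configuration keys along these lines; the key names and defaults are hypothetical, not necessarily what the final implementation uses:

```java
// Illustrative only: hypothetical configuration keys showing the mechanism
// disabled by default, so existing automated deployments keep their current
// behaviour unless they explicitly opt in.
public final class CoordinatorStartupConfKeys {
  public static final String STARTUP_SILENT_PERIOD_ENABLED =
      "rss.coordinator.startup-silent-period.enabled";   // default: false
  public static final String STARTUP_SILENT_PERIOD_DURATION =
      "rss.coordinator.startup-silent-period.duration";  // default: one heartbeat interval, e.g. 10s

  private CoordinatorStartupConfKeys() {}
}
```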
Any ideas on this? @jerqi
It's OK for me if we disable this mechanism by default. Is it a safe mode for the coordinator?
Naming is difficult. Safe mode/ Recovery mode?
Solved in https://github.com/apache/incubator-uniffle/pull/247. Closing this issue.