
Introduce rejection mechanism when coordinator server is starting

Open zuston opened this issue 3 years ago • 15 comments

Background

When a coordinator's conf is changed and the coordinator is restarted, it starts accepting client getAssignment requests immediately, but it serves those requests based on only the partially registered shuffle servers. As a result, some jobs get fewer shuffle servers than they require, which slows them down.

I think we should make the coordinator wait for more than one shuffle-server heartbeat interval before serving clients. While it is out of service, client requests will fall back to the slave coordinator.

Besides, I think this rejection mechanism could be enabled via the coordinator conf.
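
A minimal sketch of the gate I have in mind (all names here are hypothetical, not existing Uniffle classes):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: rejects assignment requests while the coordinator
// is still collecting shuffle-server heartbeats after a restart.
public class StartupSilentPeriodGate {
  private final Instant startTime = Instant.now();
  private final boolean enabled;        // from a coordinator conf option
  private final Duration silentPeriod;  // e.g. one heartbeat interval (10s)

  public StartupSilentPeriodGate(boolean enabled, Duration silentPeriod) {
    this.enabled = enabled;
    this.silentPeriod = silentPeriod;
  }

  /** True while getAssignment requests should be rejected. */
  public boolean shouldReject() {
    return enabled
        && Duration.between(startTime, Instant.now()).compareTo(silentPeriod) < 0;
  }
}
```

The getAssignment handler would check shouldReject() first and return an error status, so the client retries against the other coordinator.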

zuston avatar Sep 21 '22 10:09 zuston

PTAL @jerqi

zuston avatar Sep 21 '22 10:09 zuston

How do we judge whether we have enough shuffle servers?

jerqi avatar Sep 21 '22 11:09 jerqi

> How do we judge whether we have enough shuffle servers?

Wait for one shuffle-server heartbeat interval, 10s by default.

zuston avatar Sep 21 '22 12:09 zuston

> > How do we judge whether we have enough shuffle servers?
>
> Wait for one shuffle-server heartbeat interval, 10s by default.

Maybe one heartbeat interval is not enough; in a special case we may not see any servers within it. How does the YARN ResourceManager handle this problem? I suggest that we pend the requests instead of rejecting them when the coordinator starts.

jerqi avatar Sep 22 '22 02:09 jerqi

Got your thought.

> How does the YARN ResourceManager handle this problem?

With HA ResourceManagers there is no such problem, because ZooKeeper fails clients over to the standby RM. Let's consider a single ResourceManager, or the Hadoop NameNode, instead. As far as I know, the NameNode enters safe mode on startup and does not exit until it has received enough block reports from the DataNodes. Refer to: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
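
A rough sketch of that analogy for the coordinator (the threshold idea mirrors HDFS's dfs.namenode.safemode.threshold-pct; the class and fields below are made up):

```java
// Hypothetical analog of NameNode safe mode for the coordinator:
// exit once enough shuffle servers have re-registered via heartbeat.
public class CoordinatorSafeMode {
  private final int expectedServers;  // e.g. cluster size before the restart
  private final double threshold;     // fraction required, e.g. 0.9

  public CoordinatorSafeMode(int expectedServers, double threshold) {
    this.expectedServers = expectedServers;
    this.threshold = threshold;
  }

  /** True once the coordinator may start serving getAssignment requests. */
  public boolean canExit(int registeredServers) {
    return expectedServers == 0
        || registeredServers >= Math.ceil(expectedServers * threshold);
  }
}
```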

> I suggest that we pend the requests instead of rejecting them when the coordinator starts.

Pending will slow down the apps. I think we should make the requests fall back to another coordinator. Maybe waiting one heartbeat interval at startup is a good tradeoff; it can serve as the indicator for when the coordinator exits safe mode.
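
Roughly what I mean by falling back, on the client side (sketch only; the stub interface and exception are hypothetical):

```java
import java.util.List;

public class AssignmentFallback {
  // Hypothetical client stub for one coordinator.
  interface CoordinatorStub {
    String getShuffleAssignments(String appId) throws RejectedException;
  }

  static class RejectedException extends Exception {}

  // Try each coordinator in turn; one that is still in its startup
  // silent period rejects, and the client moves on to the next.
  static String getAssignmentsWithFallback(List<CoordinatorStub> quorum, String appId)
      throws RejectedException {
    RejectedException last = null;
    for (CoordinatorStub stub : quorum) {
      try {
        return stub.getShuffleAssignments(appId);
      } catch (RejectedException e) {
        last = e; // fall through to the next coordinator
      }
    }
    if (last == null) {
      throw new IllegalStateException("No coordinators configured");
    }
    throw last; // all coordinators rejected the request
  }
}
```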

zuston avatar Sep 22 '22 02:09 zuston

> Got your thought.
>
> > How does the YARN ResourceManager handle this problem?
>
> With HA ResourceManagers there is no such problem, because ZooKeeper fails clients over to the standby RM. Let's consider a single ResourceManager, or the Hadoop NameNode, instead. As far as I know, the NameNode enters safe mode on startup and does not exit until it has received enough block reports from the DataNodes. Refer to: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
>
> > I suggest that we pend the requests instead of rejecting them when the coordinator starts.
>
> Pending will slow down the apps. I think we should make the requests fall back to another coordinator. Maybe waiting one heartbeat interval at startup is a good tradeoff; it can serve as the indicator for when the coordinator exits safe mode.

It means we shouldn't restart the two coordinators within a short time of each other. It's a little difficult for the K8s controller to choose a proper interval between restarting them.

jerqi avatar Sep 22 '22 03:09 jerqi

Is each coordinator process now maintained in its own Pod, or in a shared StatefulSet? With one Pod per coordinator, it's not a problem. With a shared StatefulSet, maybe the operator should kill the Pods managed by the StatefulSet one by one.

@wangao1236 Could you help give some background knowledge?

zuston avatar Sep 22 '22 05:09 zuston

> Is each coordinator process now maintained in its own Pod, or in a shared StatefulSet? With one Pod per coordinator, it's not a problem. With a shared StatefulSet, maybe the operator should kill the Pods managed by the StatefulSet one by one.
>
> @wangao1236 Could you help give some background knowledge?

The coordinators use two Deployments.

jerqi avatar Sep 22 '22 08:09 jerqi

I think this design is not friendly to automated deployment. That's my doubt.

jerqi avatar Sep 22 '22 08:09 jerqi

> I think this design is not friendly to automated deployment. That's my doubt.

The coordinator Deployments' startup can be controlled by the operator, so this isn't a problem, right?

zuston avatar Sep 22 '22 08:09 zuston

> > I think this design is not friendly to automated deployment. That's my doubt.
>
> The coordinator Deployments' startup can be controlled by the operator, so this isn't a problem, right?

It's OK for our K8s operator. But if other users use another automated deployment mechanism, it may cause problems.

jerqi avatar Sep 22 '22 08:09 jerqi

> > > I think this design is not friendly to automated deployment. That's my doubt.
> >
> > The coordinator Deployments' startup can be controlled by the operator, so this isn't a problem, right?
>
> It's OK for our K8s operator. But if other users use another automated deployment mechanism, it may cause problems.

Got it. Maybe we could disable this mechanism by default and add some docs describing it in more detail.
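
For example, something along these lines (the conf keys and defaults here are only illustrative, not final names):

```java
import java.util.Properties;

// Hypothetical conf keys; disabled by default so existing
// deployments and other automation tooling are unaffected.
public class SilentPeriodConf {
  static final String ENABLED_KEY = "rss.coordinator.startup-silent-period.enabled";
  static final String DURATION_KEY = "rss.coordinator.startup-silent-period.duration.ms";

  static boolean enabled(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(ENABLED_KEY, "false"));
  }

  static long durationMs(Properties conf) {
    // default: roughly two shuffle-server heartbeat intervals
    return Long.parseLong(conf.getProperty(DURATION_KEY, "20000"));
  }
}
```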

zuston avatar Sep 22 '22 08:09 zuston

Any ideas on this? @jerqi

zuston avatar Sep 27 '22 00:09 zuston

> Any ideas on this? @jerqi

It's OK for me if we disable this mechanism by default. Is it a safe mode for the coordinator?

jerqi avatar Sep 27 '22 02:09 jerqi

Naming is difficult. Safe mode? Recovery mode?

zuston avatar Sep 27 '22 02:09 zuston

Solved in https://github.com/apache/incubator-uniffle/pull/247. Closing this issue.

zuston avatar Oct 11 '22 09:10 zuston