
Manage and schedule multiple standalone pods with a PodGroup

Open zbbkeepgoing opened this issue 1 year ago • 40 comments

To address the following situations:

  • When we submit a Spark job to Kubernetes in client mode with the Volcano scheduler, we would like all executor pods to be managed and scheduled through a PodGroup.

  • We would also like some standalone pods to be managed and scheduled through a PodGroup with Volcano.
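
For context, Volcano already lets a standalone pod join a PodGroup via the `scheduling.volcano.sh/group-name` annotation. A minimal sketch of such a pod (the pod name, image, and PodGroup name are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-exec-1                              # illustrative name
  annotations:
    scheduling.volcano.sh/group-name: spark-pg    # join this PodGroup
spec:
  schedulerName: volcano                          # let Volcano schedule the pod
  containers:
    - name: executor
      image: spark:3.3.0                          # illustrative image
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
```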

zbbkeepgoing avatar Jul 18 '22 09:07 zbbkeepgoing

Link to the API pull request: https://github.com/volcano-sh/apis/pull/83. I will update go.mod after the API PR is merged.

zbbkeepgoing avatar Jul 18 '22 09:07 zbbkeepgoing

Hey. A few questions about the background:

  • Why do you want to split the whole Spark job into separate, independent pods instead of making use of a Spark job or a Volcano Job?
  • I haven't tried this myself, but as I understand it, Volcano already supports generating a PodGroup for standalone pods. Did something go wrong in your test?

1. In Spark's client submit mode, all the executor pods are independent; if we want Volcano to manage them, they need to be grouped into a PodGroup. A large share of current Spark users also submit in client mode.
2. Yes, Volcano supports generating a PodGroup for standalone pods, but it does not support setting that PodGroup's minResources, which this PR adds. In addition, the generated PodGroup only references the first pod, so if the first pod terminates, the PodGroup is deleted automatically.

So this PR adds minResources support for independent pods and makes the PodGroup reference all of the independent pods.
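
The thread does not quote the final annotation API, so the following is only a guess at how the feature might look once merged; the `group-min-resources` key shown here is hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-exec-1                              # illustrative name
  annotations:
    scheduling.volcano.sh/group-name: spark-pg    # PodGroup to create/join
    # Hypothetical key: a minResources hint for the auto-created PodGroup.
    scheduling.volcano.sh/group-min-resources: '{"cpu": "2", "memory": "4Gi"}'
spec:
  schedulerName: volcano
  containers:
    - name: executor
      image: spark:3.3.0                          # illustrative image
```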

zbbkeepgoing avatar Aug 01 '22 02:08 zbbkeepgoing

/retest

zbbkeepgoing avatar Aug 19 '22 03:08 zbbkeepgoing

@zbbkeepgoing: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

volcano-sh-bot avatar Aug 19 '22 03:08 volcano-sh-bot

Please squash the multiple commits into a single commit.

hwdef avatar Aug 19 '22 06:08 hwdef

Please squash the multiple commits into a single commit.

done

zbbkeepgoing avatar Aug 20 '22 12:08 zbbkeepgoing

/retest

zbbkeepgoing avatar Aug 20 '22 13:08 zbbkeepgoing

@zbbkeepgoing: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

volcano-sh-bot avatar Aug 20 '22 13:08 volcano-sh-bot

Can this pull request be merged into master? @Thor-wl

zbbkeepgoing avatar Sep 21 '22 05:09 zbbkeepgoing

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: To complete the pull request process, please assign hwdef You can assign the PR to them by writing /assign @hwdef in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

volcano-sh-bot avatar Nov 07 '22 05:11 volcano-sh-bot

/retest

zbbkeepgoing avatar Nov 10 '22 05:11 zbbkeepgoing

@zbbkeepgoing: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

volcano-sh-bot avatar Nov 10 '22 05:11 volcano-sh-bot

@jiangkaihua Can this pull request be merged? We discussed this PR in the past week.

zbbkeepgoing avatar Nov 10 '22 05:11 zbbkeepgoing

@jiangkaihua Can this pull request be merged? We discussed this PR in the past week.

Could you please help fix the code check? It seems that some UT cases did not pass.

jiangkaihua avatar Nov 14 '22 06:11 jiangkaihua

@jiangkaihua Can this pull request be merged? We discussed this PR in the past week.

Could you please help fix the code check? It seems that some UT cases did not pass.

It looks fine; it should be a random test failure that has nothing to do with this PR. I've encountered it before.

zbbkeepgoing avatar Nov 16 '22 07:11 zbbkeepgoing

/cc @Yikun Can you help evaluate whether the requirement and scenario are reasonable? Thanks!

Thor-wl avatar Jan 10 '23 07:01 Thor-wl

/retest

zbbkeepgoing avatar Jan 30 '23 06:01 zbbkeepgoing

@zbbkeepgoing: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

volcano-sh-bot avatar Jan 30 '23 06:01 volcano-sh-bot

@zbbkeepgoing Thanks for the contribution!

IIUC, this PR adds the ability to auto-create a PodGroup with a minResources hint injected via pod annotation when the first pod is created.

Currently, in Spark client mode, users need to create the PodGroup externally (outside Volcano) and specify the annotation on the executor pods.

By this PR:

  • Advantages: helps users create the PodGroup via annotation inside Volcano (rather than outside Volcano).
  • Disadvantages: the annotation style (especially for complex JSON) may be a lightweight but not a good approach in the CRD world. If users want to use all Volcano PodGroup features, we would also need to support fields like minMember and priorityClassName; after all features are supported, this could present a complex user interface.

Also, after this PR, Spark cluster mode could use this ability to create a driver PodGroup and assign executors to an executor PodGroup.

So, if we can accept pod annotations as the API for creating PodGroups, then I'm +1 on this PR. Otherwise, I'm -0 (not a strong disagreement).

(Thinking aloud: if this is accepted, does it make the PodGroup CRD somewhat redundant, since annotations could do everything?)

also cc @william-wang

@william-wang Hi, leibo. Any questions or suggestions about this PR?

zbbkeepgoing avatar Feb 03 '23 07:02 zbbkeepgoing

@william-wang @Thor-wl If there is no doubt about this PR, can we merge it as soon as possible? It has been pending for a long time, and I see other users in the WeChat group who also have this requirement.

zbbkeepgoing avatar Mar 28 '23 02:03 zbbkeepgoing

/assign @william-wang

zbbkeepgoing avatar Mar 28 '23 02:03 zbbkeepgoing

I think there might be some issues here that are not very clear:

  1. If the minResources information is injected into the PodGroup through an annotation, and the minResources marked on each pod differ, how should that be handled? Should the first pod's minResources be used directly, or should the maximum minResources across all pods under the PodGroup be chosen?

  2. If a pod's minResources information changes, does Volcano need to handle it?

  3. The PodGroup's ownerReference is associated automatically. For example, when an independent pod is created, the PodGroup's owner is that pod; if multiple pods belong to one set of replicas, the owner is the replicas object. In the current scenario, multiple mutually independent pods are placed into one PodGroup, so how should the ownerReference be set? If it is set to the first pod, then once the first pod is deleted, the PodGroup will be deleted automatically, while the other pods may not have finished executing, or may not even have been scheduled yet. Is that a problem?

Hope to get more discussion, thanks!

wangyang0616 avatar Mar 29 '23 03:03 wangyang0616

I think there might be some issues here that are not very clear:

  1. If the minResources information is injected into the PodGroup through an annotation, and the minResources marked on each pod differ, how should that be handled? Should the first pod's minResources be used directly, or should the maximum minResources across all pods under the PodGroup be chosen?
  2. If a pod's minResources information changes, does Volcano need to handle it?
  3. The PodGroup's ownerReference is associated automatically. For example, when an independent pod is created, the PodGroup's owner is that pod; if multiple pods belong to one set of replicas, the owner is the replicas object. In the current scenario, multiple mutually independent pods are placed into one PodGroup, so how should the ownerReference be set? If it is set to the first pod, then once the first pod is deleted, the PodGroup will be deleted automatically, while the other pods may not have finished executing, or may not even have been scheduled yet. Is that a problem?

Hope to get more discussion, thanks!

  1. First, the minResources annotation is bound to the PodGroup name annotation; if the user does not specify a PodGroup name, the minResources annotation is not used. In addition, this is an extension capability for a special but common scenario, so unusual user behavior need not be considered. If I were a user who needed this ability, I would follow this rule.
  2. Personally, I don't think this needs to be considered for now, and the current controller logic does not handle changing the podGroupName after a PodGroup is automatically created by a single pod. In addition, I think minResources updates matter more in the scheduler: if the minResources of an Inqueue PodGroup is increased and the queue no longer has enough resources, should the PodGroup be rolled back to Pending? These are my personal views. As with the first point, this is a solution for a special but common scenario; of course, I am open to implementing minResources updates if we believe modifying minResources is a frequent and necessary requirement.
  3. In the current code, the PodGroup's ownerReferences are all of the pods that specify the PodGroup.
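
Point 3 above means the auto-created PodGroup carries one ownerReference per member pod, so Kubernetes garbage collection removes it only after every member pod is gone. A hypothetical sketch of the resulting object (names, UIDs, and values are placeholders):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: spark-pg
  ownerReferences:            # one entry per pod that specifies this PodGroup
    - apiVersion: v1
      kind: Pod
      name: spark-exec-1
      uid: <uid-of-exec-1>    # placeholder
    - apiVersion: v1
      kind: Pod
      name: spark-exec-2
      uid: <uid-of-exec-2>    # placeholder
spec:
  minMember: 1
  minResources:
    cpu: "2"
    memory: 4Gi
```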

zbbkeepgoing avatar Mar 29 '23 08:03 zbbkeepgoing

  1. First, the minResources annotation is bound to the PodGroup name annotation; if the user does not specify a PodGroup name, the minResources annotation is not used. In addition, this is an extension capability for a special but common scenario, so unusual user behavior need not be considered. If I were a user who needed this ability, I would follow this rule.
  2. Personally, I don't think this needs to be considered for now, and the current controller logic does not handle changing the podGroupName after a PodGroup is automatically created by a single pod. In addition, I think minResources updates matter more in the scheduler: if the minResources of an Inqueue PodGroup is increased and the queue no longer has enough resources, should the PodGroup be rolled back to Pending? These are my personal views. As with the first point, this is a solution for a special but common scenario; of course, I am open to implementing minResources updates if we believe modifying minResources is a frequent and necessary requirement.
  3. In the current code, the PodGroup's ownerReferences are all of the pods that specify the PodGroup.

Thanks for your reply, let me understand the PR further.

Regarding question 1, I believe relying on users' awareness to handle abnormal cases is not a good choice. We can guide users toward correct usage of the new feature through the manual or other means, but for usage that exceeds expectations, it would be better to have more comprehensive handling on the platform side.

Question 2 is really not a high priority right now.

Regarding question 3, my earlier understanding was mistaken, and I agree with your implementation approach.

wangyang0616 avatar Mar 30 '23 06:03 wangyang0616

Q1: I agree with you. Even setting the controller aside, for the reclaim and preempt actions in the scheduler, choosing the maximum minResources should be the better choice. I will update it later. Thank you.
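
To illustrate the "maximum minResources" choice with made-up numbers: when member pods carry different hints, the controller would take the element-wise maximum across resource names:

```yaml
# Pod A hint (hypothetical annotation): cpu: "2", memory: 2Gi
# Pod B hint (hypothetical annotation): cpu: "1", memory: 4Gi
# Resulting PodGroup spec, taking the maximum per resource:
spec:
  minResources:
    cpu: "2"       # max(2, 1)
    memory: 4Gi    # max(2Gi, 4Gi)
```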

zbbkeepgoing avatar Mar 30 '23 07:03 zbbkeepgoing

In addition, it would be better if the design document or user manual could be updated.

wangyang0616 avatar Mar 30 '23 07:03 wangyang0616

In addition, it would be better if the design document or user manual could be updated.

Yeah, I've added a user-guide doc. Please help review it again.

zbbkeepgoing avatar Apr 03 '23 08:04 zbbkeepgoing

/unassign @Thor-wl

zbbkeepgoing avatar Apr 03 '23 09:04 zbbkeepgoing

/assign @wangyang0616

zbbkeepgoing avatar Apr 03 '23 09:04 zbbkeepgoing

In the scenario you describe, I agree with enhancing the PodGroup's ability to manage a batch of independent pods, but I still have some concerns. minResources can be configured through annotations; will other PodGroup attributes support this method in the future? As @Yikun noted, if they are all introduced, does the PodGroup CRD still need to exist?

cc @william-wang @jinzhejz

wangyang0616 avatar Apr 03 '23 11:04 wangyang0616

In the scenario you describe, I agree with enhancing the PodGroup's ability to manage a batch of independent pods, but I still have some concerns. minResources can be configured through annotations; will other PodGroup attributes support this method in the future? As @Yikun noted, if they are all introduced, does the PodGroup CRD still need to exist?

cc @william-wang @jinzhejz

For the former, I think there are suitable scenarios that we can add as needed, but for the latter, I hold a different view. The PodGroup needs to exist as the group that manages pods; it is not just a simple CRD but the link between the scheduling units and the scheduling logic in the Volcano scheduler. I think this enhancement merely adds another entry point for creating PodGroups.

For example, VJ (Volcano Job) is currently the most native entry point for managing PodGroups, and it covers some batch scenarios. "spark.kubernetes.scheduler.volcano.podGroupTemplateFile" is the entry point for Spark to manage PodGroups natively, and it covers some big-data scenarios. The current enhancement manages PodGroups through the PodGroup controller, covering some special scenarios.

Their existence has nothing to do with the PodGroup CRD itself; they are just different ways of creating and managing PodGroups. The PodGroup still plays a vital role in the scheduler. If we did not use the PodGroup as this link between controller and scheduler, and instead relied purely on annotations, that would be a challenge for the code complexity, scalability, and rationality of the scheduler layer.

zbbkeepgoing avatar Apr 04 '23 03:04 zbbkeepgoing