volcano
Manage and schedule multiple independent pods with a PodGroup
This PR addresses the following situations:
- When we use Spark client mode to submit a Spark job to Kubernetes with the Volcano scheduler, we want all executor pods to be managed and scheduled through one PodGroup.
- More generally, we want a set of independent pods to be managed and scheduled through a PodGroup with Volcano.
Link to the API pull request: https://github.com/volcano-sh/apis/pull/83. I will update go.mod after the API PR is merged.
Hey, I'd kindly like to ask some questions about the background:
- Why do you want to split the whole Spark job into different, independent pods instead of making use of a Spark job or a Volcano Job?
- I have not tried it myself, but as I understand it, Volcano already supports generating a PodGroup for separate pods. Did something go wrong in your test?
1. In Spark's client submit mode, all the executor pods are independent; if we want Volcano to manage them, we need to group them into a PodGroup. A large part of current Spark users also submit jobs using client mode.
2. Yes, Volcano supports generating a PodGroup for separate pods, but it does not support setting the PodGroup's minResources, and this PR adds that support. In addition, the generated PodGroup only references the first pod; if the first pod terminates, the PodGroup is deleted automatically.
So this PR adds minResources support for independent pods and makes the PodGroup reference all of the independent pods.
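As a rough illustration of the intended usage: the group-name annotation below is Volcano's existing mechanism for attaching a pod to a named PodGroup, while the minResources annotation key and its JSON value are placeholders of my own — the real key is defined in the linked apis PR:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-exec-1
  annotations:
    # Existing Volcano annotation: join a named PodGroup.
    scheduling.k8s.io/group-name: spark-job-pg
    # Hypothetical key for the minResources hint added by this PR;
    # see https://github.com/volcano-sh/apis/pull/83 for the actual API.
    scheduling.volcano.sh/group-min-resources: '{"cpu": "4", "memory": "8Gi"}'
spec:
  schedulerName: volcano
  containers:
    - name: executor
      image: spark:3.3.0
```

Every independent executor pod would carry the same group-name annotation, so they all land in one PodGroup.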
/retest
@zbbkeepgoing: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.
In response to this:
/retest
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
please compress multiple commits into a single commit.
done
/retest
Can this pull request be merged into master? @Thor-wl
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
To complete the pull request process, please assign hwdef. You can assign the PR to them by writing /assign @hwdef in a comment when ready.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
/retest
@jiangkaihua Can this pull request be merged? We discussed this pr in the past week
Could you please help fix the code check? It seems that some UT cases did not pass.
It looks fine to me; it should be a random exception that has nothing to do with this PR. I have encountered it before.
/cc @Yikun Can you help evaluate the rationality of the demand and scenario? Thanks!
/retest
@zbbkeepgoing Thanks for contribution!
IIUC, this PR adds the ability to auto-create a PodGroup, with a minResources hint injected via pod annotation, when the first pod is created.
Currently, in Spark client mode, users need to create the PodGroup outside of Volcano and specify the annotation on executor pods.
By this PR:
- Advantages: Helps the user create the PodGroup with an annotation inside Volcano (rather than outside Volcano).
- Disadvantages: the annotation style (especially for complex JSON) might be a lightweight way, but not a good way, in the CRD world. If users want to use all Volcano PodGroup features, we would also have to support the likes of minMember and priorityClassName. After all features are supported, this might present a complex user interface to users. Also, after this PR, Spark cluster mode could use this ability to create the driver PodGroup and assign executors to an executor PodGroup.
So, if we can accept pod annotations as the API to create PodGroups, then I'm +1 on this PR. Otherwise, I am -0 (not a big disagreement) on this.
(Thinking aloud: if this is accepted, does it mean the PodGroup CRD is redundant at some level? Because annotations could help us do everything.)
also cc @william-wang
@william-wang Hi, leibo. Any questions or suggestions about this PR?
@william-wang @Thor-wl If there is no doubt about this PR, can we merge it as soon as possible? This PR has been open for a long time, and I see other users in the WeChat group who also have this requirement.
/assign @william-wang
I think there might be some issues here that are not very clear:
- If the minResources information is injected into the PodGroup through an annotation, and the minResources annotation differs from pod to pod, how should that be handled? Should the minResources of the first pod be used directly, or should the maximum minResources across all pods under the PodGroup be chosen?
- If a pod's minResources information changes, does Volcano need to handle that?
- The ownerReference of the PodGroup is set automatically. For example, when an independent pod is created, the ownerReference of the PodGroup is that pod; if multiple pods belong to one ReplicaSet, the ownerReference of the PodGroup is the ReplicaSet. In the current scenario, multiple unrelated independent pods are placed in one PodGroup, so how should the ownerReference be set? If it is set to the first pod, then after the first pod is deleted, the PodGroup will be deleted automatically, while other pods may not have finished executing or may not even have been scheduled yet. Is that a problem?
Hope to get more discussion, thanks!
- First, the minResources annotation is bound to the PodGroup name annotation; if the user does not specify a PodGroup name, the minResources annotation is not used. In addition, what is provided here is an extension capability for a special but common scenario, so unexpected user behavior should not be the main concern. If I were a user who needed this ability, I would follow this rule.
- Personally, I don't think it needs to be considered for the time being, and the current controller logic does not consider changing the PodGroup name after the PodGroup is automatically created by a single pod. In addition, I think minResources updates matter more in the scheduler: at present, if the minResources of an inqueue PodGroup is increased and the queue's resources are not sufficient at that moment, should the PodGroup be rolled back to pending? The above is my personal understanding. As with the first point, this is a solution for a special but common scenario. Of course, I am not against implementing minResources updates in the code if we believe that modifying minResources is a frequent and necessary requirement.
- In the current code, the ownerReferences of the PodGroup are all of the pods that specify the PodGroup.
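A minimal sketch of what "ownerReferences pointing at all member pods" could look like — the field names follow the real Kubernetes OwnerReference schema, but the helper itself is illustrative, not the Volcano controller's actual code:

```python
# Illustrative sketch: build one OwnerReference entry per member pod, so the
# auto-created PodGroup is garbage-collected only after the LAST owning pod
# is deleted, instead of vanishing with the first pod.

def build_owner_references(pods: list[dict]) -> list[dict]:
    """Return one OwnerReference per pod that declared this PodGroup name."""
    return [
        {
            "apiVersion": "v1",
            "kind": "Pod",
            "name": pod["name"],
            "uid": pod["uid"],
            # Kubernetes allows at most one owner with controller=True, so
            # independent peer pods are all non-controller owners.
            "controller": False,
        }
        for pod in pods
    ]
```

With two executor pods as owners, deleting either one alone leaves the PodGroup alive for the survivor.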
Thanks for your reply; it helped me understand the PR further.
Regarding question 1, I think it is not a good choice to rely on the user's diligence to handle the function's edge cases. We can guide the user to use the new function correctly through the manual or other means, but for usage beyond expectations, I think it would be better to have more comprehensive handling on the platform side.
Question 2 is really not a high priority right now.
Regarding question 3, my previous understanding was mistaken, and I agree with your approach.
Q1: Agree with you. Even leaving the controller aside, for the reclaim and preempt actions in the scheduler, choosing the maximum minResources would be the better choice. I will update it later. Thank you.
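For illustration, picking the per-resource maximum across pods could look like the sketch below. The helper names are hypothetical and `parse_quantity` handles only a few common Kubernetes quantity suffixes, not the full format:

```python
# Illustrative sketch: given each pod's declared minResources, take the
# per-resource maximum as the PodGroup's minResources.

def parse_quantity(q: str) -> float:
    """Parse a small subset of Kubernetes quantities (e.g. 500m, 1Gi)."""
    suffixes = {"m": 1e-3, "Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
    for suffix, factor in suffixes.items():
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * factor
    return float(q)

def max_min_resources(pod_resources: list[dict]) -> dict:
    """Per resource name, keep the largest quantity declared by any pod."""
    best: dict[str, float] = {}
    result: dict[str, str] = {}
    for resources in pod_resources:
        for name, qty in resources.items():
            value = parse_quantity(qty)
            if name not in best or value > best[name]:
                best[name] = value
                result[name] = qty
    return result
```

For example, for pods declaring `{"cpu": "500m", "memory": "1Gi"}` and `{"cpu": "2", "memory": "512Mi"}`, this yields `{"cpu": "2", "memory": "1Gi"}`.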
In addition, it would be better if the design document or user manual can be updated.
Yeah, I added a user-guide doc. Please help review it again.
/unassign @Thor-wl
/assign @wangyang0616
In the scenario you describe, I agree with enhancing the PodGroup's ability to manage a batch of independent pods, but I still have some concerns. minResources can be configured through annotations; will other attributes of the PodGroup support this method in the future? As @Yikun considered, if they are all introduced, does the PodGroup CRD still need to exist?
cc @william-wang @jinzhejz
For the former, I think there are suitable scenarios we can add as needed, but on the latter I hold a different view. PodGroup needs to exist as the group that manages pods; it is not just a simple CRD, it is the link between the scheduling units and the scheduling logic in the Volcano scheduler. I think this enhancement just adds another PodGroup creation entry point.
For example, VJ (Volcano Job) is currently the most native management entry for PodGroups, and it covers some batch scenarios. "spark.kubernetes.scheduler.volcano.podGroupTemplateFile" is the entry point for Spark to manage PodGroups natively, and it covers some big-data scenarios. The current enhancement manages the PodGroup through the PodGroup controller's entry, which covers some special scenarios.
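For reference, the Spark-native entry point mentioned above is used roughly as follows; this is a config fragment with placeholder values (API server host, template path), not a command taken from this PR:

```shell
spark-submit \
  --master k8s://https://<apiserver-host>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.scheduler.name=volcano \
  --conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/path/to/podgroup-template.yaml \
  local:///opt/spark/examples/jars/spark-examples.jar
```

Here the driver and executors are grouped by a PodGroup rendered from the template file, which is exactly the cluster-mode path; this PR covers the client-mode path where no such template entry exists.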
Their existence has nothing to do with the PodGroup CRD in nature; they are just different ways of creating and managing PodGroups. PodGroup still plays a vital role in the scheduler. If we did not use PodGroup as this link between controller and scheduler and completed everything with annotations alone, that would be a challenge for the code complexity, scalability, and rationality of the scheduler layer.