Does VolcanoFeatureStep work with deploy-mode client in Kubernetes?
What happened: I am facing an issue with a Volcano deployment, specifically when using the VolcanoFeatureStep. I found the following issue https://github.com/volcano-sh/volcano/issues/2832, where it was asked whether Volcano works with client-mode deployment, but no response was given. So I will ask again.
I deploy a vcjob, and within that vcjob I have a container acting as my driver, which runs the spark-submit command. When I do so, I run into an error referencing a specific podgroup, just like the issue mentioned above. That podgroup is not defined anywhere in my code; even when I define one directly in my pod group template, the error references a different one.
The error seems to occur when the executors of my workload are being deployed. Error message:
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://api.<masked-environment-endpoint>/api/v1/namespaces/04522055-15b3-40d8-ba07-22b1a2a5ffcc/pods. Message: admission webhook "validatepod.volcano.sh" denied the request: failed to get PodGroup for pod <04522055-15b3-40d8-ba07-22b1a2a5ffcc/cat-podgroup-driver-0-exec-789>: podgroups.scheduling.volcano.sh "spark-5ad570e340934d3997065fa6d504910e-podgroup" not found. Received status: Status(apiVersion=v1, code=400, details=null, kind=Status, message=admission webhook "validatepod.volcano.sh" denied the request: failed to get PodGroup for pod <04522055-15b3-40d8-ba07-22b1a2a5ffcc/cat-podgroup-driver-0-exec-789>: podgroups.scheduling.volcano.sh "spark-5ad570e340934d3997065fa6d504910e-podgroup" not found, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:538)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:558)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:349)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:711)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:93)
at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:1113)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:93)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:440)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:417)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:370)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:363)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:363)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:134)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3$adapted(ExecutorPodsAllocator.scala:134)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber$$processSnapshotsInternal(ExecutorPodsSnapshotsStoreImpl.scala:143)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.processSnapshots(ExecutorPodsSnapshotsStoreImpl.scala:131)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$addSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:85)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:182)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:296)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:838)
Podgroup referenced in the error message: spark-5ad570e340934d3997065fa6d504910e-podgroup
Found this in the Spark repo: https://github.com/apache/spark/blob/0e689611f09968c3a46689294184de29d097302b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/VolcanoFeatureStep.scala#L32
What you expected to happen:
I expected the driver and executors to be assigned to the same pod group created from the PodGroupTemplate via this VolcanoFeatureStep, with no errors. Instead, only the driver is deployed, and the error above appears in the pod logs.
How to reproduce it (as minimally and precisely as possible):
- Using Spark version 3.4.1, submit a new Spark workload, providing the following:
--deploy-mode="client"
--class "org.apache.spark.examples.SparkPi"
file:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar
Also provide the configuration necessary for the VolcanoFeatureStep to work:
--conf spark.kubernetes.driver.pod.featureSteps="org.apache.spark.deploy.k8s.features.VolcanoFeatureStep"
--conf spark.kubernetes.executor.pod.featureSteps="org.apache.spark.deploy.k8s.features.VolcanoFeatureStep"
--conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile="podgroup.yaml"
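Put together, the invocation looks roughly like the following. This is only a sketch: the master URL and container image are placeholders, and setting spark.kubernetes.scheduler.name=volcano is what I understand to be the usual companion setting for the Volcano feature steps.

```shell
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.kubernetes.scheduler.name=volcano \
  --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
  --conf spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
  --conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile=podgroup.yaml \
  file:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar
```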
- I am creating a PodGroup like the following:
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: pod-group-test
spec:
  minResources:
    cpu: "2"
    memory: "2Gi"
  queue: my-queue
Anything else we need to know?:
Spark version 3.4.1
Environment:
- Volcano Version: v1.8.0
- Kubernetes version (use kubectl version): v1.26.7
- Cloud provider or hardware configuration: GCP
- OS (e.g. from /etc/os-release): Don't have the specifics
- Kernel (e.g. uname -a): Don't have the specifics
- Install tools: Don't have the specifics
- Others:
@Yikun Can you take a look? I'm not so familiar with spark here.
Thanks for pinging me!
The VolcanoFeatureStep only works in cluster mode; the user should manage the podgroup manually in client mode.
IIRC, in Spark client mode, the entire driver-side feature step pipeline is skipped, which means the podgroup never has a chance to be created.
So in client mode, the user should create the podgroup manually, outside of Spark (just like launching the client manually), and make sure the client pod and executor pods carry the annotation binding them to the manually created podgroup (mirroring the feature step's behavior).
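A sketch of that manual workaround could look like the following. This assumes the binding key is the scheduling.k8s.io/group-name annotation that the VolcanoFeatureStep itself sets in cluster mode (per the linked source); the podgroup name, namespace, queue, and minMember value are placeholders you would adapt to your setup.

```shell
# 1. Create the podgroup manually, outside of Spark.
kubectl apply -f - <<'EOF'
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: my-manual-podgroup
  namespace: my-namespace
spec:
  minMember: 1
  queue: my-queue
EOF

# 2. Bind the executor pods to it via the same annotation the
#    feature step would have added in cluster mode.
spark-submit \
  ... \
  --conf "spark.kubernetes.executor.annotation.scheduling.k8s.io/group-name=my-manual-podgroup" \
  ...

# 3. Delete the podgroup yourself once the job finishes,
#    since Spark no longer owns its lifecycle in client mode.
kubectl delete podgroup my-manual-podgroup -n my-namespace
```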
Hi @Yikun, thank you for the reply.
So, let me just ask one additional question. As you mentioned, client mode is not supported by the VolcanoFeatureStep, so we would need to manage the PodGroups and their lifecycle manually, right? In other words, I would also have to manually delete those PodGroups once I am no longer using them?
Additionally, as I mentioned, we are using spark-submit, and I would like to confirm the order of operations with you, as I will then implement them in my application. If my understanding is correct, I would have to do the following:
- Create a PodGroup that I will associate with Drivers and Executors
- And then, as I execute the spark-submit command, I should provide spark.kubernetes.executor.annotation.[AnnotationName] so the executors created will carry the annotation associating them with that PodGroup. Is my understanding correct?
Are there any additional steps that should be done manually to associate them (drivers and executors) with the PodGroups?
One additional question on this matter: the reason I opened this issue is that I have not seen any indication in the documentation, for either Spark or Volcano, that the VolcanoFeatureStep does not work when deploying Spark jobs in client mode. Is that true, or is there documentation I am not aware of where this information lives?
And something to propose (not sure whether here or on the Spark side): when the VolcanoFeatureStep is used with deploy mode client, a more explicit message should be returned or logged stating that client mode is not supported. I don't recall seeing that in the logs.
And once again, thank you for confirming our suspicion.
Were you actually able to get Spark in client mode working by adding labels/annotations to the running Spark executors and labels to the podgroup?
Hi @apinchuk1, yes, that is correct: it works perfectly by just assigning the podgroup name as a label on the executors. I have built logic of my own so I can do this directly in my own spark-submit command. It is working like a charm!
Thank you for the confirmation. Closing Issue.