
Does VolcanoFeatureStep work with deploy-mode client in Kubernetes?

duhaesbaert opened this issue 2 years ago • 5 comments

What happened: I am facing an issue using a Volcano deployment, specifically when using the VolcanoFeatureStep. I found issue https://github.com/volcano-sh/volcano/issues/2832, where it was asked whether Volcano works with client-mode deployment, but no response was given. So I will ask again.

I am deploying a vcjob, and from that vcjob I have a container that is my driver, which runs the spark-submit command. When attempting to do so, I run into an error referencing a specific PodGroup, just like the issue mentioned above. That PodGroup is not defined anywhere in my code; even when I define one directly in my PodGroup template, the one in the error is a different one.

The error appears when the executors of my workload are being deployed. Error message:

io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://api.<masked-environment-endpoint>/api/v1/namespaces/04522055-15b3-40d8-ba07-22b1a2a5ffcc/pods. Message: admission webhook "validatepod.volcano.sh" denied the request: failed to get PodGroup for pod <04522055-15b3-40d8-ba07-22b1a2a5ffcc/cat-podgroup-driver-0-exec-789>: podgroups.scheduling.volcano.sh "spark-5ad570e340934d3997065fa6d504910e-podgroup" not found. Received status: Status(apiVersion=v1, code=400, details=null, kind=Status, message=admission webhook "validatepod.volcano.sh" denied the request: failed to get PodGroup for pod <04522055-15b3-40d8-ba07-22b1a2a5ffcc/cat-podgroup-driver-0-exec-789>: podgroups.scheduling.volcano.sh "spark-5ad570e340934d3997065fa6d504910e-podgroup" not found, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:538)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:558)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:349)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:711)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:93)
	at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:1113)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:93)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:440)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:417)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:370)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:363)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:363)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:134)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3$adapted(ExecutorPodsAllocator.scala:134)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber$$processSnapshotsInternal(ExecutorPodsSnapshotsStoreImpl.scala:143)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.processSnapshots(ExecutorPodsSnapshotsStoreImpl.scala:131)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$addSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:85)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:182)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:296)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:838)

PodGroup referenced in the error message: spark-5ad570e340934d3997065fa6d504910e-podgroup

Found this in the Spark repo: https://github.com/apache/spark/blob/0e689611f09968c3a46689294184de29d097302b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/VolcanoFeatureStep.scala#L32
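
For context, the feature step derives the PodGroup name from the Spark application ID rather than from the template's metadata.name, which is why an unfamiliar spark-<appId>-podgroup name shows up in the error. The relevant line in the file linked above is roughly:

private lazy val podGroupName = s"${kubernetesConf.appId}-podgroup"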

What you expected to happen:

I was expecting the Driver and Executors to be assigned to the same PodGroup created from the PodGroupTemplate via the VolcanoFeatureStep, with no errors. Instead, only the Driver is deployed, and the error above appears in the pod logs.

How to reproduce it (as minimally and precisely as possible):

  1. Using Spark version 3.4.1, submit a new Spark workload, providing the following:
--deploy-mode="client"
--class "org.apache.spark.examples.SparkPi"
file:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar

Also provide the necessary configuration for the VolcanoFeatureStep to work (a consolidated command sketch follows the steps below):

--conf spark.kubernetes.driver.pod.featureSteps="org.apache.spark.deploy.k8s.features.VolcanoFeatureStep" 
--conf spark.kubernetes.executor.pod.featureSteps="org.apache.spark.deploy.k8s.features.VolcanoFeatureStep" 
--conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile="podgroup.yaml" 
  2. Create a PodGroup like the following:
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: pod-group-test
spec:
  minResources:
    cpu: "2"
    memory: "2Gi"
  queue: my-queue
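
Putting the pieces together, the submission that reproduces the failure looks roughly like this (the master endpoint is a placeholder; all flags are taken from the steps above):

spark-submit \
  --master k8s://https://<kubernetes-api-endpoint> \
  --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
  --conf spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
  --conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile=podgroup.yaml \
  file:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar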

Anything else we need to know?:

Spark version 3.4.1

Environment:

  • Volcano Version: v1.8.0
  • Kubernetes version (use kubectl version): v1.26.7
  • Cloud provider or hardware configuration: GCP
  • OS (e.g. from /etc/os-release): Don't have the specifics
  • Kernel (e.g. uname -a): Don't have the specifics
  • Install tools: Don't have the specifics
  • Others:

duhaesbaert avatar Dec 06 '23 21:12 duhaesbaert

@Yikun Can you take a look? I'm not so familiar with Spark here.

Monokaix avatar Dec 08 '23 07:12 Monokaix

Thanks for pinging me!

The VolcanoFeatureStep only works with cluster mode; in client mode, the user should manage the PodGroup manually.

IIRC, in Spark client mode the whole driver-side feature step pipeline is skipped, which means the PodGroup never gets a chance to be created.

So in client mode, the user should create the PodGroup outside of Spark manually (just like launching the client manually), and make sure the client pod and the executor pods carry the annotation binding them to the manually created PodGroup (mirroring what the feature step does).
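
A minimal sketch of that manual setup, assuming the annotation key scheduling.k8s.io/group-name (the key the feature step itself applies in cluster mode) and placeholder names (manual-podgroup, my-namespace, my-queue):

# Create the PodGroup before running spark-submit:
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: manual-podgroup       # placeholder; any name you manage yourself
  namespace: my-namespace     # must be the namespace the executors run in
spec:
  minMember: 1
  queue: my-queue

Then bind the executor pods to it at submit time (the client pod itself needs the same annotation in its own pod spec, e.g. in the vcjob template):

--conf spark.kubernetes.scheduler.name=volcano
--conf spark.kubernetes.executor.annotation.scheduling.k8s.io/group-name=manual-podgroup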

Yikun avatar Dec 09 '23 09:12 Yikun

Hi @Yikun, thank you for the reply.

So, let me ask one additional question. As you mentioned, client mode is not supported by the VolcanoFeatureStep, so we would need to manage the PodGroups and their lifecycle manually, right? In other words, I would also have to manually delete those PodGroups once I am no longer using them?

Additionally, as I mentioned, we are using spark-submit, and I would like to confirm the order of things with you, as I will then implement it in my application. If my understanding is correct, I would have to do the following:

  1. Create a PodGroup that I will associate with Drivers and Executors
  2. Then, as I execute the spark-submit command, provide spark.kubernetes.executor.annotation.[AnnotationName] so the executors created will carry the annotation associating them with that PodGroup. Is my understanding correct?

Is there any additional step that should be done manually to associate them (drivers and executors) with the PodGroups?

duhaesbaert avatar Dec 11 '23 12:12 duhaesbaert

One additional question on this matter: the reason I opened this issue is that I have not seen any indication in the documentation, on either the Spark or the Volcano side, that the VolcanoFeatureStep does not work when deploying Spark jobs in client mode. Is that true, or is there some documentation I am not aware of where this information lives?

And something to propose, not sure if here or on the Spark side: when the VolcanoFeatureStep is used with deploy mode client, a message should be returned, or logged, stating explicitly that client mode is not supported. I don't recall seeing that in the logs.

And once again, thank you for confirming our suspicion.

duhaesbaert avatar Dec 11 '23 17:12 duhaesbaert

> Hi @Yikun, thank you for the reply.
>
> So, let me ask one additional question. As you mentioned, client mode is not supported by the VolcanoFeatureStep, so we would need to manage the PodGroups and their lifecycle manually, right? In other words, I would also have to manually delete those PodGroups once I am no longer using them?
>
> Additionally, as I mentioned, we are using spark-submit, and I would like to confirm the order of things with you, as I will then implement it in my application. If my understanding is correct, I would have to do the following:
>
> 1. Create a PodGroup that I will associate with Drivers and Executors
> 2. Then, as I execute the spark-submit command, provide spark.kubernetes.executor.annotation.[AnnotationName] so the executors created will carry the annotation associating them with that PodGroup. Is my understanding correct?
>
> Is there any additional step that should be done manually to associate them (drivers and executors) with the PodGroups?

Were you actually able to get Spark in client mode working by adding labels/annotations to the running Spark executors and labels to the PodGroup?

apinchuk1 avatar Dec 16 '23 00:12 apinchuk1

> Were you actually able to get Spark in client mode working by adding labels/annotations to the running Spark executors and labels to the PodGroup?

Hi @apinchuk1, yes, that is correct: it works perfectly just assigning the PodGroup name as a label on the executors. I have created logic of my own where I can do this directly in my own spark-submit command. It is working like a charm!
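
For anyone landing here later, a sketch of that submit-time binding; [LabelName] stands in for whatever key your Volcano setup honors, manual-podgroup is the placeholder PodGroup name from the sketch above, and the annotation variant is what Yikun described earlier:

--conf spark.kubernetes.executor.label.[LabelName]=manual-podgroup
--conf spark.kubernetes.executor.annotation.scheduling.k8s.io/group-name=manual-podgroup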

duhaesbaert avatar Apr 29 '24 21:04 duhaesbaert

Thank you for the confirmation. Closing Issue.

duhaesbaert avatar Apr 29 '24 21:04 duhaesbaert