
SparkApplication Creates Duplicate PodGroups and Ignores Custom Volcano Queue

Open sagarprst opened this issue 7 months ago • 5 comments


Hi,

I’m encountering an issue with my SparkApplication custom resource when using Volcano as the batch scheduler. Even after specifying a custom queue in the SparkApplication spec, the job still runs in the default queue.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  imagePullSecrets:
    - my-pull-secrets
  type: Scala
  mode: cluster
  image: our.repo/spark:volcano-latest
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar
  sparkVersion: 3.5.3
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: 1024m
    serviceAccount: spark-operator-spark
  executor:
    cores: 1
    instances: 1
    memory: 1024m
  batchScheduler: "volcano"
  batchSchedulerOptions:
    queue: "myqueue"

After applying the YAML above, the job completes, but I see two PodGroups instead of one:

kubectl get podgroups -n default

NAME        STATUS      MINMEMBER   RUNNINGS   AGE
podgroup1   Completed   1                      2m36s
podgroup2   Inqueue     1                      2m25s

Expected behavior

The Spark job should run in the specified queue (myqueue).

Only one PodGroup should be created, owned by the SparkApplication resource.

Actual behavior

The Spark job runs successfully.

However, two PodGroups are created:

One with ownerReference.kind set to Pod, which ends up stuck in the Inqueue state.

Another owned by the SparkApplication, which reaches the Completed status.

The PodGroup owned by the SparkApplication appears to function as expected, but the extra PodGroup causes confusion and potential scheduling conflicts.
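For anyone hitting the same symptom, one way to see at a glance which PodGroup carries which owner and queue is a custom-columns query (the fields below come from the Volcano PodGroup API; the PodGroup names in the output will differ in your cluster):

```shell
# Show each PodGroup's first owner kind, assigned queue, and phase.
# Requires a cluster with the Volcano CRDs installed.
kubectl get podgroups -n default \
  -o custom-columns='NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind,QUEUE:.spec.queue,PHASE:.status.phase'
```

If the issue described here is present, the Pod-owned PodGroup would be the one missing the custom queue, while the SparkApplication-owned one shows `myqueue`.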

Environment & Versions

  • Kubernetes Version: v1.28.7
  • Spark Operator Version: 2.1.0
  • Apache Spark Version: 3.5.3

Can you confirm if this is expected behavior or a bug?

Is there a workaround or additional configuration needed to ensure the correct queue is used and redundant PodGroups are avoided?

sagarprst avatar May 07 '25 14:05 sagarprst

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 05 '25 16:08 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar Aug 25 '25 16:08 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 24 '25 02:11 github-actions[bot]

/lifecycle frozen

ChenYi015 avatar Nov 24 '25 06:11 ChenYi015

Hi @sagarprst and @ChenYi015, I’ve opened a PR for this issue with a detailed explanation of the findings and the proposed fix. Please feel free to review #2759 whenever you get a chance. I’m happy to make any further changes if needed.

Thanks!

rahul810050 avatar Dec 08 '25 07:12 rahul810050