SparkApplication Creates Duplicate PodGroups and Ignores Custom Volcano Queue
Hi,
I’m encountering an issue with my SparkApplication custom resource configuration when using Volcano as the batch scheduler. Specifically, even after specifying a custom queue in the SparkApplication spec, the job still lands in the default queue during execution.
```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  imagePullSecrets:
    - my-pull-secrets
  type: Scala
  mode: cluster
  image: our.repo/spark:volcano-latest
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.3.jar
  sparkVersion: 3.5.3
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: 1024m
    serviceAccount: spark-operator-spark
  executor:
    cores: 1
    instances: 1
    memory: 1024m
  batchScheduler: "volcano"
  batchSchedulerOptions:
    queue: "myqueue"
```
After applying the above YAML, the job completes, but I see two PodGroups instead of one:
```console
$ kubectl get podgroups -n default
NAME        STATUS      MINMEMBER   RUNNINGS   AGE
podgroup1   Completed   1                      2m36s
podgroup2   Inqueue     1                      2m25s
```
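To see which queue each PodGroup actually landed in, the `spec.queue` field can be pulled out with plain kubectl (nothing here is specific to my setup); this is how I confirmed the job was placed in the default queue rather than myqueue:

```console
$ kubectl get podgroups -n default \
    -o custom-columns='NAME:.metadata.name,QUEUE:.spec.queue'
```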
Expected behavior
- The Spark job should run in the specified queue, myqueue (a sketch of the PodGroup I would expect follows this list).
- Only one PodGroup should be created, owned by the SparkApplication resource.
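For concreteness, this is roughly the single PodGroup I would expect. It is an abbreviated, illustrative sketch: the name is made up and required ownerReferences fields such as uid are omitted.

```yaml
# Abbreviated, illustrative sketch of the one PodGroup I would expect.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: spark-pi-pg            # illustrative name
  namespace: default
  ownerReferences:
    - apiVersion: sparkoperator.k8s.io/v1beta2
      kind: SparkApplication   # owned by the SparkApplication, not a Pod
      name: spark-pi
spec:
  minMember: 1
  queue: myqueue               # the queue from batchSchedulerOptions
```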
Actual behavior
- The Spark job runs successfully.
- However, two PodGroups are created:
  - one with ownerReference.kind set to Pod, which ends up in the Inqueue state;
  - another owned by the SparkApplication, which reaches Completed status.
The PodGroup tied to the SparkApplication itself seems to be functioning as expected, but the additional PodGroup causes confusion and potential scheduling conflicts.
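The owner of each PodGroup can be confirmed with plain kubectl, which is how I spotted the Pod-owned one:

```console
$ kubectl get podgroups -n default \
    -o custom-columns='NAME:.metadata.name,OWNERKIND:.metadata.ownerReferences[0].kind'
```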
Environment & Versions
- Kubernetes Version: v1.28.7
- Spark Operator Version: 2.1.0
- Apache Spark Version: 3.5.3
Can you confirm whether this is expected behavior or a bug?
Is there a workaround or additional configuration needed to ensure the correct queue is used and the redundant PodGroup is avoided? (One possibility I considered is sketched below.)
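My working assumption (unverified) is that the Pod-owned PodGroup comes from Spark’s own built-in Volcano support (the VolcanoFeatureStep that Spark 3.3+ ships for running without the operator), which never sees the operator’s batchSchedulerOptions.queue. If that is right, one way to make both PodGroups agree on the queue might be to also pass it through Spark’s PodGroup template. This is only a sketch; the file path is illustrative and I have not confirmed it removes the duplicate:

```yaml
# podgroup-template.yaml -- consumed by Spark's native Volcano integration
# via spark.kubernetes.scheduler.volcano.podGroupTemplateFile (Spark 3.3+).
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
spec:
  queue: myqueue
```

with the template wired into the SparkApplication via sparkConf (assuming the file is baked into the image at this path):

```yaml
spec:
  sparkConf:
    # Illustrative path; must exist inside the Spark image.
    "spark.kubernetes.scheduler.volcano.podGroupTemplateFile": "/opt/spark/conf/podgroup-template.yaml"
```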
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/lifecycle frozen
Hi @sagarprst and @ChenYi015, I’ve opened a PR for this issue with a detailed explanation of the findings and the proposed fix. Please feel free to review #2759 whenever you get a chance. I’m happy to make any further changes if needed.
Thanks!