Fix: Sync Volcano PodGroup Queue & Priority on Update
Summary
This PR fixes issue #2526 in the Volcano batch scheduler integration, where re-syncing an existing PodGroup only updated `minMember` and did not update:
- `spec.queue`
- `spec.priorityClassName`
- `spec.minResources`

As a result, changing `spec.batchSchedulerOptions.queue` after an initial submission could leave the PodGroup stuck on a stale queue (e.g., `default`), even though the SparkApplication specified a new queue.

This PR ensures that the existing PodGroup is always kept fully in sync with the latest `batchSchedulerOptions`.
Current Behavior (Before This Fix)
When `spec.batchScheduler: "volcano"` is set and the controller's batch scheduler feature is enabled, the execution flow is:
- The controller calls `submitSparkApplication`.
- That calls `shouldDoBatchScheduling`.
- For `batchScheduler = "volcano"`, the controller executes `scheduler.Schedule(app)` (in `internal/scheduler/volcano/scheduler.go`).
- Inside `Scheduler.Schedule`, for cluster mode, it calls `syncPodGroupInClusterMode(app)`.
- `syncPodGroupInClusterMode`:
  - Computes the total resources (driver + executor, or the override from `BatchSchedulerOptions.Resources`)
  - Calls:
    ```go
    s.syncPodGroup(app, 1, totalResource)
    ```
  - Then sets:
    - `app.Spec.Driver.Annotations["scheduling.k8s.io/group-name"] = "<app-name>-pg"`
    - `app.Spec.Executor.Annotations["scheduling.k8s.io/group-name"] = "<app-name>-pg"`
- Later, while building the `spark-submit` arguments, these annotations are converted into `--conf spark.kubernetes.{driver,executor}.annotation.*` entries (see the sketch after this list).

So the actual pods are created with:
- `metadata.annotations["scheduling.k8s.io/group-name"] = "<app-name>-pg"`
- `spec.schedulerName = "volcano"` (set by `sparkpod_defaulter.go` from `spec.batchScheduler`)
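For illustration, here is a minimal, self-contained sketch of that annotation-to-conf conversion. It is not the operator's actual submission code; the `driverAnnotationConfs` helper is made up, but the `spark.kubernetes.driver.annotation.<key>` property is the standard Spark-on-Kubernetes config it maps to.

```go
package main

import "fmt"

// driverAnnotationConfs is a hypothetical helper showing the shape of the
// conversion: each driver annotation becomes a
// "spark.kubernetes.driver.annotation.<key>=<value>" --conf entry.
func driverAnnotationConfs(annotations map[string]string) []string {
	args := make([]string, 0, 2*len(annotations))
	for key, value := range annotations {
		args = append(args,
			"--conf",
			fmt.Sprintf("spark.kubernetes.driver.annotation.%s=%s", key, value))
	}
	return args
}

func main() {
	fmt.Println(driverAnnotationConfs(map[string]string{
		"scheduling.k8s.io/group-name": "my-app-pg",
	}))
	// [--conf spark.kubernetes.driver.annotation.scheduling.k8s.io/group-name=my-app-pg]
}
```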
Resulting Behavior
The operator creates exactly one PodGroup per SparkApplication, named `getPodGroupName(app)` = `fmt.Sprintf("%s-pg", app.Name)`, with:
- `ownerReference.kind = "SparkApplication"`
- `spec.minMember`
- `spec.minResources`
- `spec.queue` and `spec.priorityClassName` (only set on initial creation)
Root Cause of the Bug
The PodGroup creation and update logic lives in `internal/scheduler/volcano/scheduler.go`, in `syncPodGroup(...)`.
On Create (Correct Behavior)
```go
podGroup := v1beta1.PodGroup{
	Spec: v1beta1.PodGroupSpec{
		MinMember:    size,
		MinResources: &minResource,
	},
}

if app.Spec.BatchSchedulerOptions != nil {
	if app.Spec.BatchSchedulerOptions.Queue != nil {
		podGroup.Spec.Queue = *app.Spec.BatchSchedulerOptions.Queue
	}
	if app.Spec.BatchSchedulerOptions.PriorityClassName != nil {
		podGroup.Spec.PriorityClassName = *app.Spec.BatchSchedulerOptions.PriorityClassName
	}
}
```
On Update (Buggy Behavior)
```go
} else {
	if pg.Spec.MinMember != size {
		pg.Spec.MinMember = size
		_, err = s.volcanoClient.SchedulingV1beta1().
			PodGroups(namespace).
			Update(ctx, pg, metav1.UpdateOptions{})
	}
}
```
Problem

If:
- A PodGroup `<app-name>-pg` already exists (e.g. one created earlier using the default queue), and
- The user updates:
  ```yaml
  spec:
    batchSchedulerOptions:
      queue: myqueue
  ```

then the operator reuses the existing PodGroup but never updates its `spec.queue`, `spec.priorityClassName`, or `spec.minResources`.

This is a real operator bug.
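To make the failure mode concrete, here is a minimal runnable sketch using the Volcano API types directly. `oldSyncUpdate` is an illustrative stand-in for the pre-fix update path, not the operator's function.

```go
package main

import (
	"fmt"

	"volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// oldSyncUpdate mimics the pre-fix update path: only MinMember is reconciled,
// so a queue change on the SparkApplication never reaches the existing PodGroup.
func oldSyncUpdate(pg *v1beta1.PodGroup, size int32, desiredQueue string) {
	if pg.Spec.MinMember != size {
		pg.Spec.MinMember = size
		// The Update(...) call would happen here; Queue is never touched.
	}
	_ = desiredQueue // ignored by the old code path
}

func main() {
	pg := &v1beta1.PodGroup{
		Spec: v1beta1.PodGroupSpec{MinMember: 1, Queue: "default"},
	}
	oldSyncUpdate(pg, 1, "myqueue")
	fmt.Println(pg.Spec.Queue) // prints "default": the stale-queue symptom
}
```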
Why Two PodGroups Appear in the Cluster

From the spark-operator codebase:
- The Volcano scheduler only creates PodGroups owned by `ownerReference.kind: SparkApplication`.
- It never creates PodGroups owned by a Pod.

Therefore, the second PodGroup observed in the issue (`ownerReference.kind = Pod`, `Inqueue`) must come from either:
- Spark's native Volcano integration (e.g., `VolcanoFeatureStep` plus a PodGroup template), or
- Volcano auto-creating a PodGroup due to a different `scheduling.k8s.io/group-name`.
Resulting Symptom

- SparkApplication-owned PodGroup → `queue = myqueue`
- Pod-owned PodGroup → still on `queue = default`
- Pods bind to the Pod-owned PodGroup → the default queue is used

This explains:
- Duplicate PodGroups
- The queue setting being ignored
- Inconsistent scheduling behavior
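When reproducing this, a small diagnostic like the following can help confirm which PodGroup is which. This is a sketch, not part of this PR; it assumes a standard `~/.kube/config` and uses a placeholder namespace.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	volcanoclient "volcano.sh/apis/pkg/client/clientset/versioned"
)

func main() {
	// Load the local kubeconfig and build a Volcano clientset.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := volcanoclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// List every PodGroup in the application's namespace ("default" is a
	// placeholder) with its owner kind, queue, and phase, to spot a
	// Pod-owned duplicate sitting on the wrong queue.
	pgs, err := client.SchedulingV1beta1().PodGroups("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pg := range pgs.Items {
		owner := "none"
		if len(pg.OwnerReferences) > 0 {
			owner = pg.OwnerReferences[0].Kind
		}
		fmt.Printf("%-30s owner=%-17s queue=%-10s phase=%s\n", pg.Name, owner, pg.Spec.Queue, pg.Status.Phase)
	}
}
```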
What This PR Changes

Before
```go
} else {
	if pg.Spec.MinMember != size {
		pg.Spec.MinMember = size
		_, err = s.volcanoClient.SchedulingV1beta1().PodGroups(namespace).Update(
			context.TODO(), pg, metav1.UpdateOptions{},
		)
	}
}
```
After (This PR)
```go
} else {
	// The PodGroup already exists -- keep it in sync with the SparkApplication spec.
	updated := false

	// Ensure MinMember matches the desired size.
	if pg.Spec.MinMember != size {
		pg.Spec.MinMember = size
		updated = true
	}

	if app.Spec.BatchSchedulerOptions != nil {
		opts := app.Spec.BatchSchedulerOptions

		// Keep Queue consistent with BatchSchedulerOptions.Queue.
		if opts.Queue != nil && pg.Spec.Queue != *opts.Queue {
			pg.Spec.Queue = *opts.Queue
			updated = true
		}

		// Keep PriorityClassName consistent with BatchSchedulerOptions.PriorityClassName.
		if opts.PriorityClassName != nil && pg.Spec.PriorityClassName != *opts.PriorityClassName {
			pg.Spec.PriorityClassName = *opts.PriorityClassName
			updated = true
		}

		// If explicit resources are provided, always override MinResources.
		if len(opts.Resources) > 0 {
			res := opts.Resources
			pg.Spec.MinResources = &res
			updated = true
		}
	}

	if updated {
		_, err = s.volcanoClient.SchedulingV1beta1().PodGroups(namespace).Update(
			context.TODO(), pg, metav1.UpdateOptions{},
		)
	}
}
```
This guarantees that the PodGroup always reflects:
- `batchSchedulerOptions.queue`
- `batchSchedulerOptions.priorityClassName`
- `batchSchedulerOptions.resources`

even across re-submissions and spec changes.
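To show the intended behavior in isolation, here is a small runnable sketch of the same reconciliation rule expressed as a pure function. The `desiredOptions` struct and `syncSpec` helper are illustrative stand-ins; the PR itself does this inline in `syncPodGroup`.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// desiredOptions is a stand-in for the SparkApplication's BatchSchedulerOptions.
type desiredOptions struct {
	Queue             *string
	PriorityClassName *string
	Resources         corev1.ResourceList
}

// syncSpec mirrors the post-fix update path: it mutates the existing PodGroup
// spec to match the desired options and reports whether an Update is needed.
func syncSpec(pg *v1beta1.PodGroup, size int32, opts *desiredOptions) bool {
	updated := false
	if pg.Spec.MinMember != size {
		pg.Spec.MinMember = size
		updated = true
	}
	if opts != nil {
		if opts.Queue != nil && pg.Spec.Queue != *opts.Queue {
			pg.Spec.Queue = *opts.Queue
			updated = true
		}
		if opts.PriorityClassName != nil && pg.Spec.PriorityClassName != *opts.PriorityClassName {
			pg.Spec.PriorityClassName = *opts.PriorityClassName
			updated = true
		}
		if len(opts.Resources) > 0 {
			res := opts.Resources
			pg.Spec.MinResources = &res
			updated = true
		}
	}
	return updated
}

func main() {
	queue := "myqueue"
	pg := &v1beta1.PodGroup{Spec: v1beta1.PodGroupSpec{MinMember: 1, Queue: "default"}}

	// Changing only the queue now triggers an update and lands on the PodGroup.
	fmt.Println(syncSpec(pg, 1, &desiredOptions{Queue: &queue}), pg.Spec.Queue) // true myqueue
	// A second pass with the same desired state is a no-op.
	fmt.Println(syncSpec(pg, 1, &desiredOptions{Queue: &queue}), pg.Spec.Queue) // false myqueue
}
```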
Recommended Configuration to Avoid Duplicate PodGroups
Option A: Operator-managed Volcano (Recommended)

```yaml
spec:
  batchScheduler: "volcano"
  batchSchedulerOptions:
    queue: "myqueue"
```

And in the Helm values:

```yaml
controller:
  batchScheduler:
    enable: true
```

Disable Spark's native Volcano integration in the Spark image:
- `VolcanoFeatureStep`
- `spark.kubernetes.scheduler.volcano.podGroupTemplateFile`
Option B: Spark-managed Volcano

```yaml
controller:
  batchScheduler:
    enable: false
```

And do NOT set:
- `spec.batchScheduler`
- `spec.batchSchedulerOptions`

Let Spark manage PodGroups via its native configs.
Outcome of this PR
- Fixes stale queue behavior
- Keeps PodGroups fully in sync with SparkApplication spec
- Prevents silent scheduling mismatches
- Improves Volcano + spark-operator correctness