
Fix: Sync Volcano PodGroup Queue & Priority on Update

Open rahul810050 opened this issue 2 weeks ago • 1 comment

Summary

This PR fixes issue #2526 in the Volcano batch scheduler integration, where re-syncing an existing PodGroup updated minMember but did not update:

  • spec.queue
  • spec.priorityClassName
  • spec.minResources

As a result, changing spec.batchSchedulerOptions.queue after an initial submission could leave the PodGroup stuck on a stale queue (e.g., default), even though the SparkApplication specified a new queue.

This PR ensures that the existing PodGroup is always kept fully in sync with the latest batchSchedulerOptions.

  1. Current Behavior (Before This Fix)

When spec.batchScheduler: "volcano" is set and the controller batch scheduler feature is enabled, the execution flow is:

  1. The controller calls submitSparkApplication.
  2. submitSparkApplication calls shouldDoBatchScheduling.
  3. For batchScheduler = "volcano", the controller executes:
scheduler.Schedule(app)  // internal/scheduler/volcano/scheduler.go
  4. Inside Scheduler.Schedule, for cluster mode it calls syncPodGroupInClusterMode(app). This function:
      - Computes total resources (driver + executor, or overridden by BatchSchedulerOptions.Resources)
      - Calls:
      ```go
      s.syncPodGroup(app, 1, totalResource)
      ```
      - Then sets:
app.Spec.Driver.Annotations["scheduling.k8s.io/group-name"] = "<app-name>-pg"
app.Spec.Executor.Annotations["scheduling.k8s.io/group-name"] = "<app-name>-pg"
  5. Later, while building spark-submit arguments, these annotations are converted into:
--conf spark.kubernetes.{driver,executor}.annotation.*

So the actual pods are created with:

  • metadata.annotations["scheduling.k8s.io/group-name"] = "<app-name>-pg"
  • spec.schedulerName = "volcano" (set by sparkpod_defaulter.go from spec.batchScheduler)
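The annotation-to-flag conversion described above can be sketched as follows. This is a hypothetical, simplified helper (the real logic lives in the operator's spark-submit argument builder), shown only to illustrate how the same `<app-name>-pg` group name ends up on both driver and executor pods:

```go
package main

import "fmt"

// groupNameConfs sketches how the operator-derived PodGroup annotation
// becomes spark-submit --conf flags for the driver and executor pods.
// Hypothetical simplification; not the actual operator function.
func groupNameConfs(appName string) []string {
	pg := fmt.Sprintf("%s-pg", appName) // matches getPodGroupName(app)
	key := "scheduling.k8s.io/group-name"
	return []string{
		fmt.Sprintf("--conf spark.kubernetes.driver.annotation.%s=%s", key, pg),
		fmt.Sprintf("--conf spark.kubernetes.executor.annotation.%s=%s", key, pg),
	}
}

func main() {
	for _, c := range groupNameConfs("spark-pi") {
		fmt.Println(c)
	}
}
```

Both pods therefore point at the same SparkApplication-owned PodGroup, provided nothing else injects a different group name.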

Resulting Behavior

The operator creates exactly one PodGroup per SparkApplication:

getPodGroupName(app) = fmt.Sprintf("%s-pg", app.Name)

With:

  • ownerReference.kind = "SparkApplication"
  • spec.minMember
  • spec.minResources
  • spec.queue and spec.priorityClassName (only on initial creation)

  2. Root Cause of the Bug

The PodGroup creation & update logic lives in:

internal/scheduler/volcano/scheduler.go → syncPodGroup(...)

On Create (Correct Behavior)

podGroup := v1beta1.PodGroup{
    Spec: v1beta1.PodGroupSpec{
        MinMember:    size,
        MinResources: &minResource,
    },
}

if app.Spec.BatchSchedulerOptions != nil {
    if app.Spec.BatchSchedulerOptions.Queue != nil {
        podGroup.Spec.Queue = *app.Spec.BatchSchedulerOptions.Queue
    }
    if app.Spec.BatchSchedulerOptions.PriorityClassName != nil {
        podGroup.Spec.PriorityClassName = *app.Spec.BatchSchedulerOptions.PriorityClassName
    }
}

On Update (Buggy Behavior)

} else {
    if pg.Spec.MinMember != size {
        pg.Spec.MinMember = size
        _, err = s.volcanoClient.SchedulingV1beta1().
            PodGroups(namespace).
            Update(ctx, pg, metav1.UpdateOptions{})
    }
}

Problem

If:

  • A PodGroup <app-name>-pg already exists (e.g. created earlier using the default queue), and
  • The user updates:
spec:
  batchSchedulerOptions:
    queue: myqueue

The operator reuses the existing PodGroup but never updates its spec.queue, priorityClassName, or minResources.

This is a real operator bug.
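The stale-queue behavior can be reproduced in isolation. The sketch below uses minimal local stand-ins for the real types (which live in volcano.sh/apis and the spark-operator API packages); `buggySync` mirrors the pre-fix update path, which reconciles only MinMember:

```go
package main

import "fmt"

// Minimal stand-ins for the real Volcano/SparkApplication types
// (hypothetical simplifications for illustration).
type PodGroupSpec struct {
	MinMember int32
	Queue     string
}

type BatchSchedulerOptions struct {
	Queue *string
}

// buggySync mirrors the pre-fix update path: only MinMember is
// reconciled. opts.Queue is never consulted -- that is the bug.
func buggySync(pg *PodGroupSpec, size int32, opts *BatchSchedulerOptions) {
	if pg.MinMember != size {
		pg.MinMember = size
	}
}

func main() {
	// PodGroup created earlier on the default queue.
	pg := &PodGroupSpec{MinMember: 1, Queue: "default"}

	// User later sets batchSchedulerOptions.queue: myqueue.
	q := "myqueue"
	buggySync(pg, 2, &BatchSchedulerOptions{Queue: &q})

	fmt.Println(pg.Queue) // still "default": the queue change is silently lost
}
```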

  3. Why Two PodGroups Appear in the Cluster

From the spark-operator codebase:
  • The Volcano scheduler only creates PodGroups owned by:
ownerReference.kind: SparkApplication
  • It never creates PodGroups owned by a Pod.

Therefore, the second PodGroup observed in the issue (ownerReference.kind = Pod, Inqueue) must come from:

  • Spark’s native Volcano integration (e.g., VolcanoFeatureStep + a PodGroup template), or
  • Volcano auto-creating a PodGroup due to a different scheduling.k8s.io/group-name annotation.

Resulting Symptom

  • SparkApplication-owned PodGroup → has queue = myqueue
  • Pod-owned PodGroup → still on queue = default
  • Pods bind to the Pod-owned PodGroup → the default queue is used

This explains:

  • Duplicate PodGroups
  • Queue being ignored
  • Inconsistent scheduling behavior

  4. What This PR Changes

Before

} else {
    if pg.Spec.MinMember != size {
        pg.Spec.MinMember = size
        _, err = s.volcanoClient.SchedulingV1beta1().PodGroups(namespace).Update(
            context.TODO(), pg, metav1.UpdateOptions{},
        )
    }
}

After (This PR)

} else {
		// podGroup already exists -- keep it in sync with SparkApplication spec.
		updated := false

		// ensure MinMember matches the desired size
		if pg.Spec.MinMember != size {
			pg.Spec.MinMember = size
			updated = true
		}

		if app.Spec.BatchSchedulerOptions != nil {
			opts := app.Spec.BatchSchedulerOptions

		// keep queue consistent with BatchSchedulerOptions.Queue.
			if opts.Queue != nil && pg.Spec.Queue != *opts.Queue {
				pg.Spec.Queue = *opts.Queue
				updated = true
			}

		// keep priorityClassName consistent with BatchSchedulerOptions.PriorityClassName.
			if opts.PriorityClassName != nil && pg.Spec.PriorityClassName != *opts.PriorityClassName {
				pg.Spec.PriorityClassName = *opts.PriorityClassName
				updated = true
			}

			// if explicit resources are provided, always override MinResources.
			if len(opts.Resources) > 0 {
				res := opts.Resources
				pg.Spec.MinResources = &res
				updated = true
			}
		}

		if updated {
			_, err = s.volcanoClient.SchedulingV1beta1().PodGroups(namespace).Update(
				context.TODO(), pg, metav1.UpdateOptions{},
			)
		}
	}

This guarantees that the PodGroup always reflects:

  • batchSchedulerOptions.queue
  • batchSchedulerOptions.priorityClassName
  • batchSchedulerOptions.resources

even across re-submissions and spec changes.
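The reconciliation decision introduced by this PR can be sketched with the same kind of local stand-in types (hypothetical simplifications of the real API types). `syncSpec` returns true exactly when some field drifted and was rewritten, which is when the caller would issue the Update call against the API server:

```go
package main

import "fmt"

// Minimal stand-ins for the real Volcano/SparkApplication types
// (hypothetical simplifications for illustration).
type PodGroupSpec struct {
	MinMember         int32
	Queue             string
	PriorityClassName string
}

type BatchSchedulerOptions struct {
	Queue             *string
	PriorityClassName *string
}

// syncSpec mirrors the post-fix update path: every mutable field is
// reconciled, and the return value signals whether an Update is needed.
func syncSpec(pg *PodGroupSpec, size int32, opts *BatchSchedulerOptions) bool {
	updated := false
	if pg.MinMember != size {
		pg.MinMember = size
		updated = true
	}
	if opts != nil {
		if opts.Queue != nil && pg.Queue != *opts.Queue {
			pg.Queue = *opts.Queue
			updated = true
		}
		if opts.PriorityClassName != nil && pg.PriorityClassName != *opts.PriorityClassName {
			pg.PriorityClassName = *opts.PriorityClassName
			updated = true
		}
	}
	return updated
}

func main() {
	// Stale PodGroup on the default queue; user now wants myqueue.
	pg := &PodGroupSpec{MinMember: 1, Queue: "default"}
	q := "myqueue"
	changed := syncSpec(pg, 1, &BatchSchedulerOptions{Queue: &q})
	fmt.Println(changed, pg.Queue) // true myqueue
}
```

Note that when nothing drifted, syncSpec returns false and no Update call is made, avoiding needless writes to the API server.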

  5. Recommended Configuration to Avoid Duplicate PodGroups

Option A — Operator-managed Volcano (Recommended)

spec:
  batchScheduler: "volcano"
  batchSchedulerOptions:
    queue: "myqueue"

And in Helm values:

controller:
  batchScheduler:
    enable: true

Disable Spark’s native Volcano integration in the Spark image:

  • VolcanoFeatureStep
  • spark.kubernetes.scheduler.volcano.podGroupTemplateFile

Option B — Spark-managed Volcano

controller:
  batchScheduler:
    enable: false

And do NOT set:

spec:
  batchScheduler
  batchSchedulerOptions

Let Spark manage PodGroups via native configs.
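For reference, Spark's native integration is typically wired through spark-submit properties along these lines (illustrative only; exact keys and the template path depend on your Spark version and image, and `/path/to/podgroup-template.yaml` is a placeholder):

```
spark.kubernetes.scheduler.name=volcano
spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/path/to/podgroup-template.yaml
spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
```

Under Option B these configs are what create the Pod-owned PodGroup, so they should be set only when the operator's batch scheduler feature is disabled.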

Outcome of this PR

  • Fixes stale queue behavior
  • Keeps PodGroups fully in sync with SparkApplication spec
  • Prevents silent scheduling mismatches
  • Improves Volcano + spark-operator correctness

rahul810050 · Dec 08 '25 07:12

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign yuchaoran2011 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow[bot] · Dec 08 '25 07:12