
Fix: Sync Volcano PodGroup Queue & Priority on Update

Open rahul810050 opened this issue 2 weeks ago • 1 comment

Summary

This PR fixes issue #2526 in the Volcano batch scheduler integration, where re-syncing an existing PodGroup updated minMember but did not update:

  • spec.queue
  • spec.priorityClassName
  • spec.minResources

As a result, changing spec.batchSchedulerOptions.queue after an initial submission could leave the PodGroup stuck on a stale queue (e.g., default), even though the SparkApplication specified a new queue.

This PR ensures that the existing PodGroup is always kept fully in sync with the latest batchSchedulerOptions.

  1. Current Behavior (Before This Fix)

When spec.batchScheduler: "volcano" is set and the controller batch scheduler feature is enabled, the execution flow is:

  1. The controller calls submitSparkApplication.
  2. submitSparkApplication calls shouldDoBatchScheduling.
  3. For batchScheduler = "volcano", the controller executes:
scheduler.Schedule(app)  // internal/scheduler/volcano/scheduler.go
  4. Inside Scheduler.Schedule, for cluster mode it calls syncPodGroupInClusterMode(app). This function:
      - Computes total resources (driver + executor, or overridden by BatchSchedulerOptions.Resources)
      - Calls:
      ```go
      s.syncPodGroup(app, 1, totalResource)
      ```
      - Then sets:
app.Spec.Driver.Annotations["scheduling.k8s.io/group-name"] = "<app-name>-pg"
app.Spec.Executor.Annotations["scheduling.k8s.io/group-name"] = "<app-name>-pg"
  5. Later, while building spark-submit arguments, these annotations are converted into:
--conf spark.kubernetes.{driver,executor}.annotation.*

So the actual pods are created with:

  • metadata.annotations["scheduling.k8s.io/group-name"] = "<app-name>-pg"
  • spec.schedulerName = "volcano" (set by sparkpod_defaulter.go from spec.batchScheduler)
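The annotation-to-flag conversion described above can be sketched as follows. This is a hypothetical, simplified helper (the real logic lives in the operator's spark-submit argument builder), shown only to illustrate how the same `<app-name>-pg` group name ends up on both driver and executor pods:

```go
package main

import "fmt"

// groupNameConfs sketches how the operator-derived PodGroup annotation
// becomes spark-submit --conf flags for the driver and executor pods.
// Hypothetical simplification; not the actual operator function.
func groupNameConfs(appName string) []string {
	pg := fmt.Sprintf("%s-pg", appName) // matches getPodGroupName(app)
	key := "scheduling.k8s.io/group-name"
	return []string{
		fmt.Sprintf("--conf spark.kubernetes.driver.annotation.%s=%s", key, pg),
		fmt.Sprintf("--conf spark.kubernetes.executor.annotation.%s=%s", key, pg),
	}
}

func main() {
	for _, c := range groupNameConfs("spark-pi") {
		fmt.Println(c)
	}
}
```

Both pods therefore point at the same SparkApplication-owned PodGroup, provided nothing else injects a different group name.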

Resulting Behavior

The operator creates exactly one PodGroup per SparkApplication:

getPodGroupName(app) = fmt.Sprintf("%s-pg", app.Name)

With:

  • ownerReference.kind = "SparkApplication"
  • spec.minMember
  • spec.minResources
  • spec.queue and spec.priorityClassName (only on initial creation)

  2. Root Cause of the Bug

The PodGroup creation & update logic lives in:

internal/scheduler/volcano/scheduler.go → syncPodGroup(...)

On Create (Correct Behavior)

podGroup := v1beta1.PodGroup{
    Spec: v1beta1.PodGroupSpec{
        MinMember:    size,
        MinResources: &minResource,
    },
}

if app.Spec.BatchSchedulerOptions != nil {
    if app.Spec.BatchSchedulerOptions.Queue != nil {
        podGroup.Spec.Queue = *app.Spec.BatchSchedulerOptions.Queue
    }
    if app.Spec.BatchSchedulerOptions.PriorityClassName != nil {
        podGroup.Spec.PriorityClassName = *app.Spec.BatchSchedulerOptions.PriorityClassName
    }
}

On Update (Buggy Behavior)

} else {
    if pg.Spec.MinMember != size {
        pg.Spec.MinMember = size
        _, err = s.volcanoClient.SchedulingV1beta1().
            PodGroups(namespace).
            Update(ctx, pg, metav1.UpdateOptions{})
    }
}

Problem

If:

  • A PodGroup <app-name>-pg already exists (e.g. created earlier using the default queue), and
  • The user updates:
spec:
  batchSchedulerOptions:
    queue: myqueue

The operator reuses the existing PodGroup but never updates its spec.queue, priorityClassName, or minResources.

This is a real operator bug.
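The stale-queue behavior can be reproduced in isolation. The sketch below uses minimal local stand-ins for the real types (which live in volcano.sh/apis and the spark-operator API packages); `buggySync` mirrors the pre-fix update path, which reconciles only MinMember:

```go
package main

import "fmt"

// Minimal stand-ins for the real Volcano/SparkApplication types
// (hypothetical simplifications for illustration).
type PodGroupSpec struct {
	MinMember int32
	Queue     string
}

type BatchSchedulerOptions struct {
	Queue *string
}

// buggySync mirrors the pre-fix update path: only MinMember is
// reconciled. opts.Queue is never consulted -- that is the bug.
func buggySync(pg *PodGroupSpec, size int32, opts *BatchSchedulerOptions) {
	if pg.MinMember != size {
		pg.MinMember = size
	}
}

func main() {
	// PodGroup created earlier on the default queue.
	pg := &PodGroupSpec{MinMember: 1, Queue: "default"}

	// User later sets batchSchedulerOptions.queue: myqueue.
	q := "myqueue"
	buggySync(pg, 2, &BatchSchedulerOptions{Queue: &q})

	fmt.Println(pg.Queue) // still "default": the queue change is silently lost
}
```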

  3. Why Two PodGroups Appear in the Cluster

From the spark-operator codebase:
  • The Volcano scheduler only creates PodGroups owned by:
ownerReference.kind: SparkApplication
  • It never creates PodGroups owned by a Pod.

Therefore, the second PodGroup observed in the issue (ownerReference.kind = Pod, Inqueue) must come from:

  • Spark’s native Volcano integration (e.g., VolcanoFeatureStep + a PodGroup template), or
  • Volcano auto-creating a PodGroup due to a different scheduling.k8s.io/group-name annotation.

Resulting Symptom

  • SparkApplication-owned PodGroup → has queue = myqueue
  • Pod-owned PodGroup → still on queue = default
  • Pods bind to the Pod-owned PodGroup → the default queue is used

This explains:

  • Duplicate PodGroups
  • Queue being ignored
  • Inconsistent scheduling behavior

  4. What This PR Changes

Before

} else {
    if pg.Spec.MinMember != size {
        pg.Spec.MinMember = size
        _, err = s.volcanoClient.SchedulingV1beta1().PodGroups(namespace).Update(
            context.TODO(), pg, metav1.UpdateOptions{},
        )
    }
}

After (This PR)

} else {
		// podGroup already exists -- keep it in sync with SparkApplication spec.
		updated := false

		// ensure MinMember matches the desired size
		if pg.Spec.MinMember != size {
			pg.Spec.MinMember = size
			updated = true
		}

		if app.Spec.BatchSchedulerOptions != nil {
			opts := app.Spec.BatchSchedulerOptions

		// keep queue consistent with BatchSchedulerOptions.Queue.
			if opts.Queue != nil && pg.Spec.Queue != *opts.Queue {
				pg.Spec.Queue = *opts.Queue
				updated = true
			}

		// keep priorityClassName consistent with BatchSchedulerOptions.PriorityClassName.
			if opts.PriorityClassName != nil && pg.Spec.PriorityClassName != *opts.PriorityClassName {
				pg.Spec.PriorityClassName = *opts.PriorityClassName
				updated = true
			}

			// if explicit resources are provided, always override MinResources.
			if len(opts.Resources) > 0 {
				res := opts.Resources
				pg.Spec.MinResources = &res
				updated = true
			}
		}

		if updated {
			_, err = s.volcanoClient.SchedulingV1beta1().PodGroups(namespace).Update(
				context.TODO(), pg, metav1.UpdateOptions{},
			)
		}
	}

This guarantees that the PodGroup always reflects:

  • batchSchedulerOptions.queue
  • batchSchedulerOptions.priorityClassName
  • batchSchedulerOptions.resources

even across re-submissions and spec changes.
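The reconciliation decision introduced by this PR can be sketched with the same kind of local stand-in types (hypothetical simplifications of the real API types). `syncSpec` returns true exactly when some field drifted and was rewritten, which is when the caller would issue the Update call against the API server:

```go
package main

import "fmt"

// Minimal stand-ins for the real Volcano/SparkApplication types
// (hypothetical simplifications for illustration).
type PodGroupSpec struct {
	MinMember         int32
	Queue             string
	PriorityClassName string
}

type BatchSchedulerOptions struct {
	Queue             *string
	PriorityClassName *string
}

// syncSpec mirrors the post-fix update path: every mutable field is
// reconciled, and the return value signals whether an Update is needed.
func syncSpec(pg *PodGroupSpec, size int32, opts *BatchSchedulerOptions) bool {
	updated := false
	if pg.MinMember != size {
		pg.MinMember = size
		updated = true
	}
	if opts != nil {
		if opts.Queue != nil && pg.Queue != *opts.Queue {
			pg.Queue = *opts.Queue
			updated = true
		}
		if opts.PriorityClassName != nil && pg.PriorityClassName != *opts.PriorityClassName {
			pg.PriorityClassName = *opts.PriorityClassName
			updated = true
		}
	}
	return updated
}

func main() {
	// Stale PodGroup on the default queue; user now wants myqueue.
	pg := &PodGroupSpec{MinMember: 1, Queue: "default"}
	q := "myqueue"
	changed := syncSpec(pg, 1, &BatchSchedulerOptions{Queue: &q})
	fmt.Println(changed, pg.Queue) // true myqueue
}
```

Note that when nothing drifted, syncSpec returns false and no Update call is made, avoiding needless writes to the API server.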

  5. Recommended Configuration to Avoid Duplicate PodGroups

Option A — Operator-managed Volcano (Recommended)

spec:
  batchScheduler: "volcano"
  batchSchedulerOptions:
    queue: "myqueue"

And in Helm values:

controller:
  batchScheduler:
    enable: true

Disable Spark’s native Volcano integration in the Spark image:

  • VolcanoFeatureStep
  • spark.kubernetes.scheduler.volcano.podGroupTemplateFile

Option B — Spark-managed Volcano

controller:
  batchScheduler:
    enable: false

And do NOT set:

spec:
  batchScheduler
  batchSchedulerOptions

Let Spark manage PodGroups via native configs.
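For reference, Spark's native integration is typically wired through spark-submit properties along these lines (illustrative only; exact keys and the template path depend on your Spark version and image, and `/path/to/podgroup-template.yaml` is a placeholder):

```
spark.kubernetes.scheduler.name=volcano
spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/path/to/podgroup-template.yaml
spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
```

Under Option B these configs are what create the Pod-owned PodGroup, so they should be set only when the operator's batch scheduler feature is disabled.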

Outcome of this PR

  • Fixes stale queue behavior
  • Keeps PodGroups fully in sync with SparkApplication spec
  • Prevents silent scheduling mismatches
  • Improves Volcano + spark-operator correctness

rahul810050 · Dec 08 '25 07:12

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign yuchaoran2011 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow[bot] · Dec 08 '25 07:12