Unnecessary hard check validation edge case for queue

Open abhijeet-dhumal opened this issue 6 months ago • 4 comments

What happened?

In the kubeflow-training SDK, version 1.9.3:

If the 'queue_name' parameter of TrainingClient's create_job() function names a LocalQueue that does not exist in the given test namespace, the user gets the warning shown below, and the queue label is not added to the job because of this hard validation check: https://github.com/kubeflow/trainer/blob/52e41d5d124be36d7eb9754e47f7b21efe67b1ac/sdk/python/kubeflow/training/api/training_client.py#L603

For example:

from kubeflow.training import TrainingClient

tc = TrainingClient()
tc.create_job(
    name="test-pytorch",
    namespace="<test-namespace>",
    train_func=<train_func>,
    num_workers=1,
    queue_name="<non-existent-queue>",
)

Warning: Queue '<non-existent-queue>' does not exist in namespace '<test-namespace>'. The job will be created but may not be managed by Kueue.

Result: the PyTorchJob is created without the queue label.

What did you expect to happen?

The queue label should still be added when the LocalQueue does not exist, with the warning above still shown.

Expected PyTorchJob created via create_job() with a non-existent queue:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: <non-existent-queue>
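
The expected behavior can be sketched roughly as follows. This is only an illustration, not the SDK's actual code: apply_queue_label() and its queue_exists argument are hypothetical names, and the existence lookup is assumed to happen elsewhere. The point is that a missing queue only triggers the warning, while the label is set either way.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)

# Label key taken from the expected manifest in this issue.
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def apply_queue_label(
    labels: Optional[Dict[str, str]],
    queue_name: str,
    namespace: str,
    queue_exists: bool,
) -> Dict[str, str]:
    """Hypothetical helper: return a copy of labels with the queue label set.

    queue_exists is assumed to be the result of a LocalQueue lookup done
    by the caller; a missing queue is reported but does not block labeling.
    """
    merged = dict(labels or {})
    if not queue_exists:
        # Soft check: warn only, do not skip the label.
        logger.warning(
            "Queue '%s' does not exist in namespace '%s'. "
            "The job will be created but may not be managed by Kueue.",
            queue_name,
            namespace,
        )
    merged[QUEUE_LABEL] = queue_name
    return merged
```

With this shape, a job submitted before its LocalQueue exists would still carry the label once the queue is created later.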

Workaround:

Create the job with the labels parameter instead of queue_name:

from kubeflow.training import TrainingClient

tc = TrainingClient()
tc.create_job(
    name="test-pytorch",
    namespace="<test-namespace>",
    train_func=<train_func>,
    num_workers=1,
    labels={
        "kueue.x-k8s.io/queue-name": "<non-existent-queue>",
    },
)

Environment

PyTorchJob created in a Kubernetes cluster

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

abhijeet-dhumal, Aug 06 '25 14:08

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot], Nov 04 '25 15:11

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot], Nov 24 '25 15:11

/reopen

tenzen-y, Nov 25 '25 00:11

@tenzen-y: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot], Nov 25 '25 00:11

Hi! I’d like to work on this issue if it’s still available.
I’m starting to investigate and will share progress soon.

Thesmoothengineer, Feb 05 '26 21:02