
[MultiKueue] Report a ClusterQueue as inactive (misconfigured) if there is ProvReq used with MK

Open mimowo opened this issue 1 year ago • 18 comments

What would you like to be added:

Add validation for a ClusterQueue that has both a MultiKueue (MK) and a ProvisioningRequest (ProvReq) admission check configured.

Why is this needed:

Provisioning nodes on the management cluster does not make sense. We want to fail fast and warn the user about money possibly wasted on scaling up the cluster.

Proposed approach:

Use a mechanism similar to the one here: https://github.com/kubernetes-sigs/kueue/pull/1635.
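For illustration, the misconfiguration this would catch looks roughly like the following (the check names are hypothetical; the idea is that one referenced AdmissionCheck is backed by the MultiKueue controller and another by the ProvisioningRequest controller):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  admissionChecks:
  - multikueue-check   # AdmissionCheck backed by the MultiKueue controller
  - prov-req-check     # AdmissionCheck backed by the ProvisioningRequest controller
```

With both checks on the same queue, ProvReq would scale up nodes on the management cluster for workloads that MultiKueue actually dispatches to a worker cluster.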

mimowo avatar Apr 19 '24 14:04 mimowo

/assign @trasc /cc @alculquicondor

mimowo avatar Apr 19 '24 14:04 mimowo

I reviewed https://github.com/kubernetes-sigs/kueue/pull/2047, and I think we could follow the pattern here.

The AdmissionCheck condition would be CompatibleWithMultiKueue, and the reason for an inactive ClusterQueue could be AdmissionCheckNonCompatibleWithMultiKueue. We would perform the check inside updateWithAdmissionChecks, as for the other checks.
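A minimal sketch of what that check could do, assuming the ClusterQueue's admission checks have been resolved to the names of the controllers backing them. The `admissionCheck` type and `incompatiblePair` helper are illustrative, not Kueue's actual code; the controller name constants follow the values Kueue registers for these two controllers:

```go
package main

import "fmt"

// Controller names as registered by Kueue for the two built-in controllers
// (assumption based on Kueue's conventions; verify against the release in use).
const (
	multiKueueController = "kueue.x-k8s.io/multikueue"
	provReqController    = "kueue.x-k8s.io/provisioning-request"
)

// admissionCheck is a minimal stand-in for an AdmissionCheck referenced by a
// ClusterQueue, reduced to the fields the validation needs.
type admissionCheck struct {
	Name           string
	ControllerName string
}

// incompatiblePair reports whether the ClusterQueue's admission checks mix
// MultiKueue with ProvisioningRequest. If they do, the queue should be marked
// inactive instead of provisioning nodes on the management cluster.
func incompatiblePair(checks []admissionCheck) bool {
	var hasMK, hasProvReq bool
	for _, ac := range checks {
		switch ac.ControllerName {
		case multiKueueController:
			hasMK = true
		case provReqController:
			hasProvReq = true
		}
	}
	return hasMK && hasProvReq
}

func main() {
	checks := []admissionCheck{
		{Name: "multikueue-check", ControllerName: multiKueueController},
		{Name: "prov-req-check", ControllerName: provReqController},
	}
	if incompatiblePair(checks) {
		fmt.Println("ClusterQueue inactive: reason AdmissionCheckNonCompatibleWithMultiKueue")
	}
}
```

The scan is a single pass over the checks, so it fits naturally into an existing per-queue update path such as updateWithAdmissionChecks.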

mimowo avatar Apr 26 '24 08:04 mimowo

The only problem is that the condition would be specific to MultiKueue. What if other checks need similar semantics against other checks?

I would rather sit on this one for now until we observe more admission checks, in-tree or out-of-tree.

alculquicondor avatar Apr 26 '24 13:04 alculquicondor

What if other checks need similar semantics against others?

Right, this approach cannot be used for arbitrary pairs of admission checks. However, MultiKueue seems to be more than just an admission check. For example, it has global configuration in the config map (link).

I would rather sit on this one for now until we observe more admission checks, in-tree or out-of-tree.

I see, but it can take a long time until we have other pairs of AdmissionChecks which don't like each other, and having some protection before graduating MK and ProvReq to Beta would be nice.

The approach using the existing mechanism should be very quick to implement, and if one day we have a more generic mechanism, developed for the needs of other AC pairs, then we could switch to it.

mimowo avatar May 07 '24 16:05 mimowo

Let's wait and see

alculquicondor avatar May 07 '24 17:05 alculquicondor

/assign

vladikkuzn avatar May 16 '24 16:05 vladikkuzn

/unassign

vladikkuzn avatar May 17 '24 07:05 vladikkuzn

/assign

bouaouda-achraf avatar Jul 07 '24 19:07 bouaouda-achraf

@mimowo I don't think we have a proper design for this, and it hasn't proved to be very useful. Should we close it?

alculquicondor avatar Jul 08 '24 12:07 alculquicondor

I'm ok to close it until we revisit the design or see some evidence of users running into this issue.

mimowo avatar Jul 08 '24 13:07 mimowo

/close

alculquicondor avatar Jul 08 '24 14:07 alculquicondor

@alculquicondor: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 08 '24 14:07 k8s-ci-robot

/reopen

I believe with the recent changes (https://github.com/kubernetes-sigs/kueue/pull/3254) to make the cache aware of the MultiKueue and ProvisioningRequest AdmissionChecks, we can easily validate these conditions. cc @mbobrovskyi @mszadkow

mimowo avatar Dec 06 '24 15:12 mimowo

@mimowo: Reopened this issue.

In response to this:

/reopen

I believe with the recent changes (https://github.com/kubernetes-sigs/kueue/pull/3254) to make the cache aware of the MultiKueue and ProvisioningRequest AdmissionChecks, we can easily validate these conditions. cc @mbobrovskyi @mszadkow

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Dec 06 '24 15:12 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 06 '25 16:03 k8s-triage-robot

/remove-lifecycle stale

mimowo avatar Mar 06 '25 16:03 mimowo

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 04 '25 16:06 k8s-triage-robot

Hitting this and unsure why it is happening. Any insights?

kubectl describe workload:

Status:
  Conditions:
    Last Transition Time:  2025-06-16T21:48:02Z
    Message:               ClusterQueue cluster-queue is inactive
    Observed Generation:   1
    Reason:                Inadmissible
    Status:                False
    Type:                  QuotaReserved

kubectl get clusterqueue:

NAME            COHORT   PENDING WORKLOADS
cluster-queue            31

kubectl get localqueue:

NAME               CLUSTERQUEUE    PENDING WORKLOADS   ADMITTED WORKLOADS
multislice-queue   cluster-queue   31                  0

samos123 avatar Jun 16 '25 21:06 samos123

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jul 16 '25 22:07 k8s-triage-robot

/remove-lifecycle rotten

mimowo avatar Aug 07 '25 08:08 mimowo

@samos123 what is your CQ configuration? Could you provide the entire kubectl describe output for the CQ?
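For anyone hitting the same symptom: the reason a queue is inactive normally surfaces on the ClusterQueue itself rather than on the workloads, so inspecting its Active condition is the quickest check (commands assume the queue name from the report above):

kubectl describe clusterqueue cluster-queue
kubectl get clusterqueue cluster-queue \
  -o jsonpath='{.status.conditions[?(@.type=="Active")].message}'

The message on the Active condition should name the misconfiguration, e.g. a missing flavor or a broken AdmissionCheck reference.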

mimowo avatar Aug 07 '25 08:08 mimowo

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 05 '25 08:11 k8s-triage-robot

/remove-lifecycle stale

mimowo avatar Nov 05 '25 08:11 mimowo