
DRA: ReservedFor Workloads

Open johnbelamaric opened this issue 1 year ago • 54 comments

Enhancement Description

Currently, when the scheduler allocates a ResourceClaim for a given Pod, it adds that Pod to the ResourceClaimStatus.ReservedFor list. A claim shared among multiple pods will have multiple entries in this list. This allows the ResourceClaimController to know when to de-allocate the claim; it does so once this list is empty.

The length of this list is limited to 256 pods. However, some workloads are much larger and may share a resource claim across many more pods, even thousands. Simply increasing the pod list to thousands of entries is not a good long-term solution.

Instead, this proposal allows a claim to be reserved for a workload. For example, rather than listing individual pods, you could list the Job, ReplicaSet, or StatefulSet that is sharing the ResourceClaim. This avoids race conditions as pods come and go, without requiring every pod to be listed.
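Sketched concretely (an illustrative Python model, not the Kubernetes API; ConsumerRef is a hypothetical stand-in for the real ResourceClaimConsumerReference type):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConsumerRef:
    """Hypothetical stand-in for ResourceClaimConsumerReference."""
    resource: str  # e.g. "pods" or "jobs"
    name: str
    uid: str

# Today: one ReservedFor entry per pod, capped at 256.
pod_entries = [ConsumerRef("pods", f"myjob-{i}", f"uid-{i}") for i in range(1000)]
assert len(pod_entries) > 256  # exceeds the current limit

# Proposed: a single entry for the workload that owns all those pods.
workload_entry = [ConsumerRef("jobs", "myjob", "uid-job")]
assert len(workload_entry) == 1
```

A single workload entry stays constant as pods are created and deleted, which is what removes both the churn and the 256-entry ceiling.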

cc @pohly @klueska @thebinaryone1

  • One-line enhancement description (can be used as a release note): Enable resource claims to be reserved for workloads, not just individual pods
  • Kubernetes Enhancement Proposal: https://github.com/kubernetes/enhancements/pull/5379
  • Discussion Link: https://kubernetes.slack.com/archives/C0409NGC1TK/p1741647231531399
  • Primary contact (assignee): @johnbelamaric
  • Responsible SIGs: sig-scheduling
  • Enhancement target (which target equals to which milestone):
    • Alpha release target (x.y): 1.34
    • Beta release target (x.y): 1.35
    • Stable release target (x.y): 1.36
  • [ ] Alpha
    • [ ] KEP (k/enhancements) update PR(s):
      • [ ] https://github.com/kubernetes/enhancements/pull/5379
    • [ ] Code (k/k) update PR(s):
    • [ ] Docs (k/website) update PR(s):

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

johnbelamaric avatar Mar 11 '25 18:03 johnbelamaric

/sig scheduling /sig node /wg device-management

johnbelamaric avatar Mar 11 '25 18:03 johnbelamaric

/retitle DRA: ReservedFor Workloads

johnbelamaric avatar Mar 11 '25 18:03 johnbelamaric

Two current options that @pohly and I have been discussing with respect to this:

Option One: Add API for this

  • claim.spec.reservedFor *ResourceClaimConsumerReference "If set, the scheduler will reserve the claim for this consumer instead of the pod which triggered allocation. It is the responsibility of whoever set up the claim like this to remove this consumer from the status.reservedFor when the claim can be deallocated. They may also remove the allocation at the same time; otherwise the resource claim controller will do that."
  • claim.status.allocation.reservedForAnyPod bool "If true, then the scheduler will schedule pods referencing the claim without adding them to reservedFor. The kubelet will not check whether a pod is listed as consumer of the claim when starting the pod."

In this option, the scheduler sets the boolean to true if the spec field is populated, and it copies the spec field to the status.
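A minimal sketch of that scheduler step (Python pseudocode of Option One, not actual kube-scheduler code; the names follow the fields proposed above):

```python
class Claim:
    """Toy model of a ResourceClaim; only the fields relevant here."""
    def __init__(self, spec_reserved_for=None):
        self.spec_reserved_for = spec_reserved_for   # claim.spec.reservedFor
        self.status_reserved_for = []                # claim.status.reservedFor
        self.reserved_for_any_pod = False            # claim.status.allocation.reservedForAnyPod

def reserve(claim, pod):
    """What the scheduler would do at reservation time under Option One."""
    if claim.spec_reserved_for is not None:
        # Copy the spec field to the status and set the boolean,
        # instead of adding the triggering pod.
        claim.reserved_for_any_pod = True
        if claim.spec_reserved_for not in claim.status_reserved_for:
            claim.status_reserved_for.append(claim.spec_reserved_for)
    else:
        # Current behavior: the pod itself is recorded as the consumer.
        claim.status_reserved_for.append(pod)

c = Claim(spec_reserved_for="job/myjob")
reserve(c, "pod/myjob-0")
reserve(c, "pod/myjob-1")  # second pod: no extra entry is added
assert c.status_reserved_for == ["job/myjob"] and c.reserved_for_any_pod
```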

Option Two: No API, more implicit

In this option, the workload controller that is creating the resource claim updates the status after creating the claim, but before creating any pods referencing the claim:

  1. Workload controller creates the resource claim
  2. Workload controller updates the status on the resource claim with itself; since no pods reference it yet, scheduler won't yet have touched the status. This would require relaxing the current validation which only allows allocated claims to be reserved.
  3. Workload controller creates pods pointing to the RC
  4. Scheduler sees the pod referencing the RC, sees non-pods in the status reservedFor, and so proceeds with allocation without adding the pod
  5. kubelet sees a non-pod entry in reservedFor, so it doesn't do the sanity check
  6. ResourceClaim controller doesn't de-allocate unless the status reservedFor is empty; it is the responsibility of the workload controller to clear the reservedFor (as in the first option).
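The flow above boils down to three predicates over the reservedFor list, roughly (an illustrative Python sketch; the dicts are simplified stand-ins for ResourceClaimConsumerReference):

```python
def has_non_pod_consumer(reserved_for):
    return any(ref["resource"] != "pods" for ref in reserved_for)

def scheduler_should_add_pod(reserved_for):
    # Step 4: don't add the pod if a workload already reserved the claim.
    return not has_non_pod_consumer(reserved_for)

def kubelet_checks_reservation(reserved_for):
    # Step 5: skip the sanity check when a non-pod consumer is present.
    return not has_non_pod_consumer(reserved_for)

def controller_may_deallocate(reserved_for):
    # Step 6: deallocate only once the list is empty; clearing it is
    # the workload controller's job.
    return len(reserved_for) == 0

reserved = [{"resource": "jobs", "name": "myjob"}]
assert not scheduler_should_add_pod(reserved)
assert not kubelet_checks_reservation(reserved)
assert not controller_may_deallocate(reserved)
assert controller_may_deallocate([])  # after the workload controller clears it
```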

With either option, we can tighten the ResourceClaimController behavior up a bit to avoid orphans by letting it follow the references in the reservedFor and remove any that point to objects that no longer exist. But that would require read permissions for that controller on those objects.

johnbelamaric avatar Mar 11 '25 18:03 johnbelamaric

Another idea: perhaps this can be combined with binding conditions to implement gang scheduling?

  • We add claim.spec.bindingConditions and friends.
  • The app controller sets that when creating the claim.
  • The scheduler stalls binding pods like it does for binding conditions published by a driver.
  • Once all pods are pending on the binding condition, the app controller sets the binding condition in the claim status.
  • Scheduling of all pods completes.

The missing piece, besides the new API, is detecting that "all pods are pending on the binding condition". KEP #5007 currently doesn't specify or set pod conditions. Those may be needed; details TBD.
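Under the assumption that pods could somehow report a pending-binding condition (the open question noted above), the app controller's gate might look like this illustrative Python sketch:

```python
def all_pods_pending_on_binding(pods, expected_count):
    """'condition' is a hypothetical placeholder for however pods would
    report that they are stalled on the binding condition (open question
    from KEP #5007)."""
    pending = [p for p in pods if p.get("condition") == "PendingBinding"]
    return len(pending) == expected_count

def maybe_set_binding_condition(claim, pods, gang_size):
    # The app controller flips the condition only once the whole gang is
    # stalled on binding, letting the scheduler complete binding for all.
    if all_pods_pending_on_binding(pods, gang_size):
        claim["status"]["bindingConditionMet"] = True

claim = {"status": {"bindingConditionMet": False}}
pods = [{"condition": "PendingBinding"} for _ in range(3)]
maybe_set_binding_condition(claim, pods, gang_size=3)
assert claim["status"]["bindingConditionMet"]
```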

cc @dom4ha @KobayashiD27

pohly avatar Mar 12 '25 07:03 pohly

/cc

wojtek-t avatar Mar 12 '25 14:03 wojtek-t

Another idea: perhaps this can be combined with binding conditions to implement gang scheduling?

Super interesting idea. I hadn't thought of that. [btw, this is a win for the composability of features...]

How is it better / different than the existing support for gang scheduling?

I think this would work; I still think though that ultimately pod-by-pod scheduling is not ideal for solving the gang scheduling problem. I still hope we can work out a solution for a broader ecosystem of schedulers to more effectively share sub-node level resources, as we have discussed in other issues/KEPs/docs.

johnbelamaric avatar Mar 12 '25 16:03 johnbelamaric

Another idea: perhaps this can be combined with binding conditions to implement gang scheduling?

Super interesting idea. I hadn't thought of that. [btw, this is a win for the composability of features...]

How is it better / different than the existing support for gang scheduling?

I was also thinking about leveraging ResourceClaims as a form of grouping, but we probably should not require having shared ResourceClaims to enable gang scheduling. Sometimes individual claims are better, since they can get released when no longer used.

I like the original idea, however, because I think we should define more types of objects that are schedulable. By schedulable, I mean that there is a scheduler that knows how to schedule them, and it does not have to be kube-scheduler. Schedulable objects do not necessarily have to be executable objects; I could imagine a ResourceClaim itself being schedulable as long as we had a scheduler capable of scheduling it ("I'm requesting a whole machine"). Obviously, scheduling a non-executable object just means allocating resources.

Also, if we knew the schedulable object type, different gang-scheduling strategies could be used (LeaderWorkerSet should know which pod is the leader).

dom4ha avatar Mar 20 '25 18:03 dom4ha

I could imagine ResourceClaim itself being schedulable as long as we'd have scheduler capable of scheduling it (I'm requesting a whole machine).

I do believe we should make ResourceClaim schedulable (conceptually one can say they kind-of already are). I just don't think that necessarily implies the whole machine - I would like to schedule only a subset of it.

wojtek-t avatar Mar 21 '25 14:03 wojtek-t

/cc

ffromani avatar Mar 30 '25 16:03 ffromani

/assign

mortent avatar May 20 '25 16:05 mortent

In order to evaluate the different alternatives described in https://github.com/kubernetes/enhancements/issues/5194#issuecomment-2715318412, I think we need to understand how the ReservedFor field is being used. I tried to look through the code and I can find three ways this field is used:

  • The dynamicresources scheduler plugin:

    • It is used in PreFilter to make sure that a pod already listed in a claim's ReservedFor is not reported as unschedulable just because the claim is already in use. There is no way to end up in this code path today, since the CanBeReserved function always returns true.
    • It is used in PostFilter to locate claims that can be candidates for deallocation.
    • The field is also set in bindClaim (called from PreBind) and removed in Unreserve.
  • The ResourceClaim controller:

    • It is used to keep track of the pods referencing ResourceClaims generated from ResourceClaimTemplates so that they can be cleaned up.
  • The DeviceTaintEviction controller:

    • Uses the ReservedFor list to find the pods that need to be evicted as a result of a NoExecute taint.

mortent avatar May 20 '25 23:05 mortent

There is also a sanity check in the kubelet which prevents running pods with claims where they are not listed in ReservedFor. This is to catch mistakes and already mitigated a potential vulnerability, see https://github.com/kubernetes/kubernetes/pull/131844.

pohly avatar May 21 '25 06:05 pohly

So looking more into this, a few things have come up that I want to clarify.

The resourceclaim controller currently stamps out a ResourceClaim per ResourceClaimTemplate reference for pods. This means that there will only be a single reference to each generated ResourceClaim. So do we need to use the ReservedFor list for cleaning up the generated ResourceClaims? My first thought is that the owner references should be sufficient, but there might be corner cases where relying only on those will fail?

The API allows the ReservedFor list to have references to arbitrary types, not just pods. But I think we currently only store pod references in the list. Do we have a use-case for storing references to other types than pods here? I was looking at the proposal in https://github.com/kubernetes-sigs/jobset/pull/853, which would add a ResourceClaimTemplate reference to a JobSet so that it can stamp out a ResourceClaim per Job. So it could add a reference to the Job to the ReservedFor list, but it seems like the only reason for doing so would be that it needs it for cleanup (ref previous paragraph).

And if there are references to types other than pods in the ReservedFor list, how will the kubelet and devicetainteviction controller handle those? Finding the pods that belong to an arbitrary type is challenging. We might assume that taints and evictions would need to be handled by other controllers, but there doesn't seem to be a way to handle this in the kubelet. Should it just skip the check if there are non-pod references?

mortent avatar May 28 '25 17:05 mortent

So do we need to use the ReservedFor list for cleaning up the generated ResourceClaims? My first thought is that the owner references should be sufficient, but there might be corner cases where relying only on those will fail?

Owner references can be set and unset by normal users. "Reserved for" is managed by the system. We shouldn't give users the opportunity to break resource tracking.

Do we have a use-case for storing references to other types than pods here?

Not at the moment. It was one of those forward-looking API decisions: if we ever need something else, we have the API ready for it. But it needs additional changes to work for pods which are referenced indirectly.

And if there are references to types other than pods in the ReservedFor list, how will the kubelet and devicetainteviction controller handle those?

They don't :expressionless:

pohly avatar May 30 '25 08:05 pohly

Not at the moment. It was one of those forward-looking API decisions: if we ever need something else, we have the API ready for it. But it needs additional changes to work for pods which are referenced indirectly.

So I think https://github.com/kubernetes-sigs/jobset/pull/853 is a use-case for this. It requires a different controller than the resourceclaim controller to create a ResourceClaim per Job. The owner ref will be from the ResourceClaim to the Job, so, as you mentioned, it will also need a reference to the Job in the ReservedFor list.

And if there are references to types other than pods in the ReservedFor list, how will the kubelet and devicetainteviction controller handle those?

They don't 😑

So thinking about this, are we essentially trying to use the ReservedFor field for two different things right now? There is the reference to the owning resource, which will probably always (?) mirror the owner reference, and then there is keeping track of the pods that are currently using the ResourceClaim. If we ignore the size limit issue with the ReservedFor list for a moment, would it be useful to have separate fields for this? We could have one field that mirrors the owner ref, used only by the controller responsible for stamping out ResourceClaims from ResourceClaimTemplates and managing their lifecycle, and then use the ReservedFor list only to keep track of the pods currently using the ResourceClaim. Although it does feel a bit weird to have a field that will always mirror the owner refs...?

And to address the size limit issue with the ReservedFor field, it seems like we need a solution that doesn't require all pods to be listed in the ResourceClaim. And that means that any controller (like the device_taint_eviction controller) that needs to know the pods using a ResourceClaim would need to track them internally. I haven't looked into this in detail, but it seems like it would be doable for the device_taint_eviction controller, although it might be expensive. I'm a bit more uncertain about what should be the rules for deallocation, but the alternatives described in https://github.com/kubernetes/enhancements/issues/5194#issuecomment-2715318412 should apply I think.

mortent avatar May 30 '25 16:05 mortent

ReservedFor does one thing: "keeping track of the pods that are currently using the ResourceClaim."

That the same pod sometimes happens to be also the owner of the ResourceClaim is orthogonal. Not all claims owned by a pod are also necessarily reserved for it (now or ever) and not all claims reserved for it are owned by it.

The ownership is used by the garbage collector. ReservedFor is used by the ResourceClaim controller, scheduler, and the kubelet.

we need a solution that doesn't require all pods to be listed in the ResourceClaim

The problem then becomes the lack of atomicity for updates of unrelated objects. I think there was a proposal for atomic updates of the same type - perhaps that is something that we could leverage.

pohly avatar May 30 '25 18:05 pohly

ReservedFor does one thing: "keeping track of the pods that are currently using the ResourceClaim."

That makes sense. That there is a single purpose for this field makes it easier to think about. But is there any scenario where the references will be to anything other than pods with this definition? I guess one could argue that a reference to a workload resource like Deployment or StatefulSet would allow the controller to discover the pods belonging to those, but I don't think that is the same as using the ResourceClaim.

Based on your previous comment that the generic references here were a bit forward-looking, maybe it is best to think of the ReservedFor list as only referencing pods?

That the same pod sometimes happens to be also the owner of the ResourceClaim is orthogonal. Not all claims owned by a pod are also necessarily reserved for it (now or ever) and not all claims reserved for it are owned by it.

So a pod using a ResourceClaim it doesn't own seems pretty normal and I can think of several examples of that. It is not as obvious how we would have a pod that owns a ResourceClaim, but doesn't use it, other than before allocation and after deallocation and for failure scenarios. But definitely something that can happen.

The ownership is used by the garbage collector. ReservedFor is used by the ResourceClaim controller, scheduler, and the kubelet.

Yeah, I think I misunderstood one of your previous comments here. Would it be correct to say that the resourceclaim controller will always be responsible for deallocation, which depends on the ReservedFor list, regardless of who is the owner of the ResourceClaim?

It would be interesting to implement a controller for something like https://github.com/kubernetes-sigs/jobset/pull/853 to see how it will work. I might look at it depending on the timeline for this.

we need a solution that doesn't require all pods to be listed in the ResourceClaim

The problem then becomes the lack of atomicity for updates of unrelated objects. I think there was a proposal for atomic updates of the same type - perhaps that is something that we could leverage.

Could you give more details on this?

I assume this means you think the ideas mentioned in https://github.com/kubernetes/enhancements/issues/5194#issuecomment-2715318412 also require us to address this issue?

mortent avatar May 30 '25 21:05 mortent

But is there any scenario where the references will be to anything other than pods with this definition?

Let me clarify: the field tracks "consumers" of the claim, and those could theoretically be something other than pods. In practice, they are currently always pods. System components either ignore other entries (eviction controller, kubelet) or refuse to handle the claim (resourceclaim controller during deallocation).

maybe it is best to think of the ReservedFor list as only referencing pods?

Yes, for now.

It is not as obvious how we would have a pod that owns a ResourceClaim, but doesn't use it

A user can create this. Why they would want that is open, but it's possible.

Would it be correct to say that the resourceclaim controller will always be responsible for deallocation, which depends on the ReservedFor list, regardless of who is the owner of the ResourceClaim?

Yes.

It would be interesting to implement a controller for something like https://github.com/kubernetes-sigs/jobset/pull/853 to see how it will work. I might look at it depending on the timeline for this.

The problem in this case will be kubelet: how does it know that pods of the jobset are allowed to use the claim?

Could you give more details on this?

Simple race: multiple pods share the same claim A. During scheduling, the scheduler adds a new pod to object B, which complements claim A, while concurrently the resourceclaim controller observes the termination of the last consumer of claim A and deallocates it. Now the pod is scheduled without a valid allocation.
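The interleaving can be shown in a few lines (illustrative Python; object B stands for whatever external object lists the claim's consumers):

```python
# Toy state shared between the scheduler and the resourceclaim controller.
claim = {"allocated": True, "reserved_for": ["pod-1"]}
members_of_b = {"pod-1", "pod-2"}  # consumers tracked in external object B

# t1: the scheduler admits pod-2 based on object B, without adding an
#     entry to claim["reserved_for"].
scheduler_admits_pod2 = "pod-2" in members_of_b and claim["allocated"]

# t2: pod-1 terminates; the controller sees reserved_for become empty
#     and deallocates, unaware that pod-2 was just admitted.
claim["reserved_for"].remove("pod-1")
if not claim["reserved_for"]:
    claim["allocated"] = False

# Result: pod-2 is scheduled against a claim with no valid allocation.
assert scheduler_admits_pod2 and not claim["allocated"]
```

With pods listed directly in reservedFor, step t1 and step t2 would conflict on the same object and one update would fail; splitting the state across two objects is what opens the window.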

pohly avatar May 31 '25 15:05 pohly

It would be interesting to implement a controller for something like kubernetes-sigs/jobset#853 to see how it will work. I might look at it depending on the timeline for this.

The problem in this case will be kubelet: how does it know that pods of the jobset are allowed to use the claim?

Looking at this again, the proposal here would have the controller patch the pod spec, meaning that the pods will reference the ResourceClaim in the normal way. I thought this proposal had overlap with this issue, but I no longer think that is true. There are some interesting/challenging aspects in the proposal, but it is probably best handled separately from this issue.

Could you give more details on this?

Simple race: multiple pods share the same claim A. The scheduler adds a new pod during scheduling to object B which complements claim A, while concurrently the ResourceClaim controller observes the termination of the last consumer of claim A and deallocates it. Now the pod is scheduled without a valid allocation.

So going back to the question of whether we can avoid listing all pods in the ReservedFor list, it seems like there are three challenges that will need to be addressed:

  • Handle the kubelet check that prevents pods from running with a claim unless they are listed in ReservedFor. The ideas in https://github.com/kubernetes/enhancements/issues/5194#issuecomment-2715318412 work around this by skipping this step in certain situations.
  • Handle deallocation of the ResourceClaim. We need a way to safely deallocate a ResourceClaim, per your comment above. For the ideas in https://github.com/kubernetes/enhancements/issues/5194#issuecomment-2715318412 this is handled by leaving it up to a separate controller to signal this by removing the reference in the ReservedFor list.
  • Discover all pods using the claim for eviction. This wasn't discussed in https://github.com/kubernetes/enhancements/issues/5194#issuecomment-2715318412, but this is an important feature that needs to work especially for large workloads. I think there are two things the device_taint_eviction controller will need to do to handle this:
    • Find all pods referencing the claim.
    • Determine which of those pods are "using" the ResourceClaim. At least today, with the ReservedFor list, it is possible for a pod to have a reference to a ResourceClaim without (yet) being listed in ReservedFor. But it might be that it doesn't matter for eviction.

The ideas in https://github.com/kubernetes/enhancements/issues/5194#issuecomment-2715318412 add non-pod references in the ReservedFor list. I think it can make sense to think of these as consumers of the ResourceClaim, in particular if it references a workload resource like Deployment or StatefulSet, where we can think of them as "owning" a set of pods. But they are also different from pods in important ways. Could it be useful to think of pods as direct consumers of a ResourceClaim and other types as more of an indirect consumer?
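Discovering all pods using a claim without ReservedFor entries would amount to a reverse index built from the informer cache, something like this illustrative Python sketch (field names are simplified stand-ins for the pod spec):

```python
from collections import defaultdict

def build_claim_index(pods):
    """Map claim name -> names of pods referencing it, from cached pod specs."""
    index = defaultdict(set)
    for pod in pods:
        # "resourceClaims" here abbreviates the pod's claim references.
        for claim_name in pod.get("resourceClaims", []):
            index[claim_name].add(pod["name"])
    return index

pods = [
    {"name": "w-0", "resourceClaims": ["shared-claim"]},
    {"name": "w-1", "resourceClaims": ["shared-claim"]},
    {"name": "other", "resourceClaims": []},
]
index = build_claim_index(pods)
assert index["shared-claim"] == {"w-0", "w-1"}
```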

mortent avatar Jun 02 '25 20:06 mortent

Handle the kubelet check that prevents pods from running with a claim unless they are listed in ReservedFor. The ideas in https://github.com/kubernetes/enhancements/issues/5194#issuecomment-2715318412 work around this by skipping this step in certain situations.

I forgot to mention that the scheduler does the same check for pods referencing an allocated claim.

I think there are two things the device_taint_eviction controller will need to do to handle this: Find all pods referencing the claim. [...]

I suppose that can be done. It implies doing a bit more work, but it's all local in an informer cache.

Could it be useful to think of pods as direct consumers of a ResourceClaim and other types as more of an indirect consumer?

I think these consumers are all "direct" consumers, whether they are pods or something else. It's just the pods which sometimes consume the claims indirectly.

pohly avatar Jun 04 '25 09:06 pohly

Hi @johnbelamaric :wave:, v1.34 Enhancements team here.

This is just a friendly reminder of the upcoming PRR Freeze on Thursday 12th June 2025.

To help ensure that the PRR team has enough time to review your KEP before the Enhancements Freeze on Friday 20th June 2025, it is important that a PR is opened in k/enhancements by the June 12th deadline with the following:

  • The KEP's PRR questionnaire filled out.
  • The kep.yaml updated with the stage, latest-milestone, and milestone struct filled out.
  • A PRR approval file with the PRR approver listed for the stage the KEP is targeting.

Completing these items by the deadline is key for v1.34 release progression. For more information on the PRR process, see here.

Thanks for your continued work on this KEP! Your contributions are greatly appreciated, SIG Release v1.34 Enhancements Team

jmickey avatar Jun 05 '25 12:06 jmickey

Hello @johnbelamaric & @mortent 👋, v1.34 Enhancements team here.

Just checking in as we approach enhancements freeze at 21:00 UTC (14:00 PST) on Friday 20th June 2025.

This enhancement is targeting stage alpha for v1.34 (correct me if otherwise).

Here's where this enhancement currently stands:

  • [ ] KEP README using the latest template has been merged into the k/enhancements repo.
  • [ ] KEP status is marked as implementable for latest-milestone: "v1.34".
  • [ ] KEP README has up-to-date graduation criteria
  • [ ] KEP has a production readiness review that has been completed and merged into k/enhancements. (For more information on the PRR process, check here). If your production readiness review is not completed yet, please make sure to fill the production readiness questionnaire in your KEP by the PRR Freeze deadline on Thursday 12th June 2025 so that the PRR team has enough time to review your KEP.

The status of this enhancement is marked as At risk for enhancements freeze. Please keep the issue description up-to-date with appropriate stages as well.

I do see that you have an open PR (#5379), I will be checking back in regularly in the lead up to the enhancements freeze and will keep this comment updated as things progress.

If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you!

Thanks for your continued work on this KEP! SIG Release v1.34 Enhancements Team

jmickey avatar Jun 10 '25 23:06 jmickey

/milestone v1.34

jmickey avatar Jun 11 '25 01:06 jmickey

Hey @johnbelamaric & @mortent! 👋

A friendly reminder again that Enhancements Freeze is approaching very soon at 21:00 UTC on Friday 20th June 2025.

I see there is active progress going for the PR, I will continue to check back leading up to the enhancements freeze in case there is any further progress.

The status of this enhancement is marked as At risk for enhancements freeze. If you anticipate missing enhancements freeze, you can file an exception request in advance.

Thank you!

jmickey avatar Jun 17 '25 18:06 jmickey

Hi @mortent 👋

Based on your comment here I am moving this KEP to status Deferred for the v1.34 release.

jmickey avatar Jun 20 '25 20:06 jmickey

/milestone clear

jmickey avatar Jun 20 '25 20:06 jmickey

There is a proposal to add a PodGroup or something like it to support Gang Scheduling, in #4671.
Each pod has an object reference to a PodGroup object. There could be thousands of pods in the PodGroup. It looks something like this:

apiVersion: scheduler.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: foo-pg
spec:
  minMember: 1000
---
kind: Job
apiVersion: batch/v1
metadata:
  name: myjob2
  namespace: default
spec:
  parallelism: 1000
  completions: 1000
  completionMode: "Indexed"
  resourceClaims:
    - name: logical-multi-node-resource
      resourceClaimName: lmnr-claim # <--- Yes, I really mean resourceClaim not resourceClaimTemplate.
  template:
    spec:
      podGroupRef: foo-pg   # <---- All pods refer to "foo-pg" showing that they are in the same group.
      containers:
      # etc...

If all pods in the Gang want the same ResourceClaim, then maybe the PodGroup itself could make the ResourceClaim. That would look like this:

apiVersion: scheduler.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: foo-pg
spec:
  minMember: 1000
  resourceClaims:            # <-- Moved from the Job to here.
    - name: logical-multi-node-resource
      resourceClaimName: lmnr-claim
---
kind: Job
apiVersion: batch/v1
metadata:
  name: myjob2
  namespace: default
spec:
  parallelism: 1000
  completions: 1000
  completionMode: "Indexed"
  template:
    spec:
      podGroupRef: foo-pg   # <---- All pods refer to "foo-pg" showing that they are in the same group.
      containers:
      # etc...

ResourceClaims that were attached to PodGroups would:

  1. be available to any pod in the pod group
  2. check their pool size against the size of the pod group (maybe the PodGroup promises a maximum size?)
  3. not need to list every pod the claim is allocated to; ReservedFor would just reference the PodGroup
  4. often not need per-node device configuration.

@johnbelamaric

The benefit of doing this over doing it for workloads is that PodGroup seems likely to be part of kube-scheduler sooner than kube-scheduler will actually be informed of all kinds of workloads (Job, StatefulSet, etc.).

erictune avatar Aug 25 '25 20:08 erictune

The benefit of doing this over doing it for workloads is that PodGroup seems likely to be part of kube-scheduler sooner than kube-scheduler will actually be informed of all kinds of workloads (Job, StatefulSet, etc.).

Yes, this is great. PodGroup is a more primitive grouping than a workload; we probably don't need to do it for workloads if we do it for PodGroup, assuming we can translate a workload into a PodGroup.

The tricky bit with ReservedFor Workload is backtracking from the workload to the Pods to the Devices, especially for tainting. @mortent would having it just be "pod" or "podgroup" simplify this?

johnbelamaric avatar Aug 25 '25 23:08 johnbelamaric

We are discussing on #4671 that PodGroups would be created for workloads that need gang scheduling. It looks like most workloads that need gang scheduling would be 1:1 Workload to PodGroup. However, LeaderWorkerSet is an exception: it would probably have N PodGroups per workload.
Also, workloads that don't need gang scheduling (a classic CPU-based microservice deployment) would not have a PodGroup, at least as currently proposed.

erictune avatar Aug 26 '25 00:08 erictune

The main challenge with not having pods in the ReservedFor field is deallocation, i.e. when the devices allocated to a ResourceClaim can be released. Today this happens in two ways:

  • When the ResourceClaim is deleted.
  • When the resourceclaim-controller sees that a ResourceClaim with allocated devices has an empty ReservedFor field.

The latter situation becomes more complicated if we have references to non-Pod resources, since we get a race between scheduling of new pods using the ResourceClaim and deallocation.

I think the obvious way to leverage the PodGroup to solve this issue would be to use it as the only non-Pod reference allowed in the ReservedFor field. It would then be the responsibility of either the podgroup-controller (or the resourceclaim-controller, if we make it aware of the PodGroup resource) to add the PodGroup reference to the ReservedFor field and to remove it once it is safe for the ResourceClaim to be deallocated. This is essentially the solution proposed in the KEP, but the PodGroup might simplify it if we only need to add support for this one resource type.

But will a hypothetical podgroup-controller have information about creation of pods, i.e. will it be able to know when no pods referencing a ResourceClaim are still running and no additional pods will be scheduled for the workload, so the ResourceClaim can be safely deallocated? My impression is that this will still be handled by the workload controllers. And I'm not sure if support for PodGroup in the scheduler will help with deallocation.

@johnbelamaric Finding the Pods for a particular PodGroup seems doable, but I'm not sure if it is any easier than just finding all pods that directly reference a ResourceClaim. It seems like in both cases we need some kind of reverse index.

mortent avatar Sep 15 '25 17:09 mortent