training-operator icon indicating copy to clipboard operation
training-operator copied to clipboard

feat: Add the manager field to the podTemplateOverride object

Open kaisoz opened this issue 3 months ago โ€ข 24 comments

What this PR does / why we need it:

This PR adds a manager field to the podTemplateOverrides object, indicating who created (owns) the object. This is useful for external controllers such as Kueue to identify which entries they added. This manager field is optional and is defaulted by a mutating webhook, which sets the request username for podTemplateOverrides that donโ€™t have a manager.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged): Fixes #2856

Checklist:

  • [ ] Docs included if any changes are user facing

kaisoz avatar Dec 04 '25 22:12 kaisoz

Pull Request Test Coverage Report for Build 21167022709

Details

  • 14 of 25 (56.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.1%) to 51.495%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/webhooks/trainjob_webhook.go 14 19 73.68%
pkg/util/cert/cert.go 0 6 0.0%
<!-- Total: 14 25
Totals Coverage Status
Change from base Build 21161934063: 0.1%
Covered Lines: 1257
Relevant Lines: 2441

๐Ÿ’› - Coveralls

coveralls avatar Dec 04 '25 23:12 coveralls

cc @tenzen-y @andreyvelich @astefanutti @mimowo

I also tested this manually in kind by deploying a Trainjob with a podTemplateOverride entry in a cluster with Kueue installed. The mutating webhook added manager=kubernetes-admin to my podTemplateOverride object, and manager=system:serviceaccount:kueue-system:kueue-controller-manager (the Kueue service account) to the ones added by Kueue.

kaisoz avatar Dec 04 '25 23:12 kaisoz

The failing test seems to be a flake (Failed executing command with error: build process: clone golangci-lint: git clone --branch v2.4.0 --single-branch --depth 1 -c advice.detachedHead=false -q https://github.com/golangci/golangci-lint.git: exit status 128) Could anyone re-run it for me? Thanks!

kaisoz avatar Dec 04 '25 23:12 kaisoz

/retest

astefanutti avatar Dec 05 '25 07:12 astefanutti

/retest

I thought I couldn't do this here since the repo uses GH actions... ๐Ÿคฆ๐Ÿป

kaisoz avatar Dec 05 '25 07:12 kaisoz

/retest

kaisoz avatar Dec 05 '25 09:12 kaisoz

@kaisoz: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Dec 05 '25 09:12 google-oss-prow[bot]

/retest

kaisoz avatar Dec 05 '25 09:12 kaisoz

@kaisoz: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Dec 05 '25 09:12 google-oss-prow[bot]

@kaisoz: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Dec 05 '25 09:12 google-oss-prow[bot]

@kaisoz: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

@astefanutti prow doesn't trust me ๐Ÿ˜† . I did run the clone line in machine and it works, I expect it to work now in CI..

kaisoz avatar Dec 05 '25 09:12 kaisoz

/ok-to-test

/restest

astefanutti avatar Dec 05 '25 09:12 astefanutti

/retest

kaisoz avatar Dec 07 '25 17:12 kaisoz

@kaisoz there is an issue with CI that should be fixed after you rebase this on top of #3023.

astefanutti avatar Dec 08 '25 09:12 astefanutti

@kaisoz there is an issue with CI that should be fixed after you rebase this on top of #3023.

great @astefanutti ! Thanks! Then I think I'll wait for #3023 to be merged and then rebase ๐Ÿ˜Š

kaisoz avatar Dec 08 '25 10:12 kaisoz

@kaisoz #3023 has been merged if you want to rebase on top.

astefanutti avatar Dec 09 '25 07:12 astefanutti

/retest

kaisoz avatar Dec 09 '25 10:12 kaisoz

@astefanutti all green now! PTAL whenever you have time. Thanks! ๐Ÿ™‚

kaisoz avatar Dec 09 '25 10:12 kaisoz

@kaisoz that looks good to me overall, thanks!

I understand the webhook would automatically set the manager field to "mimic" SSA behavior, but just to be sure we don't introduce too much complexity nor "re-invent" the wheel, would having Kueue set the manager field directly in the PodTemplateOverride be working just as good?

Hey @astefanutti thanks for the review! Yes! it does mimic SSA but a more fine-grained entry level. And I agree, Kueue can just set the manager field directly, but having a mutating webhook adds some benefits:

  1. Allows the manager field to be optional. Users don't need to manually set the manager, the webhook does it automatically
  2. Allows to set +listMap=map with +listMapKey=manager so that entries are modified individually (also comes with better observability because managedFields will show what entries were modified instead of the whole list)

2 needs a stable key, that's where the webhook comes into play (it ensures that the field will never be empty). However, I just thought that for better integrity, maybe the manager field should always be set by the webhook so that a user can't claim a controller identity...

WDYT? If you still think it's too much I can drop the webhook but I still think it adds some benefits ๐Ÿ˜Š

kaisoz avatar Dec 10 '25 11:12 kaisoz

@kaisoz thanks, that makes sense. The complexity is reasonable enough and it has some benefits as you've highlighted.

@andreyvelich @tenzen-y do you have any opinions?

astefanutti avatar Dec 11 '25 12:12 astefanutti

I like the idea of defaulting the manager to req.UserInfo.Username if it is empty, however will it introduce any problems in the future if we decide to make PodTemplateOverrides mutable for TrainJob?

andreyvelich avatar Jan 13 '26 20:01 andreyvelich

I like the idea of defaulting the manager to req.UserInfo.Username if it is empty, however will it introduce any problems in the future if we decide to make PodTemplateOverrides mutable for TrainJob?

I might be missing something, but PodTemplateOverrides are already mutable when a TrainJob is suspended. Is there an additional kind of mutability youโ€™re referring to here?

With this implementation, manager is only defaulted at creation time if it is unset. On updates, the existing value is preserved. That said, this still allows any user to explicitly change manager and claim an entry that was originally set by a controller. To avoid this, one option would be to make the field immutable and disallow changes after creation.

kaisoz avatar Jan 15 '26 21:01 kaisoz

I might be missing something, but PodTemplateOverrides are already mutable when a TrainJob is suspended. Is there an additional kind of mutability youโ€™re referring to here?

Do we know if Kueue requires to have this field mutable for suspended TrainJobs?

andreyvelich avatar Jan 16 '26 15:01 andreyvelich

Do we know if Kueue requires to have this field mutable for suspended TrainJobs?

Yes, before resuming a job, Kueue injects its scheduling info via PodTemplateOverrides. With this change, Kueue would either set the appropriate manager on each newly added entry, or leave it empty so the webhook can default it to its system account.

kaisoz avatar Jan 19 '26 20:01 kaisoz

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from tenzen-y. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow[bot] avatar Jan 19 '26 21:01 google-oss-prow[bot]