
feature: rebase to Kubernetes 1.32.0

Open embik opened this issue 1 year ago • 10 comments

Feature Description

Kubernetes 1.32.0 is slated to come out tomorrow. We should rebase kcp to that new version. A final changelog isn't out yet, so it likely makes sense to wait for that to understand what changes to expect from the rebase.

Proposed Solution

Follow the rebase guide.

Alternative Solutions

No response

Want to contribute?

  • [ ] I would like to work on this issue.

Additional Context

No response

embik avatar Dec 10 '24 09:12 embik

/assign gman0

embik avatar Dec 10 '24 12:12 embik

@embik: GitHub didn't allow me to assign the following users: gman0.

Note that only kcp-dev members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/assign gman0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kcp-ci-bot avatar Dec 10 '24 12:12 kcp-ci-bot

Taking (I would self-/assign but that probably won't work for me at the moment because of perms).

gman0 avatar Dec 10 '24 12:12 gman0

/assign gman0

embik avatar Dec 10 '24 12:12 embik

WIP branches for kcp-dev/kubernetes:

  • https://github.com/gman0/kubernetes/tree/kcp-1.32-baseline
  • https://github.com/gman0/kubernetes/tree/kcp-1.32-pre.1-deps
  • https://github.com/gman0/kubernetes/tree/kcp-1.32-pre.2-fixes

gman0 avatar Jan 27 '25 13:01 gman0

WIP branch for kcp-dev/kcp:

  • https://github.com/gman0/kcp/tree/1.32-bump
  • https://github.com/gman0/kcp/tree/1.32-bump-pre.1

Also working on updating the rebasing docs:

  • https://github.com/gman0/kcp/tree/update-rebase-docs

gman0 avatar Jan 27 '25 13:01 gman0

https://github.com/gman0/kubernetes/commit/3ecc3a49cd2572a1bc9404d5d1aed30e8bf25fc6

TL;DR: we may need to wire a context into the asPartialObjectMetadataList function. I'll try it out and we can discuss it once the PR is out.

gman0 avatar Jan 27 '25 13:01 gman0

https://github.com/gman0/kubernetes/commit/0f81a9ee821009644411367a519df3571011e580

The new changes in upstream's GC controller may cause some issues for us. I'll try to explain the issue on garbagecollector.Sync because we don't have the new version of ResyncMonitors just yet, and the hack/kcp/garbage_collector_patch.go generator just copy-pastes that anyway.

  • New: https://github.com/kubernetes/kubernetes/blob/v1.32.0/pkg/controller/garbagecollector/garbagecollector.go#L175-L246
  • Old: https://github.com/kubernetes/kubernetes/blob/v1.31.0/pkg/controller/garbagecollector/garbagecollector.go#L175-L283

Main changes:

  • The new version no longer has locks for the workers ((GarbageCollector struct).workerLock doesn't exist anymore). It seems garbagecollector.Sync now relies on retrying and hoping things will be in sync eventually.
  • The new version also no longer retries gc.resyncMonitors and cache.WaitForNamedCacheSync inside wait.PollImmediateUntilWithContext; it relies instead on the global retries of the parent wait.UntilWithContext loop.

Consequences:

  • If we (kcp) fail in gc.resyncMonitors or cache.WaitForNamedCacheSync, we won't be able to retry until the next LogicalCluster event in the GC controller invokes ResyncMonitors.
  • The lack of worker locks means we may fail to garbage-collect some resources and would only find out on the next LogicalCluster event.

Now what?

  • Why are we not reconciling like upstream does, with a periodic resync? Is it too expensive? If not, shouldn't we also implement some time-based resync (as a replacement for, or in addition to, the event-based one)?
  • If the answer to the bullet point above is no, one alternative would be to revert upstream's https://github.com/kubernetes/kubernetes/commit/e8b1d7dc24713db99808028e0d02bacf6d48e01f just to keep the old behavior until we figure out a better way to handle polling (or some other way to trigger GC).
  • However, that commit fixes important bugs, and we are likely interested in including it too:
    • https://github.com/kubernetes/kubernetes/issues/101078
    • https://github.com/kubernetes/kubernetes/issues/127105

What we've decided:

  • We're going with the lock-full version, https://github.com/gman0/kubernetes/commit/28ac217b93a7b49caa5e7382c5dd01919834df7b.
    • This is to protect GC from missing events.
    • TODO: let's work toward a lock-less version
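A minimal sketch of the lock-full pattern being kept (names are illustrative, not the real GarbageCollector struct): workers hold a read lock while processing, and the resync path takes the write lock, so no event is processed against a half-resynced monitor set.

```go
package main

import (
	"fmt"
	"sync"
)

// gcSketch models the "lock-full" approach: workerLock plays the role of
// the removed (GarbageCollector struct).workerLock.
type gcSketch struct {
	workerLock sync.RWMutex
	monitors   []string
}

// processItem is what each worker goroutine would run per queue item.
func (g *gcSketch) processItem(item string) string {
	g.workerLock.RLock()
	defer g.workerLock.RUnlock()
	return fmt.Sprintf("processed %s with %d monitors", item, len(g.monitors))
}

// resyncMonitors swaps the monitor set; taking the write lock waits for all
// in-flight workers to drain first.
func (g *gcSketch) resyncMonitors(newMonitors []string) {
	g.workerLock.Lock()
	defer g.workerLock.Unlock()
	g.monitors = newMonitors
}

func main() {
	g := &gcSketch{monitors: []string{"pods"}}
	fmt.Println(g.processItem("pod-a")) // processed pod-a with 1 monitors
	g.resyncMonitors([]string{"pods", "configmaps"})
	fmt.Println(g.processItem("pod-b")) // processed pod-b with 2 monitors
}
```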

gman0 avatar Jan 27 '25 13:01 gman0

https://github.com/gman0/kcp-code-generator/commit/e65054c73f6315664236041ef63e850fd217a808

The build constraint !ignore_autogenerated in templates caused the generated files not to be picked up by the tools in https://github.com/kubernetes/code-generator/tree/master/cmd, specifically when re-generating deepcopy functions as part of hack/update-codegen.sh.

Removing these constraints fixed the issue.
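For context, the generated file headers carried a constraint along these lines (illustrative header, not a verbatim copy from the templates; the `// +build` line is the pre-Go-1.17 spelling of the same constraint):

```go
//go:build !ignore_autogenerated
// +build !ignore_autogenerated

// Code generated by kcp-code-generator. DO NOT EDIT.

package v1alpha1
```

Files with this constraint are excluded from any build that sets the ignore_autogenerated tag, which is why tools building with that tag no longer saw them.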

TODO:

  • It's not yet clear why this worked, with the constraints in place, when publishing our k8s rebase to v1.31.0. Generating the same v1.31.0 code now fails, however (could some external dependency have been updated?).
  • The build constraint seems to have been introduced as part of the very first PR: https://github.com/kcp-dev/code-generator/pull/1/files#diff-8f8bddc9a9b92d14d0cbdd05deaace70831ec9de0879b4b5b1bd99bc5f41a81eR21-R22. I still need to check more thoroughly why it was there in the first place.

gman0 avatar Jan 27 '25 14:01 gman0

https://github.com/gman0/kcp/commit/ddbe24723a5ad38771f700da7d73334dfdf166d5

This commit removes v1alpha1 ValidatingAdmissionPolicy from informers and APIExport builtin types.

It seems it was added in 1f2244a0594660c98bbf5f4c0dfeaafe2315e0eb, c136c4ab01e6fd3570806b55cfd0c88e7707173e and 666c6a3f7f90c6d26dee0412889591c445ce56d3. This was back in 2023, when we were basing kcp on k8s 1.26; at that time VAP was indeed just being introduced and was served from admissionregistration.k8s.io/v1alpha1. Since v1alpha1 APIs are disabled by default, c136c4ab01e6fd3570806b55cfd0c88e7707173e added a runtime config override:

https://github.com/kcp-dev/kcp/blob/4bd4122483ff418a58315389b4f872dcaec0fe7c/pkg/server/options/options.go#L133-L136

With the k8s v1.32 rebase, however, the v1alpha1 APILifecycle policy marks VAP as removed:

https://github.com/kubernetes/kubernetes/blob/v1.32.0/staging/src/k8s.io/api/admissionregistration/v1alpha1/zz_generated.prerelease-lifecycle.go#L108-L112

This means the kind is no longer served. Because of that, it was necessary to remove the v1alpha1 VAP code for our 1.32 rebase; otherwise it caused issues with our informers:

I0214 00:47:29.166169 1915080 deleted_kinds.go:169] Removing resource validatingadmissionpolicybindings.v1alpha1.admissionregistration.k8s.io because it is time to stop serving it per APILifecycle.
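A toy model of the lifecycle gate driving that log line (illustrative types only; the real mechanism is the generated APILifecycleIntroduced/Deprecated/Removed methods in the zz_generated.prerelease-lifecycle files, keyed on the compiled-in server version):

```go
package main

import "fmt"

// version is a hypothetical stand-in for the apimachinery version type.
type version struct{ major, minor int }

func (v version) atLeast(o version) bool {
	return v.major > o.major || (v.major == o.major && v.minor >= o.minor)
}

// Per the linked zz_generated file, v1alpha1 ValidatingAdmissionPolicy is
// removed as of 1.32 (illustrative constant, mirroring the issue's claim).
var vapV1alpha1Removed = version{1, 32}

// served models the apiserver's check: stop serving a kind once the current
// release reaches its removal version.
func served(current version) bool {
	return !current.atLeast(vapV1alpha1Removed)
}

func main() {
	fmt.Println(served(version{1, 31})) // true
	fmt.Println(served(version{1, 32})) // false — matches the informer log above
}
```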

gman0 avatar Feb 17 '25 09:02 gman0

Hi folks, reviving this issue with the latest updates:

There were only minor changes compared to https://github.com/kcp-dev/kubernetes/pull/161

gman0 avatar Apr 28 '25 07:04 gman0