
Upgrade to use Kustomize v4+

zijianjoy opened this issue 4 years ago • 41 comments

From the README of kubeflow/manifests: https://github.com/kubeflow/manifests#prerequisites, Kustomize v3.2.0 is a prerequisite for building resources. However, Kustomize v4+ is available. Can we upgrade the prerequisite to use the new version of Kustomize?
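
For context, the README pins the binary to that exact release. Installing the pinned version looks something like this (asset name taken from the v3.2.0 GitHub release; Linux amd64 assumed):

    # Download the pinned kustomize v3.2.0 binary (adjust for your OS/arch)
    curl -fLo kustomize https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
    chmod +x kustomize
    ./kustomize version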

One thing that needs to change: the load_restrictor flag was renamed in Kustomize v4+, so we need to make the following updates in docs and build commands:

Change from --load_restrictor=none to --load-restrictor LoadRestrictionsNone. Note the difference between _ and -.
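
Side by side, using the repo's example folder as an illustrative build target:

    # Kustomize v3 (underscore, =value syntax):
    kustomize build --load_restrictor=none example
    # Kustomize v4+ (hyphen, named value):
    kustomize build --load-restrictor LoadRestrictionsNone example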

zijianjoy avatar Apr 05 '21 17:04 zijianjoy

@zijianjoy thanks a lot for opening this issue to switch our target version to the latest Kustomize v4. This was not in our plan for 1.3.0, because we didn't want to change too many things at once. But it can very well be in our plans for the point release or 1.4. Let's use this issue to track the effort.

yanniszark avatar Apr 05 '21 18:04 yanniszark

I am using Kustomize v4.0.5 for https://github.com/kubeflow-onprem/ArgoFlow and everything works as expected. The --load_restrictor flag is also not necessary, except for the Katib manifests, which should be fixed; I already have a PR open for this. For the Argo installation I am not using load_restrictor (for Katib I am using my fixed version of the manifests), so this can be seen as proof that the flag is not necessary for deploying Kubeflow.

@zijianjoy One interesting anomaly I did encounter: running kustomize build github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-multi-user-pns works for kustomize 3.2.1 and 4.0.5, but not for 3.9.4 (none of the other manifests had any issue with 3.9.4, the default for Argo CD v2). There was a breaking change in how remote bases are fetched in kustomize between v3 and v4 (s3 isn't supported anymore), so this is something to keep in mind (not that I think we should update to an outdated version, though).

davidspek avatar Apr 06 '21 19:04 davidspek

This issue burns me over and over. I am used to installing the latest version of tools, and I ran into this again while trying to work out the latest release configs for Azure.

As far as I can tell, 3.9.2+ is also broken with Knative:

kustomize build --load_restrictor=none common/knative/knative-eventing-install/base

errors with

... at path 'spec/template/metadata/labels': wrong Node Kind for spec.template.metadata.labels expected: MappingNode was AliasNode: value: {*ref_0}

The latest versions that work for me are 3.9.1 and 3.8.10. I installed and tested about 20 different versions.
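
For context, the *ref_0 in the error is a YAML alias node: the affected manifests reuse the selector labels via an anchor, and newer kustomize fails when a transformer has to edit the aliased field. A rough reproduction sketch (an illustrative Deployment, not the actual Knative YAML):

    mkdir -p /tmp/alias-repro && cd /tmp/alias-repro
    cat > deployment.yaml <<'EOF'
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: demo
    spec:
      selector:
        matchLabels: &labels    # anchor
          app: demo
      template:
        metadata:
          labels: *labels       # alias at spec/template/metadata/labels
        spec:
          containers:
          - name: demo
            image: nginx
    EOF
    cat > kustomization.yaml <<'EOF'
    resources:
    - deployment.yaml
    # commonLabels forces kustomize to edit the aliased labels field
    commonLabels:
      app.kubernetes.io/part-of: demo
    EOF
    kustomize build .    # 3.9.2+ reports the AliasNode error here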

berndverst avatar Apr 07 '21 02:04 berndverst

FYI @DavidSpek, with 4.0.5 Knative is also broken as-is:

 ~/src/kubeflow/manifests-1.3-branch:  kustomize build --load-restrictor LoadRestrictionsNone common/knative/knative-eventing-install/base
wrong Node Kind for spec.template.metadata.labels expected: MappingNode was AliasNode: value: {*ref_0}

@yanniszark I really think this is a release-blocking issue, because neither the latest 3.x version nor any 4.x version of kustomize works. On the one hand we are telling people that kfdef is no longer the recommended tool and to use kustomize instead; on the other hand, we don't support any modern version of kustomize.

berndverst avatar Apr 07 '21 02:04 berndverst

@yanniszark with the PR to remove pointers from the Knative YAML, everything seems to be working fine.

I can generate everything with v4.0.5 by running from the example folder you provided:

kustomize build --load-restrictor LoadRestrictionsNone .

berndverst avatar Apr 07 '21 04:04 berndverst

@berndverst You are correct. I had forgotten about this one, but I had already created a PR to fix this in the Knative manifests 4 days ago.

https://github.com/kubeflow/manifests/pull/1795

Also, you don't need the --load-restrictor LoadRestrictionsNone flag, so you shouldn't be using it: it disables a security check in kustomize.
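
To illustrate what that check guards against (paths hypothetical): by default kustomize refuses to read files from outside the kustomization root, so a layout like the following only builds with the flag.

    # overlay/kustomization.yaml references a patch outside its own directory:
    #   resources:
    #   - deployment.yaml
    #   patchesStrategicMerge:
    #   - ../../shared/patch.yaml    # outside the kustomization root
    kustomize build overlay/    # fails the default security check
    kustomize build --load-restrictor LoadRestrictionsNone overlay/    # builds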

davidspek avatar Apr 07 '21 06:04 davidspek

@DavidSpek I noticed this morning, when they were marked as dupes, that I had apparently created 2 PRs that were dupes of yours, sigh. At least we are getting those issues addressed now!

berndverst avatar Apr 07 '21 15:04 berndverst

@DavidSpek I found I needed the load-restrictor flag for Katib when I tried this yesterday. But I saw that this is also being addressed somewhere.

berndverst avatar Apr 07 '21 15:04 berndverst

@berndverst Indeed, Katib should be the only one. I am working with @andreyvelich to remove this requirement in https://github.com/kubeflow/katib/pull/1498.

On my branch with these fixes you can run:

kustomize build github.com/davidspek/katib/manifests/v1beta1/installs/katib-with-kubeflow-cert-manager?ref=fix-manifests

and it will build the manifests directly, with no need to download them locally.

davidspek avatar Apr 07 '21 16:04 davidspek

I want to add another data point that sticking to kustomize 3.2.0 is not a good idea.

See https://github.com/kubernetes-sigs/kustomize/issues/1500#issuecomment-768454468: starting from kubectl release 1.21, it will come with kustomize v3.9. In the not-too-distant future, people will be able to use a relatively new version of kustomize from kubectl. That can be an onboarding-experience improvement if kubectl is the only tool people need.
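
For example, the bundled version can be used without installing a separate kustomize binary:

    # Build with kubectl's built-in kustomize (v3.9.x as of kubectl 1.21):
    kubectl kustomize example/
    # ...or build and apply in one step:
    kubectl apply -k example/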

I don't think the urgency is that high, considering the speed at which people upgrade Kubernetes clusters; it could be enough that we start supporting later kustomize versions from Kubeflow 1.4 onwards.

Bobgy avatar Apr 08 '21 04:04 Bobgy

> @berndverst You are correct. I had forgotten about this one, but I had already created a PR to fix this in the Knative manifests 4 days ago.
>
> #1795
>
> Also, you don't need the --load-restrictor LoadRestrictionsNone flag, so you shouldn't be using it: it disables a security check in kustomize.

@DavidSpek I want to remind you why we decided to use load restrictor none; read https://bit.ly/kf_kustomize_v3. It was introduced mainly because we want people to build reusable patches that can be composed, instead of relying on overlays (which are an inheritance model and cannot be composed).

So far, I think https://kubectl.docs.kubernetes.io/guides/config_management/components/ is a feature designed to solve this problem -- allowing reusable composition of kustomization transformations without turning off the security check.

But I haven't used it much and I don't see much of the community using it; I wonder what people think about it.
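
For reference, a reusable patch expressed as a component might be laid out like this (a sketch with hypothetical names; the base/ directory and the patch file itself are elided):

    mkdir -p base components/custom-ca
    # components/custom-ca/kustomization.yaml -- the reusable patch:
    cat > components/custom-ca/kustomization.yaml <<'EOF'
    apiVersion: kustomize.config.k8s.io/v1alpha1
    kind: Component
    patchesStrategicMerge:
    - add-ca-volume.yaml
    EOF
    # A consumer composes it on top of a base, with no load-restrictor needed:
    cat > kustomization.yaml <<'EOF'
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
    - base
    components:
    - components/custom-ca
    EOF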

Bobgy avatar Apr 08 '21 04:04 Bobgy

@Bobgy I understand why the load-restrictor was useful, but currently only one component still needs it (and those manifests are better suited to overlays, if you ask me). Indeed, the components feature is something I ran into as well, but it cannot be used at this point because kustomize 3.2.1 is being used for some reason (the load restrictor can be used with newer versions, and all the manifests build with newer versions). The only reason the kustomize version matters at this point is that the load-restrictor flag changed in newer versions, and this flag is being used for all the manifests even though only the Katib manifests need it (for now). If the load-restrictor flag is removed from the instructions, the kustomize version pin can also be removed.

davidspek avatar Apr 08 '21 13:04 davidspek

Good point! Then I agree it's very useful to fix Katib and let everyone stop using the flag; that makes the manifests compatible between v3 and v4.

Bobgy avatar Apr 08 '21 23:04 Bobgy

@Bobgy For reference, here is the PR for updating the Katib manifests: https://github.com/kubeflow/katib/pull/1498.

davidspek avatar Apr 09 '21 09:04 davidspek

@berndverst we have merged all the linked PRs necessary in order to use kustomize 4.0.5. However, I have bumped into what seems to be a blocking issue in the latest kustomize: it orders admission webhooks (Validating/MutatingWebhookConfiguration) last. This means the installation is now prone to the following race condition:

  1. Pods are created before the istio injection webhook, thus they don't get a sidecar.
  2. Apps that integrate with CentralDashboard like KFP UI fail because they don't have a sidecar.

I tested the current example kustomization and it actually failed. Did you also encounter this problem? I will open an upstream kustomize issue on Monday.

yanniszark avatar Apr 09 '21 23:04 yanniszark

@yanniszark I think we had better break the installation up into several steps, to avoid the race condition in the first place. There is no guarantee how long each webhook server takes to start, either.

Bobgy avatar Apr 10 '21 00:04 Bobgy

I've deployed the manifests manually with Kustomize 4.0.5 as well and have only run into the chicken-and-egg situation a few times; rerunning the command immediately, as stated in the current instructions, solved it.
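
(The single-command install in the current instructions is a retry loop along these lines:)

    while ! kustomize build example | kubectl apply -f -; do
      echo "Retrying to apply resources"
      sleep 10
    done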

@yanniszark It looks like there is an existing discussion about the ordering; see https://github.com/kubernetes-sigs/kustomize/issues/821. In particular, this comment and the following one are about the ordering of MutatingWebhookConfigurations.

davidspek avatar Apr 10 '21 10:04 davidspek

> @yanniszark I think we had better break the installation up into several steps, to avoid the race condition in the first place. There is no guarantee how long each webhook server takes to start, either.

True, but this is not a problem with the current single-command installation, right? If a webhook Pod is not up yet, kubectl exits with a non-zero code and the command is retried.

> I've deployed the manifests manually with Kustomize 4.0.5 as well and have only run into the chicken-and-egg situation a few times; rerunning the command immediately, as stated in the current instructions, solved it.

@DavidSpek you are referring to the scenario described by @Bobgy, right? That is, trying to create an object while its registered webhook Service does not have any ready Pods yet.

In the scenario I described, I believe retrying won't help, as the Pods need to be recreated in order to be injected with the Istio sidecar.

> @yanniszark It looks like there is an existing discussion about the ordering; see kubernetes-sigs/kustomize#821. In particular, this comment and the following one are about the ordering of MutatingWebhookConfigurations.

Yeap! I am currently going through the relevant kustomize PRs / issues and I will open an upstream issue to discuss.

yanniszark avatar Apr 12 '21 11:04 yanniszark

@Bobgy @yanniszark The PR to make Katib’s manifests not need the load-restrictor (plus some other enhancements/optimizations) has just been merged: https://github.com/kubeflow/katib/pull/1498. It still needs to be cherry-picked into the current release branch. After that is done, the load-restrictor shouldn’t be needed for any of the Kubeflow manifests anymore, and v3/v4 should both work without issue. I’ve been deploying everything with 4.0.5 and without the load-restrictor flag.

davidspek avatar Apr 13 '21 07:04 davidspek

@yanniszark I think we are talking about the same problem: if Istio sidecars are not injected, retrying doesn't help.

So I suggest separating steps as necessary -- e.g. the Istio setup should finish before anything that may depend on it. It's OK not to separate steps for race conditions that can be resolved via retry.

Right now, it seems we need two steps (a rough sketch follows the list):

  1. apply istio & its dependencies
  2. apply other stuff
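
A minimal sketch of what that could look like; the component paths are illustrative, and the wait step is what actually closes the race:

    # Step 1: Istio and its dependencies:
    kustomize build common/istio-1-9-0/istio-crds/base | kubectl apply -f -
    kustomize build common/istio-1-9-0/istio-namespace/base | kubectl apply -f -
    kustomize build common/istio-1-9-0/istio-install/base | kubectl apply -f -
    # Wait until istiod is ready, so the injection webhook is actually serving:
    kubectl wait --for=condition=Ready pods --all -n istio-system --timeout=300s
    # Step 2: everything else:
    kustomize build example | kubectl apply -f -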

Bobgy avatar Apr 13 '21 08:04 Bobgy

@Bobgy for this:

> I think we had better break the installation up into several steps, to avoid the race condition in the first place

We already have the step-by-step instructions. From your later message, I see that you are talking about just 2 steps, instead of a full step-by-step installation. This is definitely a possible workaround, but there might still be some webhook configurations that are skipped because of kustomize's new ordering, so we need to look into that carefully if we decide to go down this path.

In addition, I have opened an upstream issue in kustomize and I'm testing a fix: https://github.com/kubernetes-sigs/kustomize/issues/3794

yanniszark avatar Apr 13 '21 11:04 yanniszark

Just for reference, I've not seen this issue with the Istio sidecars not being injected and causing problems when deploying with Argo CD. I'm not sure if this is just luck, or if Argo CD is handling this in some way behind the scenes.

davidspek avatar Apr 13 '21 12:04 davidspek

Are the manifests going to revert to referencing the 'upstream' manifests through URLs for the next release, or is this planned for further in the future? This would drastically reduce the overhead of pulling all the upstream manifests into this repository; it would simply be a link.

connorlwilkes avatar Apr 13 '21 13:04 connorlwilkes

> Are the manifests going to revert to referencing the 'upstream' manifests through URLs for the next release, or is this planned for further in the future? This would drastically reduce the overhead of pulling all the upstream manifests into this repository; it would simply be a link.

That is something that should be discussed after the 1.3 release. @connorlwilkes would you like to open an issue to propose this change and its benefits/drawbacks? cc @DavidSpek, who has also used this method in ArgoFlow. Since this issue is about using kustomize v4 and the issues blocking us from doing so, let's spin this discussion off into a separate issue.

yanniszark avatar Apr 13 '21 13:04 yanniszark

@yanniszark @connorlwilkes After the release I was also planning on starting this discussion. I think this will be particularly useful once I have Renovate properly configured for everything, as the only upkeep would then be merging auto-created PRs. Distributions would then simply fork the repo and add their customizations to it, similarly to how people are using my Argo installation now. But indeed, this is off-topic for this issue and should be discussed elsewhere.

davidspek avatar Apr 13 '21 13:04 davidspek

@yanniszark I've just done a deployment with Argo CD on a fresh Equinix Metal cluster, and Istio was failing to deploy because I hadn't set the necessary api-server flags. I see the following event for all the components related to the sidecar-injection conversation above; their deployments fail because of it, but after Istio comes up the deployments come up as well:

Error creating: Internal error occurred: failed calling webhook "sidecar-injector.istio.io": Post "https://istiod.istio-system.svc:443/inject?timeout=30s": dial tcp 10.98.192.83:443: connect: connection refused

davidspek avatar Apr 14 '21 13:04 davidspek

@DavidSpek thanks for the report. I don't think it's relevant to the issue described here (https://github.com/kubeflow/manifests/issues/1797#issuecomment-817017412), right?

yanniszark avatar Apr 14 '21 13:04 yanniszark

@berndverst I wanted to ask, since you mentioned in your email that:

> With that change the latest kustomize 4.0.5 seems to be working for me

Did you successfully deploy with kustomize 4.0.5 using the 1-command install and check that the web apps all work? I would have expected you to bump into https://github.com/kubernetes-sigs/kustomize/issues/3794

yanniszark avatar Apr 14 '21 15:04 yanniszark

@yanniszark The load_restrictor flag should probably be removed from the README now that the Katib manifests PR has been merged.

davidspek avatar Apr 14 '21 15:04 davidspek

> @yanniszark The load_restrictor flag should probably be removed from the README now that the Katib manifests PR has been merged.

Thanks! Want to create a quick PR? I can review/merge.

yanniszark avatar Apr 14 '21 15:04 yanniszark