pkg/kube: remove managed fields before SSA

[Open] stevekuznetsov opened this issue 3 months ago • 14 comments

Server-side apply patches cannot contain managed field data; we must zero-out this field before sending the patch.

$ helm status --namespace hypershift hypershift --show-desc
NAME: hypershift
LAST DEPLOYED: Tue Oct  7 08:28:25 2025
NAMESPACE: hypershift
STATUS: failed
REVISION: 1
DESCRIPTION: Release "hypershift" failed: metadata.managedFields must be nil && metadata.managedFields must be nil && metadata.managedFields must be nil && metadata.managedFields must be nil && metadata.managedFields must be nil
TEST SUITE: None

On install (output from my logger, running Helm code):

[08:09:10.093] DEBUG: Error patching resource {
  "error": "metadata.managedFields must be nil",
  "gvk": "apiextensions.k8s.io/v1, Kind=CustomResourceDefinition",
  "name": "routes.route.openshift.io",
  "namespace": ""
}
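
For reference, a minimal sketch of the fix being described, not the actual Helm code path: the helper name (applyServerSide), the dynamic client, and the unstructured payload are illustrative assumptions. The point is only that metadata.managedFields is cleared on the payload before it is sent as an apply patch:

package kubeutil

import (
	"context"
	"encoding/json"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// applyServerSide is a hypothetical helper that sends obj as a server-side
// apply patch. The API server rejects apply payloads that carry
// metadata.managedFields ("metadata.managedFields must be nil"), so the
// field is cleared before marshaling.
func applyServerSide(ctx context.Context, c dynamic.Interface, gvr schema.GroupVersionResource, obj *unstructured.Unstructured, manager string) error {
	obj = obj.DeepCopy()
	obj.SetManagedFields(nil)

	data, err := json.Marshal(obj)
	if err != nil {
		return err
	}
	force := true
	_, err = c.Resource(gvr).Namespace(obj.GetNamespace()).Patch(
		ctx, obj.GetName(), types.ApplyPatchType, data,
		metav1.PatchOptions{FieldManager: manager, Force: &force})
	return err
}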

stevekuznetsov commented Oct 07 '25

Should we add a regression test?

benoittgt commented Oct 07 '25

@robertsirc sure thing! Do y'all already have an end-to-end setup somewhere? We need a k8s API server running to reproduce the issue; I don't think any unit- or integration-level test will catch this.

stevekuznetsov commented Oct 13 '25

Is this because the chart is trying to set managedFields directly? (Or can you please describe how to reproduce it, so I can understand how this is occurring?)

gjenkins8 commented Oct 13 '25

@gjenkins8 no, this is just a Helm upgrade using v4 + SSA on a chart created by a v3 install/upgrade. Naively, I would expect any codepath that triggers the patching flow from v4 to hit this, since the managed fields are always provided.

stevekuznetsov commented Oct 13 '25

@robertsirc sure thing! Do y'all already have an end-to-end setup somewhere? We need a k8s API server running to reproduce the issue; I don't think any unit- or integration-level test will catch this.

If I may suggest: we could mock this up in the TestPatchResourceServerSide test and test the negative case there?

robertsirc commented Oct 14 '25

I think you would be much better off testing the positive case ("a real server accepts this SSA request"), since there are a lot of rough edges. Testing individual rough edges in a unit test by checking for the negative is going to give you poor coverage until you accumulate enough of them to be useful, and some of the field-manager conflict bits will be impossible to test in that manner.
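
For illustration, a minimal sketch of that kind of positive test, assuming a real API server started via envtest (the test name, namespace, and field-manager name are hypothetical, not existing Helm test code):

package kube_test

import (
	"context"
	"encoding/json"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// TestServerSideApplyAccepted asserts the positive case against a real API
// server: a server-side apply request without managedFields is accepted.
func TestServerSideApplyAccepted(t *testing.T) {
	env := &envtest.Environment{}
	cfg, err := env.Start()
	if err != nil {
		t.Fatalf("starting envtest API server: %v", err)
	}
	defer env.Stop()

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		t.Fatalf("building client: %v", err)
	}

	// The apply payload: metadata.managedFields is deliberately not set.
	cm := corev1.ConfigMap{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "ConfigMap"},
		ObjectMeta: metav1.ObjectMeta{Name: "ssa-test", Namespace: "default"},
		Data:       map[string]string{"key": "value"},
	}
	payload, err := json.Marshal(cm)
	if err != nil {
		t.Fatalf("marshaling payload: %v", err)
	}

	force := true
	_, err = client.CoreV1().ConfigMaps("default").Patch(
		context.Background(), cm.Name, types.ApplyPatchType, payload,
		metav1.PatchOptions{FieldManager: "helm-test", Force: &force})
	if err != nil {
		t.Fatalf("server rejected the SSA patch: %v", err)
	}
}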

stevekuznetsov commented Oct 14 '25

@gjenkins8 no, this is just a Helm upgrade using v4 + SSA on a chart created by a v3 install/upgrade. Naively, I would expect any codepath that triggers the patching flow from v4 to hit this, since the managed fields are always provided.

In this case, it is expected that CSA managed fields are removed by:

// https://github.com/helm/helm/blob/main/pkg/kube/client.go#L738-L739
if updateOptions.upgradeClientSideFieldManager {
	patched, err := upgradeClientSideFieldManager(original, updateOptions.dryRun, updateOptions.fieldValidationDirective)
	...

// https://github.com/helm/helm/blob/main/pkg/kube/client.go#L1042-L1045
patchData, err := csaupgrade.UpgradeManagedFieldsPatch(
	info.Object,
	sets.New(fieldManagerName),
	fieldManagerName)
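
For context, a minimal sketch of how that csaupgrade helper is typically used; this is not Helm's actual code, and the helper name and dynamic-client usage are assumptions. The helper returns a patch that migrates managedFields entries owned by the old client-side-apply manager to the SSA manager name, and that patch is applied before the server-side apply:

package kubeutil

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/sets"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/util/csaupgrade"
)

// migrateFieldManager is a hypothetical helper: it rewrites managed-field
// ownership from the client-side-apply manager to the same name used for SSA.
func migrateFieldManager(ctx context.Context, c dynamic.Interface, gvr schema.GroupVersionResource, live *unstructured.Unstructured, manager string) error {
	patch, err := csaupgrade.UpgradeManagedFieldsPatch(live, sets.New(manager), manager)
	if err != nil {
		return err
	}
	if len(patch) == 0 {
		return nil // nothing owned by the CSA manager; no migration needed
	}
	_, err = c.Resource(gvr).Namespace(live.GetNamespace()).Patch(
		ctx, live.GetName(), types.JSONPatchType, patch, metav1.PatchOptions{})
	return err
}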

Can you describe fully how to reproduce the issue, please? A naive helm3 -> helm4 upgrade works for me.

gjenkins8 commented Oct 15 '25

Hmm, sorry, it looks like this was more likely adoption of existing CRDs or something. Notably, whatever weird state I ended up in was due to the other bug I found (failure to load old releases due to a bug in time encoding, recently fixed), so it's not clear whether an end user would end up in this state. Here's a trace I found in my scrollback from after I fixed the managed-fields issue:

[08:56:52.469] DEBUG: preparing upgrade {
  "name": "hypershift"
}
[08:56:52.469] DEBUG: getting last revision {
  "name": "hypershift"
}
[08:56:52.469] DEBUG: getting release history {
  "name": "hypershift"
}
[08:56:52.670] DEBUG: getting deployed releases {
  "name": "hypershift"
}
[08:56:52.914] DEBUG: number of dependencies in the chart {
  "dependencies": 0
}
metadata: map[APIVersion:v2 Annotations:map[] AppVersion: Condition: Dependencies:[] Deprecated:false Description:A Helm chart to install the Hypershift Operator and deps for ARO Home: Icon: Keywords:[] KubeVersion: Maintainers:[] Name:aro-hcp-hypershift-operator Sources:[] Tags: Type:application Version:0.1.0]
[08:56:52.915] DEBUG: determined release apply method {
  "previous_release_apply_method": "ssa",
  "server_side_apply": true
}
[08:56:53.002] DEBUG: performing update {
  "name": "hypershift"
}
[08:56:53.109] DEBUG: creating upgraded release {
  "name": "hypershift"
}
[08:56:53.109] DEBUG: creating release {
  "key": "sh.helm.release.v1.hypershift.v2"
}
[08:56:54.048] DEBUG: using server-side apply for resource update {
  "dryRun": false,
  "fieldValidationDirective": "Strict",
  "forceConflicts": true,
  "upgradeClientSideFieldManager": false
}
[08:56:54.048] DEBUG: checking resources for changes {
  "resources": 3
}
[08:56:55.559] DEBUG: Patched resource {
  "gvk": "/v1, Kind=ServiceAccount",
  "name": "hypershift-installer",
  "namespace": "hypershift"
}
[08:56:55.671] DEBUG: Patched resource {
  "gvk": "/v1, Kind=Secret",
  "name": "pull-secret",
  "namespace": "hypershift"
}
[08:56:58.995] DEBUG: Patched resource {
  "gvk": "rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding",
  "name": "hypershift-installer-cluster-role",
  "namespace": ""
}
[08:56:58.995] DEBUG: deleting resource {
  "kind": "CustomResourceDefinition",
  "name": "podmonitors.monitoring.coreos.com",
  "namespace": ""
}
[08:57:00.145] DEBUG: deleting resource {
  "kind": "CustomResourceDefinition",
  "name": "prometheusrules.monitoring.coreos.com",
  "namespace": ""
}
[08:57:00.217] DEBUG: deleting resource {
  "kind": "CustomResourceDefinition",
  "name": "routes.route.openshift.io",
  "namespace": ""
}
[08:57:00.299] DEBUG: deleting resource {
  "kind": "CustomResourceDefinition",
  "name": "servicemonitors.monitoring.coreos.com",
  "namespace": ""
}
[08:57:00.392] DEBUG: waiting for resources {
  "count": 3,
  "timeout": 300000000000
}

I'll go out on a limb and say that deleting CRDs as part of an upgrade likely shows we're well outside the normal execution flow here. Since I don't have a nice reproducer, we might as well close this out.

stevekuznetsov commented Oct 15 '25

We've reproduced the CRD deletion ....

stevekuznetsov commented Oct 22 '25

We are deploying this chart. Running the initial install while opting into CRDs here, then running an upgrade while omitting CRDs here, is sufficient to reproduce the issue.

stevekuznetsov commented Oct 22 '25

Looks like at least this one was my error: https://github.com/helm/helm/pull/31413

stevekuznetsov commented Oct 23 '25

We are deploying this chart. Running the initial install while opting into CRDs here, then running an upgrade while omitting CRDs here, is sufficient to reproduce the issue.

ah, thanks!

I wonder if this is related to SSA? The above seems like Helm is (incorrectly) deleting CRDs if "SkipCRDs" is set, seemingly because they are omitted from the list of resources to be "updated", and the update reconciliation object is translating that into a delete operation: https://github.com/helm/helm/blob/main/pkg/kube/client.go#L555
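
For illustration, a schematic sketch (hypothetical types, not Helm's actual implementation) of the kind of reconciliation that would turn "omitted from the new manifest" into a delete:

package kubeutil

// resourceKey is a hypothetical identity for a rendered resource.
type resourceKey struct {
	APIVersion, Kind, Namespace, Name string
}

// toDelete sketches the reconciliation step: anything present in the previous
// release but absent from the newly rendered manifest gets scheduled for
// deletion. If CRDs are skipped on upgrade, they fall out of desired and land here.
func toDelete(previous, desired map[resourceKey]struct{}) []resourceKey {
	var out []resourceKey
	for k := range previous {
		if _, stillWanted := desired[k]; !stillWanted {
			out = append(out, k)
		}
	}
	return out
}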

gjenkins8 commented Oct 23 '25

I wonder if this is related to SSA?

Could you confirm this is unrelated to SSA, please? (And that my understanding is correct; I want/need to fix any SSA issues.)

gjenkins8 commented Oct 27 '25

@gjenkins8 that particular CRD issue was from my use of IncludeCRDs... an extremely misleading name; I sent another PR for that. I will get you a reproducer log trace for this particular issue (invalid managed fields) shortly.

stevekuznetsov commented Nov 04 '25

@gjenkins8 that particular CRD issue was from my use of IncludeCRDs... an extremely misleading name; I sent another PR for that. I will get you a reproducer log trace for this particular issue (invalid managed fields) shortly.

Thanks! Please do. I want/need to get fixes in before next week, but I'm unsure whether there is an issue here.

gjenkins8 commented Nov 09 '25