structured-merge-diff
`dry-run` sometimes misses metadata and causes `failed to prune fields` error during CRD conversion
Hello! We are developing a custom operator and using Flux CD.
We have a v1alpha1 custom resource that is deployed by Flux CD.
When we upgraded the operator from v1alpha1 to v1alpha2, Flux notified us that the dry run had failed with the following error message.
dry-run failed, error: failed to prune fields: failed add back owned items: failed to convert pruned object at version <foo.com>/v1alpha1: conversion webhook for <foo.com>/v1alpha2, Kind=<resource> returned invalid metadata: invalid metadata of type <nil> in input object
Running the following dry-run command also fails intermittently (roughly once in five attempts) with the same error message.
$ kubectl apply --server-side --dry-run=server -f <v1alpha1-resource.yaml> --field-manager kustomize-controller
Error from server: failed to prune fields: failed add back owned items: failed to convert pruned object at version <foo.com>/v1alpha1: conversion webhook for <foo.com>/v1alpha2, Kind=<resource> returned invalid metadata: invalid metadata of type <nil> in input object
~~But performing the actual conversion (the following command) never fails.~~
Correction: performing the actual conversion also sometimes fails.
$ kubectl apply --server-side -f <v1alpha1-resource.yaml> --field-manager kustomize-controller
<foo.com>/<resource> serverside-applied
The flakiness might be a key to solving this.
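A loop like the following can be used to demonstrate the flakiness (same placeholder manifest as above):
$ for i in $(seq 1 10); do kubectl apply --server-side --dry-run=server -f <v1alpha1-resource.yaml> --field-manager kustomize-controller; done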
Our conversion code is similar to https://github.com/IBM/operator-sample-go/blob/b79e66026a5cc5b4994222f2ef7aa962de9f7766/operator-application/api/v1alpha1/application_conversion.go#L37
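For reference, here is a minimal sketch of that pattern (the type name, module path, and field names below are placeholders, not the linked sample's actual code), using controller-runtime's conversion.Convertible interface with v1alpha2 as the Hub version:

```go
// Minimal sketch of ConvertTo/ConvertFrom for a kubebuilder-style API type.
// Type name, module path, and fields are placeholders.
package v1alpha1

import (
	"sigs.k8s.io/controller-runtime/pkg/conversion"

	v1alpha2 "example.com/my-operator/api/v1alpha2" // placeholder module path
)

// ConvertTo converts this v1alpha1 object to the v1alpha2 Hub version.
func (src *MyResource) ConvertTo(dstRaw conversion.Hub) error {
	dst := dstRaw.(*v1alpha2.MyResource)

	// ObjectMeta is copied across unchanged between versions.
	dst.ObjectMeta = src.ObjectMeta

	dst.Spec.Attribute1 = src.Spec.Attribute1
	dst.Spec.Attribute2 = src.Spec.Attribute2
	dst.Status.Attribute1 = src.Status.Attribute1
	dst.Status.Attribute2 = src.Status.Attribute2
	return nil
}

// ConvertFrom converts from the v1alpha2 Hub version back to v1alpha1.
func (dst *MyResource) ConvertFrom(srcRaw conversion.Hub) error {
	src := srcRaw.(*v1alpha2.MyResource)

	dst.ObjectMeta = src.ObjectMeta
	dst.Spec.Attribute1 = src.Spec.Attribute1
	dst.Spec.Attribute2 = src.Spec.Attribute2
	dst.Status.Attribute1 = src.Status.Attribute1
	dst.Status.Attribute2 = src.Status.Attribute2
	return nil
}
```

In this sketch ObjectMeta is copied across verbatim, so a well-formed input keeps its metadata.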
We checked the conversion webhook log. A single dry-run command called the ConvertTo function three times and the ConvertFrom function three times. On the last of the three calls to each function, the request lacks the metadata and spec information, and that is the call that fails.
The failing log looks like "metadata":{"creationTimestamp":null},"spec":{}
(A normal log looks like "metadata":{"name":"<foo>","namespace":"<foo>","uid":"09b69792-56d5-4217-b23c-4d418d3f904b","resourceVersion":"1707796","generation":3,"creationTimestamp":"2022-09-16T07:28:54Z","labels":{"kustomize.toolkit.fluxcd.io/name":"<foo>","kustomize.toolkit.fluxcd.io/namespace":"flux-system"}},"spec":{"attribute1":[{...)
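Something like the following helper, called at the top of ConvertTo/ConvertFrom, is enough to capture these payloads (the helper and logger name are illustrative, not our exact code):

```go
// Illustrative debug helper for dumping what a conversion function receives,
// so degraded requests with empty metadata/spec show up in the webhook log.
package v1alpha1

import (
	"encoding/json"

	ctrl "sigs.k8s.io/controller-runtime"
)

func logConversionInput(direction string, obj interface{}) {
	raw, err := json.Marshal(obj)
	if err != nil {
		ctrl.Log.WithName("conversion").Error(err, "failed to marshal input", "direction", direction)
		return
	}
	ctrl.Log.WithName("conversion").Info("conversion called", "direction", direction, "input", string(raw))
}
```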
We could confirm that this odd behavior happens when managedFields contains two managers (kustomize-controller and our operator), as follows:
apiVersion: <foo.com>/v1alpha2
kind: <MyResource>
metadata:
  creationTimestamp: "2022-09-15T04:52:03Z"
  generation: 1
  labels:
    kustomize.toolkit.fluxcd.io/name: operator-sample
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  managedFields:
  - apiVersion: <foo.com>/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:kustomize.toolkit.fluxcd.io/name: {}
          f:kustomize.toolkit.fluxcd.io/namespace: {}
      f:spec:
        f:attribute1: {}
        f:attribute2: {}
    manager: kustomize-controller
    operation: Apply
    time: "2022-09-15T04:52:03Z"
  - apiVersion: <foo.com>/v1alpha2
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:attribute1: {}
        f:attribute2: {}
    manager: <our-operator>
    operation: Update
    time: "2022-09-15T04:52:04Z"
  name: v1alpha1-flux
  namespace: flux
  resourceVersion: "483157"
  uid: 696bed77-a12b-45d0-b240-8d685cf790e0
spec:
  ...
status:
  ...
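(For reference, the managedFields above come from the live object; recent kubectl hides them by default, so they have to be requested explicitly, e.g.:)
$ kubectl get <MyResource> v1alpha1-flux -n flux -o yaml --show-managed-fields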
I asked this question in the Flux repo but could not find out the reason: https://github.com/fluxcd/flux2/discussions/3105
I have been stuck on this for more than a week, so any ideas would be really appreciated. Thanks!
@kwiesmueller Sorry to mention you directly. 🙇 I saw your TODO comment in https://github.com/kubernetes-sigs/structured-merge-diff/blob/26781d0c10bfdbd7d66b18d8be83985f623df9f8/merge/update.go#L193
Could it be related to this issue?
I created a sample repo to reproduce this error: https://github.com/LittleWat/conversion-webhook-test-with-flux
I hope this repo is useful for debugging. Thank you!
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Also experiencing this exact issue.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.