structured-merge-diff
`dry-run` sometimes misses metadata and causes `failed to prune fields` error during CRD conversion
Hello! We are developing a custom operator and using Flux CD.
We have a v1alpha1 custom resource that is deployed by Flux CD.
When we upgraded the operator from v1alpha1 to v1alpha2, Flux notified us that the dry run had failed with the following error message.
dry-run failed, error: failed to prune fields: failed add back owned items: failed to convert pruned object at version <foo.com>/v1alpha1: conversion webhook for <foo.com>/v1alpha2, Kind=<resource> returned invalid metadata: invalid metadata of type <nil> in input object
Running the following dry-run command also fails intermittently (roughly once in five attempts) with the same error message.
$ kubectl apply --server-side --dry-run=server -f <v1alpha1-resource.yaml> --field-manager kustomize-controller
Error from server: failed to prune fields: failed add back owned items: failed to convert pruned object at version <foo.com>/v1alpha1: conversion webhook for <foo.com>/v1alpha2, Kind=<resource> returned invalid metadata: invalid metadata of type <nil> in input object
~~But performing the actual conversion (the following command) never fails.~~
Correction: performing the actual conversion also sometimes fails.
$ kubectl apply --server-side -f <v1alpha1-resource.yaml> --field-manager kustomize-controller
<foo.com>/<resource> serverside-applied
The flakiness might be a key to solving this.
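A loop like the following can be used to demonstrate the flakiness (same placeholder manifest as above):
$ for i in $(seq 1 10); do kubectl apply --server-side --dry-run=server -f <v1alpha1-resource.yaml> --field-manager kustomize-controller; done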
Our conversion code is similar to https://github.com/IBM/operator-sample-go/blob/b79e66026a5cc5b4994222f2ef7aa962de9f7766/operator-application/api/v1alpha1/application_conversion.go#L37
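For reference, here is a minimal sketch of that pattern (the type name, module path, and field names below are placeholders, not the linked sample's actual code), using controller-runtime's conversion.Convertible interface with v1alpha2 as the Hub version:

```go
// Minimal sketch of ConvertTo/ConvertFrom for a kubebuilder-style API type.
// Type name, module path, and fields are placeholders.
package v1alpha1

import (
	"sigs.k8s.io/controller-runtime/pkg/conversion"

	v1alpha2 "example.com/my-operator/api/v1alpha2" // placeholder module path
)

// ConvertTo converts this v1alpha1 object to the v1alpha2 Hub version.
func (src *MyResource) ConvertTo(dstRaw conversion.Hub) error {
	dst := dstRaw.(*v1alpha2.MyResource)

	// ObjectMeta is copied across unchanged between versions.
	dst.ObjectMeta = src.ObjectMeta

	dst.Spec.Attribute1 = src.Spec.Attribute1
	dst.Spec.Attribute2 = src.Spec.Attribute2
	dst.Status.Attribute1 = src.Status.Attribute1
	dst.Status.Attribute2 = src.Status.Attribute2
	return nil
}

// ConvertFrom converts from the v1alpha2 Hub version back to v1alpha1.
func (dst *MyResource) ConvertFrom(srcRaw conversion.Hub) error {
	src := srcRaw.(*v1alpha2.MyResource)

	dst.ObjectMeta = src.ObjectMeta
	dst.Spec.Attribute1 = src.Spec.Attribute1
	dst.Spec.Attribute2 = src.Spec.Attribute2
	dst.Status.Attribute1 = src.Status.Attribute1
	dst.Status.Attribute2 = src.Status.Attribute2
	return nil
}
```

In this sketch ObjectMeta is copied across verbatim, so a well-formed input keeps its metadata.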
We checked the conversion webhook log. A single dry-run command called the ConvertTo function three times and the ConvertFrom function three times. On the last of the three calls to each function, the request lacks the metadata and spec information, and that is the call that fails.
The failing log looks like "metadata":{"creationTimestamp":null},"spec":{}
(A normal log looks like "metadata":{"name":"<foo>","namespace":"<foo>","uid":"09b69792-56d5-4217-b23c-4d418d3f904b","resourceVersion":"1707796","generation":3,"creationTimestamp":"2022-09-16T07:28:54Z","labels":{"kustomize.toolkit.fluxcd.io/name":"<foo>","kustomize.toolkit.fluxcd.io/namespace":"flux-system"}},"spec":{"attribute1":[{...)
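Something like the following helper, called at the top of ConvertTo/ConvertFrom, is enough to capture these payloads (the helper and logger name are illustrative, not our exact code):

```go
// Illustrative debug helper for dumping what a conversion function receives,
// so degraded requests with empty metadata/spec show up in the webhook log.
package v1alpha1

import (
	"encoding/json"

	ctrl "sigs.k8s.io/controller-runtime"
)

func logConversionInput(direction string, obj interface{}) {
	raw, err := json.Marshal(obj)
	if err != nil {
		ctrl.Log.WithName("conversion").Error(err, "failed to marshal input", "direction", direction)
		return
	}
	ctrl.Log.WithName("conversion").Info("conversion called", "direction", direction, "input", string(raw))
}
```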
We could confirm that this odd behavior happens when managedFields contains two managers (kustomize-controller and our operator), as follows:
apiVersion: <foo.com>/v1alpha2
kind: <MyResource>
metadata:
  creationTimestamp: "2022-09-15T04:52:03Z"
  generation: 1
  labels:
    kustomize.toolkit.fluxcd.io/name: operator-sample
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  managedFields:
  - apiVersion: <foo.com>/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:kustomize.toolkit.fluxcd.io/name: {}
          f:kustomize.toolkit.fluxcd.io/namespace: {}
      f:spec:
        f:attribute1: {}
        f:attribute2: {}
    manager: kustomize-controller
    operation: Apply
    time: "2022-09-15T04:52:03Z"
  - apiVersion: <foo.com>/v1alpha2
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:attribute1: {}
        f:attribute2: {}
    manager: <our-operator>
    operation: Update
    time: "2022-09-15T04:52:04Z"
  name: v1alpha1-flux
  namespace: flux
  resourceVersion: "483157"
  uid: 696bed77-a12b-45d0-b240-8d685cf790e0
spec:
  ...
status:
  ...
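(For reference, the managedFields above come from the live object; recent kubectl hides them by default, so they have to be requested explicitly, e.g.:)
$ kubectl get <MyResource> v1alpha1-flux -n flux -o yaml --show-managed-fields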
I asked this question in the Flux repo but could not find out the reason: https://github.com/fluxcd/flux2/discussions/3105
I have been stuck on this for more than a week, so any ideas would be really appreciated. Thanks!
@kwiesmueller Sorry to mention you directly. 🙇 I saw your TODO comment in https://github.com/kubernetes-sigs/structured-merge-diff/blob/26781d0c10bfdbd7d66b18d8be83985f623df9f8/merge/update.go#L193
Could it be related to this issue?
I created a sample repo to reproduce this error: https://github.com/LittleWat/conversion-webhook-test-with-flux
I hope this repo is useful for debugging. Thank you!
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Also experiencing this exact issue.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.