Realtime composition rendering spinning with valid resources when not using `mode: Pipeline`

Open chlunde opened this issue 7 months ago • 12 comments

What happened?

On a test cluster with 1.20.0-rc.1 I see a couple of compositions "spinning", reconciling a resource in a tight loop.

provider-aws tags

In one case there's an old community provider-aws version with array-based tags:

    tags:
    - key: kubernetes_namespace
      value: demoimage

the provider then instantly writes back:

    - key: crossplane-name
      value: app-7x459-m86pn
    - key: crossplane-providerconfig
      value: default
    - key: crossplane-kind
      value: repository.ecr.aws.crossplane.io

This in itself is not a new issue, but it didn't spin this quickly before. I think there should be some kind of protection against this, as the CloudTrail, GuardDuty and EKS audit log bills will be huge for anyone on AWS running into this and similar issues.

provider-kubernetes apiVersion / conversion

Another case where I see similar spinning, but with an unknown cause, is a provider-kubernetes resource where the composition uses v1alpha1 instead of v1alpha2.

How can we reproduce it?

    apiVersion: example.com/v1alpha1
    kind: Spin
    metadata:
      name: spinner
      namespace: crossplane-system
    spec: {}
    ---
    apiVersion: apiextensions.crossplane.io/v1
    kind: CompositeResourceDefinition
    metadata:
      name: compositespins.example.com
    spec:
      group: example.com
      names:
        kind: CompositeSpin
        plural: compositespins
      claimNames:
        kind: Spin
        plural: spins
      versions:
      - name: v1alpha1
        served: true
        referenceable: true
        schema:
          openAPIV3Schema:
            type: object
            properties: {}
    ---
    apiVersion: apiextensions.crossplane.io/v1
    kind: Composition
    metadata:
      name: compositespins.example.com
    spec:
      writeConnectionSecretsToNamespace: crossplane-system
      compositeTypeRef:
        apiVersion: example.com/v1alpha1
        kind: CompositeSpin
      resources:
        - base:
            apiVersion: kubernetes.crossplane.io/v1alpha1
            kind: Object
            spec:
              forProvider:
                manifest:
                  apiVersion: v1
                  kind: ConfigMap
                  metadata:
                    namespace: crossplane-system

What environment did it happen in?

Crossplane version: 1.20.0-rc.1

chlunde avatar May 18 '25 20:05 chlunde

I don't know if it's the same issue but I can get the non-realtime composition function reconciler to spin as well - I believe it's hitting this line: https://github.com/crossplane/crossplane/blob/main/internal/controller/apiextensions/composite/composition_functions.go#L525 and the controller is not backing off like it should. Applying an invalid resource has many possible causes, and this one appears to not be handled properly and the controller will constantly reconcile until the condition resolves itself. I was going to open a separate issue for it, but it may be related to this scenario.

bobh66 avatar May 19 '25 01:05 bobh66

@bobh66 in both cases here the resources are valid, so I don't think my cases should hit that line.

chlunde avatar May 19 '25 05:05 chlunde

> @bobh66 in both cases here the resources are valid, so I don't think my cases should hit that line.

Thanks, I'll open a separate issue.

bobh66 avatar May 19 '25 12:05 bobh66

community provider-aws function approach

Crossplane community provider-aws has a couple of issues with tags:

  • The provider auto-generates tags, which are not available to the composition. This means that if the composition sets tags, the auto-generated ones are deleted via the Kubernetes API on every reconcile. This triggers a reconcile of the managed resource, which adds them back, but that means we also reconcile with AWS for every Crossplane reconcile.
  • Ordering can change, which also triggers reconciles.

So we could have a function that:

  • Sorts tags to have a consistent ordering
  • Copies crossplane- prefixed tags from observed to desired state
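
As a rough illustration of the two steps above, here is a minimal sketch in plain Python. It is not the function mentioned in this thread and does not use any Crossplane SDK; `merge_tags` and its parameters are hypothetical names, and tags are assumed to be the `key`/`value` lists shown earlier in this issue.

```python
from typing import Dict, List

Tag = Dict[str, str]


def merge_tags(desired: List[Tag], observed: List[Tag],
               prefix: str = "crossplane-") -> List[Tag]:
    """Return the desired tags plus provider-generated tags, sorted by key.

    Hypothetical helper: `desired` is what the composition wants,
    `observed` is what the managed resource currently carries.
    """
    merged = {tag["key"]: tag["value"] for tag in desired}
    for tag in observed:
        # Only adopt tags the provider generated itself; on a key conflict
        # the composition's own value is kept.
        if tag["key"].startswith(prefix):
            merged.setdefault(tag["key"], tag["value"])
    # Sorting gives a stable order, so reordering alone never causes a diff.
    return [{"key": key, "value": merged[key]} for key in sorted(merged)]
```

With this, the desired state already contains the `crossplane-` tags the provider would write back, so neither side should have a reason to rewrite the list.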

Do you think such a function might belong to crossplane-contrib? :thinking_face:

I happen to have written one, but I never actually tried it.

SSA approach

Another approach, now that we have SSA, would be for the community provider to use proper markers for tags:

    // +listType=map
    // +listMapKey=key

https://kubernetes.io/docs/reference/using-api/server-side-apply/#merge-strategy
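
To make the difference concrete, here is a toy model in Python of how an atomic list and a keyed map list behave when one manager applies its entries. It is only a sketch of the ownership idea, not the real server-side apply algorithm, and the function names are made up for illustration.

```python
def apply_atomic(live, applied):
    """Atomic list: the applier owns the whole list, so it replaces everything."""
    return list(applied)


def apply_map_list(live, applied, owned_keys):
    """Map list keyed by "key": the applier only owns the entries it applies.

    `owned_keys` is the set of keys this applier owned after its previous apply.
    """
    applied_by_key = {entry["key"]: entry for entry in applied}
    result = []
    for entry in live:
        key = entry["key"]
        if key in applied_by_key:
            # The applier sets this entry, so its value wins.
            result.append(applied_by_key.pop(key))
        elif key not in owned_keys:
            # Owned by another manager (e.g. the provider): leave it alone.
            result.append(entry)
        # Otherwise the applier used to own it and dropped it: remove it.
    # Entries that are new to the list are appended.
    result.extend(applied_by_key.values())
    return result
```

In the atomic case the composition's apply wipes the provider's `crossplane-` tags every time, which is the fight described above; in the keyed case each side keeps its own entries.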

But either way I don't think it should reconcile this fast, so this will not be a complete fix for this issue. Also, it does not explain the provider-kubernetes thing.

chlunde avatar May 19 '25 15:05 chlunde

Can function-tag-manager help?

bobh66 avatar May 19 '25 15:05 bobh66

@bobh66 I think I looked at it for another use case, extracting "team" from a namespace label to a tag for all resources, and noticed that it didn't support array-based tags / community provider tags. If that can be fixed, then maybe we could use that.

chlunde avatar May 19 '25 16:05 chlunde

Updated reproducer:

  • Needs to use provider-kubernetes v1alpha1 (maybe because that triggers the conversion webhook)
  • Must use resources: and not mode: Pipeline

chlunde avatar May 19 '25 16:05 chlunde

Re community provider, the SSA approach seems to work well!

chlunde avatar May 19 '25 19:05 chlunde

It's also spinning for me. But we're using pipeline mode. It happens when we try to deploy Objects for provider-kubernetes. We're deploying them with the correct v1alpha2 version.

Also, there's nothing actually changing between the reconciles except the generation and resourceVersion. Any ideas how I could debug this?

Kidswiss avatar May 20 '25 12:05 Kidswiss

I wrote down some kubectl/jq commands I used to discover how many mode: Resources compositions we had and some migration tips: https://github.com/crossplane/crossplane/discussions/6477
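
Independently of the commands in that discussion, a small Python sketch of the same idea could look like the following. It assumes the input is the parsed output of `kubectl get compositions -o json` and that a missing `spec.mode` means `Resources`; `legacy_compositions` is a hypothetical name.

```python
def legacy_compositions(composition_list: dict) -> list:
    """Return the names of compositions that do not use mode: Pipeline."""
    names = []
    for item in composition_list.get("items", []):
        # Assumption: an absent mode is treated as the Resources default.
        mode = item.get("spec", {}).get("mode", "Resources")
        if mode != "Pipeline":
            names.append(item["metadata"]["name"])
    return names
```

To use it, load the kubectl output with `json.load` and pass the resulting dict to the function.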

chlunde avatar May 20 '25 17:05 chlunde

Just dropping some thoughts from @negz here for future reference, with light edits for context:

The general problem here seems to be an XR and a provider fighting over desired state, resulting in frequent reconciliations. Yes, this can get worse with watches from realtime compositions.

To deal with this general scenario, we'd essentially have to apply rate limiting to regular non-error reconciles. Last time we checked, that's hard to do with controller-runtime because it assumes it never needs to rate limit a reconcile unless it returns an error or Requeue: true, which is not the case in this scenario.
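
For illustration only, rate limiting successful reconciles could be modelled as a per-resource minimum interval. The Python sketch below is a toy model of that idea, not controller-runtime code; `MinIntervalLimiter` and its method are hypothetical names.

```python
class MinIntervalLimiter:
    """Spaces out reconciles of the same key by at least `min_interval` seconds."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._next_allowed = {}  # key -> earliest time the next reconcile may start

    def delay(self, key: str, now: float) -> float:
        """Return how long to wait before reconciling `key` (0.0 means run now)."""
        earliest = self._next_allowed.get(key, now)
        start = max(now, earliest)
        # Reserve the slot, so back-to-back events queue up behind each other.
        self._next_allowed[key] = start + self.min_interval
        return start - now
```

A resource that changes rarely is never delayed, while one caught in a fight with its provider is capped at one reconcile per interval.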

There will be more to think through here for a good general solution 🤓

jbw976 avatar May 20 '25 18:05 jbw976

> It's also spinning for me. But we're using pipeline mode. It happens when we try to deploy Objects for provider-kubernetes. We're deploying them with the correct v1alpha2 version.
>
> Also, there's nothing actually changing between the reconciles except the generation and resourceVersion. Any ideas how I could debug this?

I found the issue: the way we serialized the Objects caused the inner manifest to have a creationDate: nil. This caused the reconcile loop.

EDIT: I agree with the above statement by @jbw976: it's currently very easy to introduce changes that can lead to reconcile loops, and then it's not easy to debug and find what actually causes the issues...
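
One way to hunt for this kind of difference is to compare what is being applied with what is stored, ignoring the fields that always change. The Python sketch below assumes both objects are available as plain dicts (for example loaded from JSON); the helper names are hypothetical and the field name in the usage note mirrors the one mentioned above.

```python
import copy

# Metadata fields that change on every write and only add noise to a diff.
VOLATILE = ("generation", "resourceVersion", "managedFields")


def strip_volatile(obj: dict) -> dict:
    """Return a copy of obj without the volatile metadata fields."""
    out = copy.deepcopy(obj)
    meta = out.get("metadata", {})
    for field in VOLATILE:
        meta.pop(field, None)
    return out


def diff_paths(a, b, path=""):
    """List dotted paths that differ: '-' only in a, '+' only in b, '~' changed."""
    if isinstance(a, dict) and isinstance(b, dict):
        out = []
        for key in sorted(set(a) | set(b)):
            sub = f"{path}.{key}" if path else key
            if key not in b:
                out.append(f"- {sub}")
            elif key not in a:
                out.append(f"+ {sub}")
            else:
                out.extend(diff_paths(a[key], b[key], sub))
        return out
    # Lists and scalars are compared as a whole.
    return [] if a == b else [f"~ {path}"]
```

Calling `diff_paths(strip_volatile(desired), strip_volatile(observed))` on a manifest that carries an explicit null `creationDate` on one side only would report `- metadata.creationDate`, which points straight at the field that keeps being re-applied.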

Kidswiss avatar May 20 '25 19:05 Kidswiss

There is a new priority queue implementation in controller-runtime: https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/controller/priorityqueue/priorityqueue.go#L54 but it is in version 1.21 of controller-runtime, and we are currently pinned at 1.19.

n3wscott avatar Jun 24 '25 22:06 n3wscott