crossplane icon indicating copy to clipboard operation
crossplane copied to clipboard

Add an `--enable-cached-compositions` feature flag

Open negz opened this issue 1 year ago • 2 comments

Description of your changes

Fixes https://github.com/crossplane/crossplane/issues/5338

When this flag is enabled we'll cache *Unstructured objects.

I have:

  • [x] Read and followed Crossplane's contribution process.
  • [x] Run make reviewable to ensure this PR is ready for review.
  • [ ] ~Added or updated unit tests.~
  • [x] Added or updated e2e tests.
  • [ ] Linked a PR or a docs tracking issue to document this change.
  • [ ] ~Added backport release-x.y labels to auto-backport this PR.~

Need help with this checklist? See the cheat sheet.

negz avatar Feb 05 '24 23:02 negz

Looks like this fails E2Es pretty consistently. I'd vote we don't try squeeze it into 1.15.

negz avatar Feb 06 '24 14:02 negz

Totally agree, @negz.

phisco avatar Feb 06 '24 14:02 phisco

Hey @negz I tinkered with this PR and I feel like there might be a problem with cache expiration time with *Unstructured objects or maybe a type problem when we try to get a non *Unstructured type but try to get it from the *Unstructured cache reader. Here is why think like this:

  1. Github Actions for E2E tests passes the first test TestCompositionMinimal but fails the second one TestCompositionPatchAndTransform. When I ran tests locally generally it passes the first but sometimes this is not the case. So it made me think that maybe the cache expires at an arbitrary time which causes this behavior.
  2. When I tried to run tests one by one using GoLand's test running tool, tests before TestCompositionFunctions passed. I also ran tests consecutively using go test with GoLand while running Crossplane as a process to ensure that EnableAlphaCachedCompositions is True and again the same happened.
  3. When tests fail crossplane's logs almost everytime has a variety of this log so I thought maybe we have some problem with the cache reader types but I am not sure:
2024-04-29T00:12:06+03:00       DEBUG   crossplane      cannot get claim  {"controller": 
"offered/compositeresourcedefinition.apiextensions.crossplane.io", 
"request": {"name":"xnopresources.nop.example.org"}, "uid": "1d0ca97c-d11f-4153-b360-7c7173172f2e", "version": "1126",
"name": "xnopresources.nop.example.org", "controller": "claim/xnopresources.nop.example.org", "request"":
{"name":"apiextensions-composition-functions","namespace":"default"}, "error": "NopResource.nop.example.org 
\"apiextensions-composition-functions\" not found"}

I am still working on this PR to better understand and solve the issue.

enesonus avatar Apr 28 '24 22:04 enesonus

Hi @negz I think I understood the problem that is causing our E2E tests to fail. The inconsistent behavior was not because of cache expiration but early read operation.

Investigating the problem first of all I looked at this blog, it was routing to a new solution that is merged to controller-runtime but this was deprecated. Then I looked at cluster-api project as a source and found out that they create different clients for normal and unstructured caching but this approach did not succeeded in our codebase unfortunately.

As in my last comment I looked for any expiration time or cache flush etc. inside controller-runtime codebase but I could not find anything about that. I also tried to increase memory but this was also unsuccesful. Since after we cant find a resource we do not requeue and status stays as 'not found'. We also do not reconcile again if we do not get any events i.e. status stays 'not found' and causing tests to fail. I created this https://github.com/crossplane/crossplane/pull/5644 draft PR that reads from the API server directly if we cant find the resource from the cache.

enesonus avatar May 01 '24 20:05 enesonus

Per my comment at https://github.com/crossplane/crossplane/pull/5644#discussion_r1589840058 I realized this is way more subtle and complex than I had remembered.

Notably we can't enable caching of Unstructured resources without also finding some way to garbage collect the cache's informers when their GVK no longer exists (e.g. due to their CRD being deleted). If we don't, the informer will live forever complaining that it can't list its GVK.

negz avatar May 06 '24 19:05 negz