Validation to ensure that the CRD exists – Handling the non-existent CRD scenario.
**Describe the bug**
When the CRD doesn't exist, the controller errors out. This behavior is expected from the application's perspective. From a platform perspective, however, there are scenarios where CRDs are installed by platform operators while controllers are deployed by ACK controller admins. Though an edge case, the CRDs installed by platform operators may lag behind the controller version, or platform operators may only want to enable certain versions on the platform. In that situation, the controller remains in an error state because it cannot find the required CRD.
For example, consider ElastiCache release v0.0.29, which didn't include the CacheCluster CR; it was only added in v0.1.0.
```
{"level":"error","ts":"2024-08-21T13:57:27.722Z","logger":"controller-runtime.source.EventHandler","msg":"if kind is a CRD, it should be installed before calling Start","kind":"CacheCluster.elasticache.services.k8s.aws","error":"no matches for kind \"CacheCluster\" in version \"elasticache.services.k8s.aws/v1alpha1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:63\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:87\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:88\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:56"}
```
**Steps to reproduce**

**Expected outcome**
Is there a possibility to add an ignore-CRD check in the runtime?

**Environment**
- Kubernetes version 1.28
- Using EKS (yes/no), if so version? yes
- AWS service targeted (S3, RDS, etc.): We observed this on the ElastiCache controller, but it's valid for other controllers as well.
Hi!
What is the difference from #2007? It seems this ticket is an enhancement of that one.
> Is there a possibility to add an ignore CRD check in the runtime?
That would mean we need to disable handling of particular CRDs in the controller; otherwise it would silently ignore some CRDs, and the cluster operator wouldn't know that some functionality is, let's say, not enabled.
Also, from my perspective it would be a very rare case for somebody other than the platform operator to install ACK. It looks like only platform operators should install the ACK operators in the cluster, following a strict procedure (CRDs first, then the operator itself).
Hi @rahtr ,
The controller-runtime package creates watch streams for resources it needs to manage. When initializing these watches during controller startup, it directly queries the Kubernetes API server to establish subscriptions for specific resource kinds. If a CRD does not exist, this fundamental initialization step fails, preventing the controller from starting properly.
```
if kind is a CRD, it should be installed before calling Start","kind":"CacheCluster.elasticache.services.k8s.aws","error":"no matches for kind \"CacheCluster\
```
Controller-runtime is designed with the assumption that all CRDs must be available before controller initialization. This behavior is actually by design, as confirmed by a maintainer in a GitHub issue about controller recovery from missing CRDs:
> "This timeout was deliberately added, because before that, it would just silently not work"
Previously, controllers would silently fail without indicating the actual problem, making troubleshooting difficult. The current behavior with explicit errors and timeouts makes the problem obvious.
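One way the runtime could surface this condition gracefully is to compare the kinds it wants to reconcile against the kinds the API server actually serves, and skip (with a loud warning) the reconcilers for missing kinds instead of failing startup. This is a hypothetical sketch, not current controller-runtime behavior: in a real controller the `available` list would come from the Kubernetes discovery API; here it is a plain slice so the sketch stays self-contained.

```go
package main

import "fmt"

// missingKinds returns the kinds a controller wants to reconcile that the
// API server does not serve. "available" would normally be built from the
// Kubernetes discovery API; here it is plain data for illustration.
func missingKinds(wanted, available []string) []string {
	served := make(map[string]bool, len(available))
	for _, k := range available {
		served[k] = true
	}
	var missing []string
	for _, k := range wanted {
		if !served[k] {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	// e.g. a cluster whose CRDs lag the controller and serve no CacheCluster yet.
	wanted := []string{"ReplicationGroup", "CacheCluster"}
	available := []string{"ReplicationGroup"}
	for _, k := range missingKinds(wanted, available) {
		fmt.Printf("warning: CRD for kind %q not installed; skipping its reconciler\n", k)
	}
}
```

The trade-off the maintainers point out still applies: any skip must be logged prominently, or the cluster operator won't know some kinds are not being reconciled.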
We plan to support something (community feedback would be great) that provides a scope so reconcilers are created only for the CRDs the admin decides to enable. However, that would not address this specific case, since the issue arises because the admins are not aware of which CRDs the platform teams have enabled. This edge case will remain.
what do you think @a-hilaly @michaelhtm @gecube ?
A `--reconcile-resources` flag that allows operators to specify which resource kinds should be actively reconciled, enabling more efficient resource utilization and operational control. When provided with a comma-separated list of resource kinds (e.g., `"Queue,Topic"`), the controller will only create reconcilers for those specific resources. This enhancement maintains backward compatibility by continuing to reconcile all resources when the flag is not specified.
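As a rough sketch of how such a flag could be handled (the names `parseReconcileResources` and `shouldReconcile` are illustrative, not necessarily what the linked PR implements): split the comma-separated value into a lookup set, with an unset flag meaning "reconcile everything" to preserve backward compatibility.

```go
package main

import (
	"fmt"
	"strings"
)

// parseReconcileResources turns a comma-separated --reconcile-resources
// value into a set of kinds. A nil return means the flag was unset and
// all resources should be reconciled (the backward-compatible default).
func parseReconcileResources(value string) map[string]bool {
	if strings.TrimSpace(value) == "" {
		return nil
	}
	set := make(map[string]bool)
	for _, kind := range strings.Split(value, ",") {
		if kind = strings.TrimSpace(kind); kind != "" {
			set[kind] = true
		}
	}
	return set
}

// shouldReconcile reports whether a reconciler for kind should be created.
func shouldReconcile(enabled map[string]bool, kind string) bool {
	return enabled == nil || enabled[kind]
}

func main() {
	enabled := parseReconcileResources("Queue, Topic")
	fmt.Println(shouldReconcile(enabled, "Queue"))        // true
	fmt.Println(shouldReconcile(enabled, "Subscription")) // false
}
```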
https://github.com/aws-controllers-k8s/runtime/pull/176
/close
@rushmash91: Closing this issue.
In response to this:
> A `--reconcile-resources` flag that allows operators to specify which resource kinds should be actively reconciled, enabling more efficient resource utilization and operational control. When provided with a comma-separated list of resource kinds (e.g., `"Queue,Topic"`), the controller will only create reconcilers for those specific resources. This enhancement maintains backward compatibility by continuing to reconcile all resources when the flag is not specified.
>
> https://github.com/aws-controllers-k8s/runtime/pull/176
>
> /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.