java-operator-sdk
java-operator-sdk copied to clipboard
Handle configuration errors for Operators
Feature Request
Working on the Flink Kubernetes Operator (https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/), and we would like to extend the operator with probes detecting the health of the deployment.
We have observed cases recently where even though the operator pod was running, it could not do its job due to missing rolebindings or misconfigured dynamic namespaces.
What did you do?
Deployed the Flink Kubernetes Operator
What did you expect to see?
Currently they just see that the operator itself is running, but the deployments / jobs are not created as expected.
What did you see instead? Under which circumstances?
I would like to help the users detect the issue that the Flink Kubernetes Operator is not working correctly from status of the operator
Environment
Not sure about the exact environment ATM Will try to collect the info
Java operator version: 3.0.3
Possible Solution
We are thinking about implementing probes (liveliness/readiness/statup)
Additional context
This is more like a feature request not a bug. Copy pasting the initial thoughts from discord regarding how to check access right and relations to liveness probes:
Startup probes would be useful in case for event sources would take long to sync.
Thinking how it would be possible to automatically check if the service account has correct access rights. Probably this API would be best to use: https://kubernetes.io/docs/reference/access-authn-authz/authorization/#checking-api-access (I don't have XP with it).
However automatically determine what resources and what verbs are utilized by an operators is not that simple. It might be doable automatically for managed dependent resources (from the annotation), see: https://javaoperatorsdk.io/docs/dependent-resources#managed-dependent-resources
Or for a lower layer, in case we implement event source definition with annotations, and make some assumptions about those resources: https://github.com/java-operator-sdk/java-operator-sdk/issues/1298 (pls let us know what do you think about this issue)
Maybe an other way would be just to have an annotation that provides this information explicitly.
if there are more approaches we could take here, so currently we don't run a servlet container, and don't even want to in the core (probably). But rather could provide methods to query the operator for it's state, maybe just a method to checkAccessPermissionForReconciler Which could be utility outside of the operator (for this case now I lean to this, after 3 minutes of thinking 🙂 ) - so no running operator required. An another approach is just simply don't run the operator, e.g. exit the process, and create and event that explains the problem (missing create permission for a deployment for the operator)
Note that some of this is handled in the Quarkus extension, which provides support for MicroProfile Health support using https://github.com/smallrye/smallrye-health. There's also automated creation of RBACs.
There's also automated creation of RBACs.
But RBAC is just for dependent resources, right?
Maybe an other way would be just to have an annotation that provides this information explicitly.
This is how it is done in go, but we could be more intelligent about this, since there are the annotations. dependent resource, event sources?, and maybe add some dedicated.
There's also automated creation of RBACs.
But RBAC is just for
dependent resources, right?
No, we try to generate as much as possible. Dependent resources help but even without them we generate some of the RBACs.
So I see here more issues:
- generating the RBAC
- checking if required access rights are present on startup
- integration with the probes.
Problem with detecting access rights at startup is that it's equivalent to be able to generate the RBACs because you basically need to check that the operator has actually the rights to perform all the API calls it's doing and if you can do that, well, you probably can generate the RBACs as well… 😄
Problem with detecting access rights at startup is that it's equivalent to be able to generate the RBACs because you basically need to check that the operator has actually the rights to perform all the API calls it's doing and if you can do that, well, you probably can generate the RBACs as well… smile
Yes, although, if generated not necessary means it is applied :)
But good question if both needed, or worth it to implement.
The operator already fails on startup if it doesn't have access to the configured custom resources, I think this is almost good enough but we need some error hooks to react to any errors that happen later while the operator is running, or the namespaces are reconfigured etc.
That way it is straightforward to add a simple health check endpoint.
Added this functionality in for v4.1: https://javaoperatorsdk.io/docs/patterns-best-practices#stopping-or-not-operator-in-case-of-informer-errors-and-cache-sync-timeouts
In #1594 a possibility is added to fine grain the liveness probe based on the health of the event sources / informers. These 2 should cover this issue too. We don't explicitly check the rbac, but based on issues with informers the behavior will be highly configurable. So after the PR is merged I intend to close this issue. So pls let me know if you see that there is additional functionality is needed.
This is great, thank you @csviri