java-operator-sdk icon indicating copy to clipboard operation
java-operator-sdk copied to clipboard

Handle configuration errors for Operators

Open pvary opened this issue 3 years ago • 9 comments

Feature Request

Working on the Flink Kubernetes Operator (https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/), and we would like to extend the operator with probes detecting the health of the deployment.

We have observed cases recently where even though the operator pod was running, it could not do its job due to missing rolebindings or misconfigured dynamic namespaces.

What did you do?

Deployed the Flink Kubernetes Operator

What did you expect to see?

Currently they just see that the operator itself is running, but the deployments / jobs are not created as expected.

What did you see instead? Under which circumstances?

I would like to help the users detect the issue that the Flink Kubernetes Operator is not working correctly from status of the operator

Environment

Not sure about the exact environment ATM Will try to collect the info

Java operator version: 3.0.3

Possible Solution

We are thinking about implementing probes (liveliness/readiness/statup)

Additional context

pvary avatar Jul 14 '22 13:07 pvary

This is more like a feature request not a bug. Copy pasting the initial thoughts from discord regarding how to check access right and relations to liveness probes:

Startup probes would be useful in case for event sources would take long to sync.

Thinking how it would be possible to automatically check if the service account has correct access rights. Probably this API would be best to use: https://kubernetes.io/docs/reference/access-authn-authz/authorization/#checking-api-access (I don't have XP with it).

However automatically determine what resources and what verbs are utilized by an operators is not that simple. It might be doable automatically for managed dependent resources (from the annotation), see: https://javaoperatorsdk.io/docs/dependent-resources#managed-dependent-resources

Or for a lower layer, in case we implement event source definition with annotations, and make some assumptions about those resources: https://github.com/java-operator-sdk/java-operator-sdk/issues/1298 (pls let us know what do you think about this issue)

Maybe an other way would be just to have an annotation that provides this information explicitly.

if there are more approaches we could take here, so currently we don't run a servlet container, and don't even want to in the core (probably). But rather could provide methods to query the operator for it's state, maybe just a method to checkAccessPermissionForReconciler Which could be utility outside of the operator (for this case now I lean to this, after 3 minutes of thinking 🙂 ) - so no running operator required. An another approach is just simply don't run the operator, e.g. exit the process, and create and event that explains the problem (missing create permission for a deployment for the operator)

csviri avatar Jul 14 '22 14:07 csviri

Note that some of this is handled in the Quarkus extension, which provides support for MicroProfile Health support using https://github.com/smallrye/smallrye-health. There's also automated creation of RBACs.

metacosm avatar Jul 14 '22 17:07 metacosm

There's also automated creation of RBACs.

But RBAC is just for dependent resources, right?

csviri avatar Jul 14 '22 18:07 csviri

Maybe an other way would be just to have an annotation that provides this information explicitly.

This is how it is done in go, but we could be more intelligent about this, since there are the annotations. dependent resource, event sources?, and maybe add some dedicated.

csviri avatar Jul 14 '22 18:07 csviri

There's also automated creation of RBACs.

But RBAC is just for dependent resources, right?

No, we try to generate as much as possible. Dependent resources help but even without them we generate some of the RBACs.

metacosm avatar Jul 15 '22 07:07 metacosm

So I see here more issues:

  1. generating the RBAC
  2. checking if required access rights are present on startup
  3. integration with the probes.

csviri avatar Jul 15 '22 07:07 csviri

Problem with detecting access rights at startup is that it's equivalent to be able to generate the RBACs because you basically need to check that the operator has actually the rights to perform all the API calls it's doing and if you can do that, well, you probably can generate the RBACs as well… 😄

metacosm avatar Jul 15 '22 08:07 metacosm

Problem with detecting access rights at startup is that it's equivalent to be able to generate the RBACs because you basically need to check that the operator has actually the rights to perform all the API calls it's doing and if you can do that, well, you probably can generate the RBACs as well… smile

Yes, although, if generated not necessary means it is applied :)

But good question if both needed, or worth it to implement.

csviri avatar Jul 15 '22 08:07 csviri

The operator already fails on startup if it doesn't have access to the configured custom resources, I think this is almost good enough but we need some error hooks to react to any errors that happen later while the operator is running, or the namespaces are reconfigured etc.

That way it is straightforward to add a simple health check endpoint.

gyfora avatar Aug 15 '22 08:08 gyfora

Added this functionality in for v4.1: https://javaoperatorsdk.io/docs/patterns-best-practices#stopping-or-not-operator-in-case-of-informer-errors-and-cache-sync-timeouts

In #1594 a possibility is added to fine grain the liveness probe based on the health of the event sources / informers. These 2 should cover this issue too. We don't explicitly check the rbac, but based on issues with informers the behavior will be highly configurable. So after the PR is merged I intend to close this issue. So pls let me know if you see that there is additional functionality is needed.

csviri avatar Nov 08 '22 15:11 csviri

This is great, thank you @csviri

gyfora avatar Nov 08 '22 15:11 gyfora