strimzi-kafka-operator icon indicating copy to clipboard operation
strimzi-kafka-operator copied to clipboard

ClusterRoleBindings for `kafka-init` to `strimzi-kafka-broker` role may not be removed

Open MikeEdgar opened this issue 2 years ago • 16 comments

Describe the bug When a Kafka has been deleted during reconciliation, the delete reconciliation event may time-out before completion of an in-process add/update. As a result, the delete logic ([1] and [2]) may not run, leaving behind the CRB for kafka-init.

To Reproduce Steps to reproduce the behavior:

  1. Create a Kafka resource
  2. After the reconciliation has started and the kafka-init CRB has been created, delete the Kafka resource
  3. Updating the (deleted) resource status will fail and the kafka-init CRB remains

Expected behavior Removal of a Kafka resource removes all resources associated with the Kafka, include any cluster-level resources.

Environment (please complete the following information):

  • Strimzi version: 0.26.0
  • Installation method: OLM bundle
  • Kubernetes cluster: OpenShift 4.9
  • Infrastructure: OSD

YAML files and logs

Will be provided

Additional context [1] https://github.com/strimzi/strimzi-kafka-operator/blob/e0ed9de80a23f0df577f5db18eb319360b1e361f/operator-common/src/main/java/io/strimzi/operator/common/AbstractOperator.java#L261-L272 [2] https://github.com/strimzi/strimzi-kafka-operator/blob/e0ed9de80a23f0df577f5db18eb319360b1e361f/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaAssemblyOperator.java#L660-L664

MikeEdgar avatar Jun 22 '22 16:06 MikeEdgar

This is pretty much intentional - or maybe not intentional but known. For global scoped resources, garbage collection cannot be used (because owner reference s allowed only within the same namespace). So Kubernetes will not delete it. And deleting it in 100% cases would be fairly complicated and troublesome. So the current state when it can in some situations leak is based on a informed decision.

scholzj avatar Jun 22 '22 17:06 scholzj

Understood about the ownership of global resources. I was looking to explore the additional complication for removing it in cases like this. Is the thinking that whatever is deleting the Kafka resource should be aware of this and remove the CRB itself? I'm not opposed to something like that, but if a nice solution can be in the Kafka operator somehow, it would certainly be preferred.

MikeEdgar avatar Jun 22 '22 17:06 MikeEdgar

Well, you have in general two options:

  • Use finalizers to block the deletion and delete everything properly when it comes to it. Finalizers are a bit dangerous and can easily screw up things.
  • Monitor the resources in the reconciliation and delete them later even if the initial deletion failed.

Both seemed like a relatively lot of effort given the actual value it would bring.

scholzj avatar Jun 22 '22 19:06 scholzj

What are your thoughts about somehow prioritizing DELETED events in general? For example, a reconciliation for the DELETED action might wait for the reconciliation lock without a timeout (or a longer timeout) since it's a terminal event and a subsequent event for the resource would never be received. It certainly wouldn't cover 100% of any potential leakage, but may reduce it significantly enough to be worthwhile.

MikeEdgar avatar Jun 23 '22 18:06 MikeEdgar

My thoughts would be that its not easy to achieve given the current design - and since it anyway does not help on 100%, it is questionable what the value is.

scholzj avatar Jun 23 '22 18:06 scholzj

I was thinking that it would work to add an OptionalLong lockTimeout to the Reconciliation object which would be set to some higher value for DELETED events. DELETED would move to its own case in OperatorWatcher. Then AbstractOperator#withLock would use reconciliation.lockTimeout().orElse(lockTimeoutMs) to acquire the lock.

The benefit is that it keeps these cleanup events in the scope of the operator for most scenarios. The one I can think of that would be an issue is an operator restart while the DELETED event is waiting. While not 100%, it increases the likelihood of success I think. If it seems reasonable, I'd be glad to put together a PR.

MikeEdgar avatar Jul 08 '22 10:07 MikeEdgar

Another "best effort" would be to set the Namespace of the Kafka instance in the ownerReferences of the cluster role binding. That would allow it to be garbage collected when the namespace is removed.

MikeEdgar avatar Aug 02 '22 12:08 MikeEdgar

I guess that would be an option. But I guess we would need to read the namespaces to get the namespace UID for the owner reference. o we would need additional global scoped RBAC which means a lot of complications in installation and lot of questions about why is it needed and if it can be disabled.

scholzj avatar Aug 02 '22 13:08 scholzj

we would need to read the namespaces to get the namespace UID for the owner reference. o we would need additional global scoped RBAC which means a lot of complications in installation and lot of questions about why is it needed and if it can be disabled.

Sure, I can appreciate that. I was looking at using the CRB template, but it seems the additional metadata properties are not mapped to the created CRB. Was it left out intentionally or could that be added?

MikeEdgar avatar Aug 02 '22 14:08 MikeEdgar

Sorry, not sure WDYM with CRB templae or which metadata properties are missing. Can you provide more details?

scholzj avatar Aug 02 '22 14:08 scholzj

In the KafkaClusterTemplate there is a ResourceTemplate for CRB creation. https://github.com/strimzi/strimzi-kafka-operator/blob/db412a742d96e2e0709e16abec097bfba7fa06b0/api/src/main/java/io/strimzi/api/kafka/model/template/KafkaClusterTemplate.java#L51

ResourceTemplate and the contained MetadataTemplate both contain a Map of additional properties, but currently only the annotations and labels are retrieved from the custom resource: https://github.com/strimzi/strimzi-kafka-operator/blob/db412a742d96e2e0709e16abec097bfba7fa06b0/cluster-operator/src/main/java/io/strimzi/operator/cluster/model/KafkaCluster.java#L435-L438

Then, they're mapped to the CRB: https://github.com/strimzi/strimzi-kafka-operator/blob/db412a742d96e2e0709e16abec097bfba7fa06b0/cluster-operator/src/main/java/io/strimzi/operator/cluster/model/AbstractModel.java#L1708-L1718

I'm suggesting that the additional properties be mapped in the same way. The CRD also may need to be updated to preserve them. I suppose it would be best to follow the same pattern for the other templates that map labels/annotations.

MikeEdgar avatar Aug 02 '22 15:08 MikeEdgar

The additionalProperties field is there when decoding a JSON / YAML which contains some unknown fields. It is not something what you can use to set some additional properties. It is basically just an internal trashcan where any insupported field will be end up and will be ignored. They work exactly as they were designed to work.

scholzj avatar Aug 02 '22 16:08 scholzj

They are serialized as well and actually take precedence over same-named fields in the resource. It seems on the surface like a convenient vehicle for a client to pass arbitrary metadata, with the understanding/warning that they take full responsibility for what they're doing...

MikeEdgar avatar Aug 02 '22 18:08 MikeEdgar

Triaged on 18.8.2022: This is a valid issue. We just need to find a solution which does not cause more problems then the initial issue.

scholzj avatar Aug 18 '22 14:08 scholzj

Thank you. Perhaps another option along the lines of my last comments may be to add a List<OwnerReference> to MetadataTemplate. That would allow clients to more cleanly set the namespace (or other cluster-scoped resource) as the owner of the CRB as necessary.

Also, I do think there is still value in discussing if/how the "delete" event can always be processed. Particularly for things like this that can't/won't be handled by GC.

MikeEdgar avatar Aug 18 '22 15:08 MikeEdgar

Well, TBH, we triaged the original issue about the deletion of the bindings. Not any enhancements to the template fields. I do not think we plan any owner reference configuration there.

scholzj avatar Aug 18 '22 15:08 scholzj