
KOPF passive pod does not keep in-memory indexes


Long story short

We recently added peering to our Kopf operator (previously we ran only a single copy). We did this because the operator also functions as the webhook for a Mutating Admission Controller, and we want the service to be HA. Looking at the logs, it appears that the "inactive" Kopf pods do not create or maintain indexes, so if a call to the mutate webhook is routed to one of those pods, the index data is missing and the call fails.

Since all running Kopf pods answer webhook calls -- regardless of their active/passive peering status -- it seems they should all create and maintain the same indexes at all times.

Kopf version

1.33rc2

Kubernetes version

1.17.6

Python version

3.7.7

Code

import kopf
import yaml


@kopf.index(
    'configmaps',
    id='idx_ns_security_plane',
)
def index_security_configmaps(namespace, body, logger, **_):
    """Maintain an index of security-enabled namespaces, with cached data values."""
    values = body.get('data', {}).get('values.yaml', None)
    if values:
        return {namespace: yaml.safe_load(values)}
    return None


def filter_by_security_plane(namespace, name, idx_ns_security_plane, logger, **_):
    """Helper for change-handler filters: services in istio-enabled namespaces only."""
    # Filter out resources that are not in an injected namespace.
    # Uses the cached index data instead of calling the API every time.
    if namespace not in idx_ns_security_plane:
        logger.warning(f"Skipping {namespace}/{name} due to missing security configmap label")
        return False
    return True

@kopf.on.mutate(
    'deployment',
    when=filter_by_security_plane,
    param='mutate_deployment',
)
def mutate_deployment(**_):
    ...  # the handler body was omitted in the original report
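
For completeness: a Kopf-based admission webhook also needs a webhook server configured at startup. That part is not shown in this issue; a minimal sketch, assuming an in-cluster server with a Kopf-generated self-signed certificate, might look like this:

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Serve admission requests directly from the operator pod(s);
    # with no certificate given, Kopf generates a self-signed one.
    settings.admission.server = kopf.WebhookServer(port=9443)
    # Let Kopf manage the (Mutating/Validating)WebhookConfiguration resources.
    settings.admission.managed = 'auto.kopf.dev'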

Logs

[INFO    ] [kopf-test] Indexer 'idx_ns_security_plane' succeeded.
(this appears in the active pod's logs, but not in the passive pod's)

A call handled by the passive pod:
[kopf-test/deploy-demo] Skipping kopf-test/deploy-demo due to missing security configmap label

Additional information

No response

cjbaar commented on Aug 18, 2021

It's an interesting usage scenario. I had never thought of multiple webhook servers in combination with peering.

Technically, freezing is implemented by disconnecting from all API calls and streams entirely. Therefore, the frozen operator remains blind to changes. When it reconnects, it re-indexes all the objects as if it had just started.

The proposed feature — being permanently connected and aware of the changes, but not reacting to them — totally makes sense. However, it will require significant changes: shifting the point of ignoring the events from the stream to the event processor.

At the moment, I would say that such a scenario is impossible with the current Kopf implementation. As a quick and relatively simple work-around, you can simulate your own peering with your own CRD and keep both operator instances fully running at all times: add a timer/daemon that writes the operator's status to the CR, watch for events on that CR to know about the other operators (indexing might help), and add when= filters (or explicit if checks) to all other regular resources so that they only react when the current operator is the active one. A rough sketch follows below.
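
A sketch of that work-around (everything here is illustrative: the "operatorpeers" CRD, the POD_NAME identity, and the "first peer wins" rule are assumptions, not a tested recipe):

import datetime
import os

import kopf

# Assumes a custom "OperatorPeer" CRD (plural "operatorpeers") with one
# object per operator pod, and that each pod knows its own identity,
# e.g. via the downward-API POD_NAME environment variable.
ME = os.environ.get('POD_NAME', 'operator-0')


@kopf.timer('operatorpeers', interval=10, when=lambda name, **_: name == ME)
def heartbeat(patch, **_):
    # Periodically refresh this operator's own liveness record.
    patch['status'] = {'last-seen': datetime.datetime.utcnow().isoformat()}


@kopf.index('operatorpeers')
def peers(name, status, **_):
    # Keep an index of all peers and when they were last seen.
    return {name: status.get('last-seen')}


def i_am_active(peers: kopf.Index, **_):
    # A when= filter for the regular change handlers: only one pod reacts,
    # e.g. the lexicographically first live peer (staleness checks omitted).
    return bool(peers) and sorted(peers)[0] == ME


@kopf.on.create('services', when=i_am_active)
def on_service_created(**_):
    ...  # regular handlers react only in the "active" pod

The regular handlers get when=i_am_active, while the webhook handlers and the indexes stay unconditioned, so every pod keeps its indexes warm.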

I will put this request into the backlog to take a look at it later. There is one related feature in another issue, or, better said, a feature with a similar implementation: separating watching-/indexing-handlers from change-detection handlers, so that the former are triggered immediately and always, while the latter are triggered with batching and some delay (i.e. so that low-level events are not skipped due to the "batching", which is currently applied to everything). That part also affects how the streams & events are handled, and it requires some code to be shifted from the stream to the event processors.

nolar commented on Aug 20, 2021

Thanks for the update. Maybe I didn't understand the use case for peering. It was my impression that this was the reason peering was implemented: so that you could run multiple copies without all of them reacting to changes. Ideally, you would want any system -- but in particular the webhook -- to run as HA.

From what I could gather from the code, I did see that the entire operator is basically paused when it is not the "active" node, so I imagine it would be tricky to split out specific processes or types of processes.

For now, we have implemented a work-around which goes back to just making extra API calls instead of relying on the indexes. I am actually no longer involved in the project, so someone else might be jumping into this thread, but I will keep an eye on the outcome.
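
A simplified illustration of such a filter, assuming the official kubernetes Python client and in-cluster config (the actual code is not shown in this thread):

from kubernetes import client, config

config.load_incluster_config()  # assumption: the operator runs in-cluster

def filter_by_security_plane(namespace, name, logger, **_):
    # The same filter as before, but listing the ConfigMaps on every call
    # instead of relying on the (possibly missing) in-memory index.
    configmaps = client.CoreV1Api().list_namespaced_config_map(namespace)
    if not any('values.yaml' in (cm.data or {}) for cm in configmaps.items):
        logger.warning(f"Skipping {namespace}/{name} due to missing security configmap")
        return False
    return True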

cjbaar commented on Aug 20, 2021

so that you could run multiple copies without all of them reacting to changes

You understood it right — that is the goal. However, peering was implemented long before webhooks and was not taken into consideration when webhooks were added. So, webhooks brought a new kind of use-case — like the one described in this issue. I have to think about what a good solution could be here.

nolar commented on Aug 20, 2021