fdb-kubernetes-operator
Missing processes cause the FoundationDBClusterStatus to be out of sync with actual cluster status
What happened?
In our k8s cluster, we sometimes have processes / nodes that are killed by other systems running in the cluster. When this happens, the node can disappear from the cluster. Sometimes, this seems to result in the process staying in the FDBClusterStatus, despite no longer being part of the cluster.
We see these log lines:
skip updating fault domain for process group with missing process in FoundationDB cluster status
from the updateStatus reconciler, and the processGroupID is one that no longer exists in the cluster. This then causes problems in certain operations, such as updating pods, because some of the reconcilers seem to iterate over the processes from the FDBClusterStatus and try to fetch their details from k8s, but then they cannot find that pod.
We encountered this on operator version 2.3.0.
What did you expect to happen?
I would expect processes that are no longer reported in the machine-readable status to be removed from the FDBClusterStatus.
How can we reproduce it (as minimally and precisely as possible)?
I'm not totally sure because I don't have an exact reproduction, but I think you can just delete a pod or node from k8s while the cluster is running.
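Something along these lines should do it (names are placeholders, pick any pod or node that currently backs the cluster):

```sh
# Delete a pod backing the cluster while the cluster is running (placeholder name):
kubectl delete pod foundationdb-cluster-main-storage-1

# or take out the whole node the pod was scheduled on:
kubectl delete node <node-name>
```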
Anything else we need to know?
No response
FDB Kubernetes operator
FDB version: 7.1.67
Operator version: v2.3.0
Kubernetes version
$ kubectl version
Client Version: v1.32.1
Kustomize Version: v5.5.0
Server Version: v1.31.601
Cloud provider
Those are different concepts. The ProcessGroup in the FoundationDBClusterStatus represents the logical container for the Kubernetes resources that will be created, e.g. a Pod, a PVC (if stateful), a service (if configured). The information will be present until the process group is removed (not the Pod). The goal of the process group is to keep information about the desired processes that should be running. If the operator detects that some configuration for the process group is missing, it will take the required steps to ensure the process group eventually reaches the desired state again. If the underlying Pod is deleted, the operator will recreate the Pod with the same configuration; once the Pod is scheduled again, the fdbserver processes should be running again.

The log line that you pasted is only a warning and shouldn't cause any issues. Basically it only tells you that the fault domain will not be updated, as the associated fdbserver process is not running (and therefore the fault domain cannot be detected).
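To make that a bit more concrete, a single entry under `status.processGroups` looks roughly like the sketch below (field and condition names are approximate, check the CRD for the exact schema). The entry stays in the status even while the backing Pod is gone; it only picks up additional conditions:

```yaml
# Rough sketch of one process group entry in the FoundationDBCluster status.
# Field and condition names are approximate.
status:
  processGroups:
    - processGroupID: foundationdb-cluster-main-log-10058
      processClass: log
      addresses:
        - 10.1.2.3
      faultDomain: worker-node-a   # stays at the last known value while the process is missing
      processGroupConditions:
        - processGroupConditionType: MissingProcesses
          timestamp: 1700000000
```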
> This then causes problems in certain operations, such as updating pods, because some of the reconcilers seem to iterate over the processes from the FDBClusterStatus and try to fetch their details from k8s, but then they cannot find that pod.
Can you describe the problems in more detail? All operations should handle those situations (if not, there is a bug). It sometimes might delay an operation, because we have some safeguards that wait for some duration before detecting the process as "down", to prevent issues with short network partitions.
edit: Some additional information can be found in this design: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/design/process_group_crd.md (not yet implemented). This issue is a good reminder to work on https://github.com/FoundationDB/fdb-kubernetes-operator/issues/1634 :)
Maybe to make things easier to understand: you can think of the process groups in the FoundationDBClusterStatus as some kind of StatefulSet with some special information for the FDB operator (just that this information is not a dedicated CRD/resource and is embedded in the status).
Thanks for the clarification, let me try to elaborate on what we think happened. We are still trying to piece together the timeline and are unsure if there are other things wrong with our cluster setup. This is a test cluster, so it's not so bad to lose it, but we want to understand what is happening so that we know how to troubleshoot in a real situation.
Some background context:
- We are attempting a 7.1 upgrade to 7.3
- We have something running in our k8s cluster that randomly kills nodes (and therefore all the pods on that node), so it is up to services to be resilient to this and restart their pods
- At the time, we had configured the `FoundationDBCluster` to have `automationOptions.replacements.enabled` set to `true`, but we didn't set any other properties in the `replacements` hash (e.g. `maxConcurrentReplacements`). I'm not sure if this is relevant. (A sketch of this configuration is shown right after this list.)
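For reference, the relevant part of our spec at the time looked roughly like this (a minimal sketch; the cluster name is a placeholder, only the `replacements` block matters here):

```yaml
# Minimal sketch of the relevant piece of our FoundationDBCluster resource at the time.
# The cluster name is a placeholder; the version is the one from this issue.
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: foundationdb-cluster-main
spec:
  version: 7.1.67
  automationOptions:
    replacements:
      enabled: true
      # maxConcurrentReplacements and the other replacement settings were left unset
```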
The overall sequence of events was this:
- We started the upgrade process by bumping the versions in the FoundationDBCluster resource and applying them to k8s (via a helm chart)
- The operator detected that some clients were not compatible with the newer version, so it stopped the reconciliation loop. At this point, we left the cluster in that state for a few days while we worked on moving the clients to the multi-version client
- When we came back, several of the nodes were gone from the cluster, but it seems that the operator did not recreate them. I am not sure if this is because it was stuck in the upgrade reconciliation loop, or if it was because we didn't have the replacements setting configured correctly.
  - We could see that the status was reporting that a couple of the coordinators were unreachable. After a while, the cluster got to a point where a quorum of the coordinators was gone.
- We rolled the upgrade back by reverting the FoundationDBCluster version. This had the effect of creating a bunch of new pods, and at this point it seemed that the operator could bring the cluster back with new coordinators
- Once the cluster was healthy again, we attempted the upgrade again, and this is when we encountered the log lines indicating that it couldn't find the pod for that ProcessGroup. This also manifested as this log line from `addPods` on v2.3.0 of the operator when it tried to upgrade:
Pod "foundationdb-cluster-main-log-10058" is invalid: spec.containers[1].image: Required value
I think what happened here is that it tries to fetch the current pod spec for the ProcessGroup and then tries to update it, but the pod no longer exists, so you don't have an image (or any other value).
We also tried upgrading again using v2.6.0 of the operator and adding in the `maxConcurrentReplacements` property (I couldn't find a default value, so I assume it was 0, thereby disabling replacements). This time, we got logs like `Could not find Pod for process group ID` from `replaceMisconfiguredProcessGroups` and `Detected replace process group but cannot replace it because we hit the replacement limit` from `replaceFailedProcessGroups`. So I assume that it's now trying to replace all these missing pods.
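For what it's worth, setting the limit explicitly removes the guessing about the default (a sketch; the value 5 is arbitrary):

```yaml
# Sketch: setting the replacement limit explicitly instead of relying on the default.
spec:
  automationOptions:
    replacements:
      enabled: true
      maxConcurrentReplacements: 5  # arbitrary value, pick something that fits the cluster size
```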
I think we got to the bottom of this -- the way our deployments work caused the pod spec to change every time, because we have a version number in the pod annotations and we had a sidecar in the pod spec whose version also changed with every deployment. When this happens during an upgrade, the new pods are created on the new version of FDB, which is protocol-incompatible with the currently running cluster, so these new processes appear to the operator to be missing.
The solution here was to make sure that nothing in the pod spec changes when we do a version upgrade, then we were able to successfully upgrade from 7.1 to 7.3 using v2.8.0 of the operator.
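To make the failure mode concrete, the churn in our pod template looked roughly like the sketch below (all annotation keys, sidecar names, and image values are placeholders for our own setup, not anything required by the operator). The point is that both values changed on every helm deploy, so an FDB version bump also looked like a pod spec change:

```yaml
# Sketch of the per-deployment churn in our pod template (placeholder names/values).
# Because both the annotation and the sidecar tag changed on every deploy, the operator saw a
# pod spec change at the same time as the FDB version change and created new pods on 7.3
# while the cluster was still running 7.1.
spec:
  processes:
    general:
      podTemplate:
        metadata:
          annotations:
            example.com/release: "2024-06-12.418"                 # bumped on every deploy
        spec:
          containers:
            - name: metrics-sidecar                               # our own sidecar, placeholder name
              image: registry.example.com/metrics-sidecar:1.42.0  # tag bumped on every deploy
```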
I'll reopen the issue and mark it as documentation. Thanks for getting back on this. This is something we should be documenting and eventually see if we can fix the behaviour.