fdb-kubernetes-operator

Prevent fdbcli status drift (WIP)

Open manfontan opened this issue 3 years ago • 10 comments

Description

The update strategy may in some cases be too aggressive in deleting pods. This is because the fault tolerance check is not very restrictive, checking only the status JSON for availability. Adding additional safeguards, like checking the pod state, would make this verification more reliable, preventing the operator from replacing too many pods while others are still not running.
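As a rough sketch of the kind of pod state check meant here (the package name and the podsNotRunning helper are hypothetical, not the operator's actual API), the reconciler could refuse further deletions while any recreated pod is still not running:

package safeguards

import (
	corev1 "k8s.io/api/core/v1"
)

// podsNotRunning returns the names of pods whose phase is not Running,
// for example pods that were recreated but are still Pending because
// they cannot be scheduled.
func podsNotRunning(pods []corev1.Pod) []string {
	var notRunning []string
	for _, pod := range pods {
		if pod.Status.Phase != corev1.PodRunning {
			notRunning = append(notRunning, pod.Name)
		}
	}
	return notRunning
}

If podsNotRunning returns a non-empty list, the update reconciler would requeue instead of deleting more pods.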

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Discussion

The idea here is to compare the status JSON output with the cluster status, to make sure they have the same processes. If required, we could make a more fine-grained comparison, but this would be a first step. If any of the processes returned by the fdbcli status call don't have a matching pod, we will re-queue and try again later.
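A minimal sketch of that comparison, assuming the process group IDs parsed from the status JSON and the pods are keyed the same way (the helper name and parameters are hypothetical, not the operator's actual API):

package safeguards

// processesWithoutPods returns the process group IDs reported by the
// fdbcli status call that have no matching pod in the cluster. A
// non-empty result means the reconciler should requeue and try again later.
func processesWithoutPods(statusProcessGroupIDs []string, podsByProcessGroupID map[string]bool) []string {
	var missing []string
	for _, id := range statusProcessGroupIDs {
		if !podsByProcessGroupID[id] {
			missing = append(missing, id)
		}
	}
	return missing
}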

Testing

Please describe the tests that you ran to verify your changes. Unit tests? Manual testing?

Do we need to perform additional testing once this is merged, or perform in a larger testing environment?

Documentation

Did you update relevant documentation within this repository?

If this change is adding new functionality, do we need to describe it in our user manual?

If this change is adding or removing subreconcilers, have we updated the core technical design doc to reflect that?

If this change is adding new safety checks or new potential failure modes, have we documented them and how to debug potential issues?

Follow-up

Are there any follow-up issues that we should pursue in the future?

Does this introduce new defaults that we should re-evaluate in the future?

manfontan avatar May 30 '22 09:05 manfontan

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-kubernetes-operator-pr
  • Commit ID: bbc1e51c1276e215e656e64cafb1bb317e97e04a
  • Result: FAILED
  • Error: Error while executing command: docker push --quiet ${REGISTRY}/${OPERATOR_IMAGE}. Reason: exit status 125
  • Build Logs (available for 30 days)

foundationdb-ci avatar May 30 '22 09:05 foundationdb-ci

AWS CodeBuild CI Report for Linux CentOS 7

  • CodeBuild project: fdb-kubernetes-operator-pr
  • Commit ID: 9a20d81d6b1622a79522974d5fb1cb3b49701919
  • Result: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)

foundationdb-ci avatar Jun 02 '22 11:06 foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: ed31ddde86c42cbde89b74cf4382615acd93fab6
  • Duration 0:04:17
  • Result: :x: FAILED
  • Error: Error while executing command: docker build -t ${OPERATOR_IMAGE} -f Dockerfile .. Reason: exit status 2
  • Build Logs (available for 30 days)

foundationdb-ci avatar Jun 21 '22 16:06 foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: b3664a8984a311fd34b814d6cf98c1fc96f35353
  • Duration 1:33:50
  • Result: :x: FAILED
  • Error: Error while executing command: make -C tests -kj prOperator. Reason: exit status 2
  • Build Logs (available for 30 days)

foundationdb-ci avatar Jun 21 '22 18:06 foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: ce6b24d95e66c455d2d37ac48060233e8897f921
  • Duration 1:41:47
  • Result: :x: FAILED
  • Error: Error while executing command: make -C tests -kj prOperator. Reason: exit status 2
  • Build Logs (available for 30 days)

foundationdb-ci avatar Jun 21 '22 18:06 foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 6687dfcb75aa49c7268d9bb75c60cddb0427cc25
  • Duration 1:41:38
  • Result: :x: FAILED
  • Error: Error while executing command: make -C tests -kj prOperator. Reason: exit status 2
  • Build Logs (available for 30 days)

foundationdb-ci avatar Jun 21 '22 18:06 foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 1210d2e44c151c93d06373381cdf263282d16e10
  • Duration 0:51:47
  • Result: :x: FAILED
  • Error: Error while executing command: make -C tests -kj prOperator. Reason: exit status 2
  • Build Logs (available for 30 days)

foundationdb-ci avatar Jun 22 '22 10:06 foundationdb-ci

Hi @johscheuer

Please accept my apology for the delay in my response and thank you for all your insights. They are really helpful.

> If you could provide some logs or additional information that shows that this is a valid case, we might be able to help work on a potential solution.

I have reviewed the logs for the fdb-kubernetes-operator-controller-manager covering the time of the incident, but I cannot find any obvious issues. We are currently running a 1.4.x version of the operator, so fault tolerance check logging is not reported. I plan to upgrade to at least v1.5.0 to get fault tolerance check logs. Hopefully this will help.

I cannot provide them on GitHub, unfortunately. If there is a safe way of sharing the logs, that could be an option, but I would likely require approval to share them.

> Just out of curiosity, was there an incident/issue where you observed this behaviour, and are you able to share some information about it? Just to better understand the motivation behind this PR.

Yes. We have seen our number of replicas going down to 1 several times. Our expectation is that it should never go below 2, since we have triple replication in place. When the update process starts, we expect the fault tolerance to be reduced by 1; then the operator should block additional deletes until the fault tolerance is restored.

In summary, we will upgrade to v1.5 or later, and I will monitor the next upgrades to gather logs (I am currently looking at the controller manager logs) and metrics in order to get a better picture of what is causing this issue. If you have any suggestions, or if additional information is required to troubleshoot this issue, please let me know.

manfontan avatar Aug 03 '22 15:08 manfontan

Hi! I reproduced this issue with version 1.8.1 of the fdb-kubernetes-operator; see the operator logs in the drop-down below.

Setup

Spec:
  Database Configuration:
    Logs:                             5
    Proxies:                          3
    redundancy_mode:                  triple
    Storage:                          5
    storage_engine:                   ssd
  Minimum Uptime Seconds For Bounce:  600
  Process Counts:
    Log:        5
    Stateless:  3
    Storage:    5

Running Version:  6.3.12

To make sure that no recreated pods could be scheduled, I cordoned all the nodes (kubectl cordon <node>). I then triggered a rolling bounce pod update (by changing the sidecar image tag).

We can see that the reported fault tolerance from status/json is 2 just after two of the storage pods are detected as MissingProcesses (one of them also reporting PodPending). The Kubernetes operator ends up deleting and recreating three pods (left as Pending) before the reported fault tolerance from status/json goes below 2.

The end result is the cluster being unavailable, as a whole team of pods is not running (3 out of 5 storage pods are Pending).
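For illustration, the "Check desired fault tolerance" gate visible in the logs below appears to boil down to a comparison like this hypothetical sketch (not the operator's actual code); the problem is that status/json still reports 2 while the pods are already Pending, so the gate passes:

package safeguards

// deletionIsSafe mirrors the check visible in the logs: deletion is only
// allowed while the fault tolerance reported by status/json meets the
// expected value. Hypothetical helper and parameter names.
func deletionIsSafe(expectedFaultTolerance, maxZoneFailuresWithoutLosingData, maxZoneFailuresWithoutLosingAvailability int) bool {
	return maxZoneFailuresWithoutLosingData >= expectedFaultTolerance &&
		maxZoneFailuresWithoutLosingAvailability >= expectedFaultTolerance
}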

Operator Logs
{"level":"info","ts":1665001774.3598943,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"namespace","cluster":"foundationdb-cluster","subReconciler":"controllers.updateStatus"}
{"level":"info","ts":1665001774.360013,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"namespace","cluster":"foundationdb-cluster","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1665001774.4088979,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"namespace","cluster":"foundationdb-cluster","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1665001774.4784985,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-2","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001774.4785688,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-3","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001774.47859,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-5","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001774.4786031,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-6","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001774.4786162,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-9","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
...
{"level":"info","ts":1665001774.8717215,"logger":"controller","msg":"Deleting pods","namespace":"namespace","cluster":"foundationdb-cluster","reconciler":"updatePods","zone":"26081535-vmss0000yt","count":1,"deletionMode":"Zone"}
{"level":"info","ts":1665001774.9156864,"logger":"controller","msg":"Delaying requeue for sub-reconciler","namespace":"namespace","cluster":"foundationdb-cluster","subReconciler":"controllers.updatePods","message":"Pods need to be recreated","error":null}
{"level":"info","ts":1665001774.9158194,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"namespace","cluster":"foundationdb-cluster","subReconciler":"controllers.removeProcessGroups"}
{"level":"info","ts":1665001774.9160814,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"namespace","cluster":"foundationdb-cluster","subReconciler":"controllers.removeServices"}
{"level":"info","ts":1665001774.9162133,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"namespace","cluster":"foundationdb-cluster","subReconciler":"controllers.updateStatus"}
...
{"level":"info","ts":1665001774.9581664,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"namespace","cluster":"foundationdb-cluster","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1665001777.0301375,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-2","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001777.0301926,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-3","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001777.0302026,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-5","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001777.030211,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-6","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001777.0302234,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-9","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec","SidecarUnreachable"]}
...
{"level":"info","ts":1665001809.3765922,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-2","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001809.3766372,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-3","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001809.3766472,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-5","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001809.3766623,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-6","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001809.3766763,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-9","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec","SidecarUnreachable","MissingPod"]}
...
{"level":"info","ts":1665001831.3356743,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"namespace","cluster":"foundationdb-cluster","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1665001835.3488076,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"namespace","cluster":"foundationdb-cluster","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1665001835.3498478,"logger":"controller","msg":"Check desired fault tolerance","namespace":"namespace","cluster":"foundationdb-cluster","reconciler":"updatePods","expectedFaultTolerance":2,"maxZoneFailuresWithoutLosingData":2,"maxZoneFailuresWithoutLosingAvailability":2}
{"level":"info","ts":1665001835.3498828,"logger":"controller","msg":"Taking lock on cluster","namespace":"namespace","cluster":"foundationdb-cluster","action":"updating pods"}
{"level":"info","ts":1665001835.3498986,"logger":"controller","msg":"Deleting pods","namespace":"namespace","cluster":"foundationdb-cluster","reconciler":"updatePods","zone":"26081535-vmss0000y4","count":1,"deletionMode":"Zone"}
...
{"level":"info","ts":1665001847.4786723,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-2","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec","MissingProcesses"]}
{"level":"info","ts":1665001847.4787035,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-3","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001847.4787333,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-5","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001847.478752,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-6","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001847.4787745,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-9","state":"HasUnhealthyProcess","conditions":["SidecarUnreachable","MissingProcesses","PodPending"]}
...
{"level":"info","ts":1665001868.457372,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"namespace","cluster":"foundationdb-cluster","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1665001872.4875183,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"namespace","cluster":"foundationdb-cluster","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1665001872.4885874,"logger":"controller","msg":"Check desired fault tolerance","namespace":"namespace","cluster":"foundationdb-cluster","reconciler":"updatePods","expectedFaultTolerance":2,"maxZoneFailuresWithoutLosingData":2,"maxZoneFailuresWithoutLosingAvailability":2}
{"level":"info","ts":1665001872.4886274,"logger":"controller","msg":"Taking lock on cluster","namespace":"namespace","cluster":"foundationdb-cluster","action":"updating pods"}
{"level":"info","ts":1665001872.4886386,"logger":"controller","msg":"Deleting pods","namespace":"namespace","cluster":"foundationdb-cluster","reconciler":"updatePods","zone":"26081535-vmss0000y5","count":1,"deletionMode":"Zone"}
...
{"level":"info","ts":1665001878.5913925,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-2","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec","MissingProcesses","MissingPod"]}
{"level":"info","ts":1665001878.5913997,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-3","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec","IncorrectConfigMap"]}
{"level":"info","ts":1665001878.5914106,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-5","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec","IncorrectConfigMap"]}
{"level":"info","ts":1665001878.5914423,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-6","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec","SidecarUnreachable","IncorrectConfigMap"]}
{"level":"info","ts":1665001878.591463,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-9","state":"HasUnhealthyProcess","conditions":["SidecarUnreachable","MissingProcesses","PodPending","IncorrectConfigMap"]}
...
{"level":"info","ts":1665001923.0192153,"logger":"controller","msg":"Check desired fault tolerance","namespace":"namespace","cluster":"foundationdb-cluster","reconciler":"updatePods","expectedFaultTolerance":2,"maxZoneFailuresWithoutLosingData":1,"maxZoneFailuresWithoutLosingAvailability":1}
{"level":"info","ts":1665001923.0193546,"logger":"controller","msg":"Reconciliation terminated early","namespace":"namespace","cluster":"foundationdb-cluster","subReconciler":"controllers.updatePods","requeueAfter":15,"message":"Reconciliation requires deleting pods, but deletion is currently not safe"}
...
{"level":"info","ts":1665001890.7436044,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-2","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec","MissingProcesses","MissingPod"]}
{"level":"info","ts":1665001890.7436225,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-3","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec","IncorrectConfigMap"]}
{"level":"info","ts":1665001890.7436473,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-5","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec","IncorrectConfigMap"]}
{"level":"info","ts":1665001890.7436566,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-6","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec","SidecarUnreachable","IncorrectConfigMap"]}
{"level":"info","ts":1665001890.7436712,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-9","state":"HasUnhealthyProcess","conditions":["SidecarUnreachable","MissingProcesses","PodPending","IncorrectConfigMap"]}
...
{"level":"info","ts":1665001951.2949114,"logger":"controller","msg":"Check desired fault tolerance","namespace":"namespace","cluster":"foundationdb-cluster","reconciler":"updatePods","expectedFaultTolerance":2,"maxZoneFailuresWithoutLosingData":0,"maxZoneFailuresWithoutLosingAvailability":0}
{"level":"info","ts":1665001951.2950678,"logger":"controller","msg":"Reconciliation terminated early","namespace":"namespace","cluster":"foundationdb-cluster","subReconciler":"controllers.updatePods","requeueAfter":15,"message":"Reconciliation requires deleting pods, but deletion is currently not safe"}
...
{"level":"info","ts":1665001955.387233,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-2","state":"HasUnhealthyProcess","conditions":["MissingProcesses","PodPending"]}
{"level":"info","ts":1665001955.3872955,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-3","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001955.3873107,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-5","state":"HasUnhealthyProcess","conditions":["IncorrectPodSpec"]}
{"level":"info","ts":1665001955.3873217,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-6","state":"HasUnhealthyProcess","conditions":["SidecarUnreachable","MissingProcesses","PodPending"]}
{"level":"info","ts":1665001955.387333,"logger":"controller","msg":"Has unhealthy process group","method":"CheckReconciliation","namespace":"namespace","cluster":"foundationdb-cluster","processGroupID":"storage-9","state":"HasUnhealthyProcess","conditions":["SidecarUnreachable","MissingProcesses","PodPending","IncorrectConfigMap"]}
...
{"level":"info","ts":1665001983.75503,"logger":"controller","msg":"Check desired fault tolerance","namespace":"namespace","cluster":"foundationdb-cluster","reconciler":"updatePods","expectedFaultTolerance":2,"maxZoneFailuresWithoutLosingData":-1,"maxZoneFailuresWithoutLosingAvailability":-1}
{"level":"info","ts":1665001983.7551894,"logger":"controller","msg":"Reconciliation terminated early","namespace":"namespace","cluster":"foundationdb-cluster","subReconciler":"controllers.updatePods","requeueAfter":15,"message":"Reconciliation requires deleting pods, but deletion is currently not safe"}

simenl avatar Oct 06 '22 11:10 simenl

> Hi! I reproduced this issue with version 1.8.1 of the fdb-kubernetes-operator; see the operator logs in the drop-down below. […]

Thanks for reporting! I'll take a look and come back to this next week.

johscheuer avatar Oct 07 '22 15:10 johscheuer

🤔 Not sure if this could be part of the problem. ~But I have realised that we are using FoundationDBCluster API version v1beta1~ Actually, we have both CRDs. Maybe cleaning up, using only v1beta2, and enabling/configuring automatic replacements may help?

manfontan avatar Oct 21 '22 14:10 manfontan

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: d3fc5f9e0481dcb371c10fdef193a8e7babe39c3
  • Duration 4:09:20
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)
  • Build Artifact (available for 30 days)

foundationdb-ci avatar Nov 23 '22 14:11 foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 53ac6905a60dac29982e94f4a7fda12e53fb5782
  • Duration 4:09:06
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)
  • Build Artifact (available for 30 days)

foundationdb-ci avatar Nov 23 '22 14:11 foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: d2713454916965aaac48162c6fdf4e359313dbe2
  • Duration 4:09:10
  • Result: :white_check_mark: SUCCEEDED
  • Error: N/A
  • Build Logs (available for 30 days)
  • Build Artifact (available for 30 days)

foundationdb-ci avatar Nov 25 '22 15:11 foundationdb-ci

I am going to close this PR since #1444 will address this issue.

manfontan avatar Mar 09 '23 13:03 manfontan