Old primary blocking Kubegres from proceeding
I have a cluster with Kubegres deployed. We have a primary and 2 replicas. Unfortunately our network is rather bad, so we see the following problem a lot, and it takes a decent amount of manual intervention to fix (I can fix it via methods not specified here). I'd like to know the right way, or an easier way, to fix it though. Unfortunately I cannot provide logs.
Scenario: either there is a network or an NFS outage, the primary fails, and the outage continues for a while. The primary dies in some capacity, the database rolls over a few times, and eventually we wind up in a state where we have a dead primary complaining about a bad timeline segment and a replica that is still available. I can confirm the replica has sufficient data, so I'd prefer to just forget about the old primary and simply make the replica the new primary; I don't care if there is minimal data loss.
Currently in this sort of scenario, Kubegres logs that it has basically gone hands-off on the cluster "until we fix it manually", which, I guess, is fine. The steps I take to attempt to restore the replica are (a rough sketch of the commands follows the list):
- Promote it using pg_ctl
- Set the promotePod in the kubegres config
- Label the statefulset and the pod to be primary
- Delete the statefulset and backing storage of the old failing primary instance
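For illustration, here is a minimal sketch of those steps as commands. The resource name and namespace are borrowed from the example later in this thread, and the pod/StatefulSet names, PGDATA path, replicationRole label, and PVC name pattern are assumptions about a default Kubegres setup, so they may differ in your cluster:

# 1. Promote the surviving replica inside its pod (PGDATA path is an assumption)
kubectl exec -it postgresql-2-0 -n postgres-staging -- \
  su postgres -c "pg_ctl promote -D /var/lib/postgresql/data/pgdata"

# 2. Point the Kubegres resource at the pod to promote
kubectl patch kubegres postgresql -n postgres-staging --type merge \
  -p '{"spec":{"failover":{"promotePod":"postgresql-2-0"}}}'

# 3. Relabel the surviving StatefulSet and pod as primary (label name is an assumption)
kubectl label statefulset postgresql-2 -n postgres-staging replicationRole=primary --overwrite
kubectl label pod postgresql-2-0 -n postgres-staging replicationRole=primary --overwrite

# 4. Remove the dead primary and its storage (index 1 is just an example;
#    check the actual PVC name with 'kubectl get pvc' first)
kubectl delete statefulset postgresql-1 -n postgres-staging
kubectl delete pvc postgres-db-postgresql-1-0 -n postgres-staging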
At this point I'd expect Kubegres to simply take over, use the replica as the new primary, and create 2 new replicas. It doesn't do that. Instead it keeps complaining about the dead primary, which is not even in Kubernetes anymore, and completely ignores the one I promoted and labeled.
So I have two questions:
- What is the correct way to resolve this scenario with Kubegres, so that I am working with Kubegres and not against it?
- Is there an easy way to tell Kubegres via the promotePod setting "I'm the boss, I said promote this pod, ignore the other pods, use this one and move on with life"? Something like a forcePromote: true? Or a way to redeploy and simply tell it to use an old PVC/PV that I know of for the first instance?
We do have the same problem. We are running an HA setup with Postgres (3 pods) and one of them just got killed tonight. We have one pod in the "ready" state and another one which is dead too. We added spec.failover.promotePod and expected Kubegres to fail over to the last working pod. That's not happening because the controller keeps reading the last status from the CustomResource, and that status is a blockingOperation. It seems like the operator is blocked by this status and cannot move forward.
Did you find a solution for this problem or did you just throw away the cluster and restored it from a backup?
We solved the problem in a non-operator-friendly way, but it's working.
As I mentioned earlier, we had a blockingOperation in the CustomResource which the operator kept complaining about, so the only logical way forward was to remove this blockingOperation. In our case the blocking pod did not exist anymore. Obviously you should not edit the status field, but in this case it was necessary.
We scaled the deployment of the controller down to 0 and patched the CustomResource to remove the keys status.previousBlockingOperation and status.blockingOperation.
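For reference, stopping the operator looks roughly like this; the deployment name and namespace assume a default Kubegres install and may differ:

# stop the operator so it does not overwrite the status while we patch it
kubectl scale deployment kubegres-controller-manager -n kubegres-system --replicas=0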
kubectl patch Kubegres postgresql -n postgres-staging \
--type json \
-p '[{"op": "remove", "path": "/status/blockingOperation"}, {"op": "remove", "path": "/status/previousBlockingOperation"}]' \
--subresource status
kubegres.kubegres.reactive-tech.io/postgresql patched
After this we scaled the deployment back up to 1. The controller then restarted the last remaining pod and labeled it as primary, started the other two replicas again and seeded them from the primary. The database is now running as expected.
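The scale-up and a quick check of which pod ended up as primary can be done like this; again the deployment name and namespace assume a default install, and the replicationRole label name is an assumption about how Kubegres labels its pods:

kubectl scale deployment kubegres-controller-manager -n kubegres-system --replicas=1
kubectl get pods -n postgres-staging -L replicationRole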
One important note: I am using the GUI "Lens". Editing the CustomResource via Lens did not work for me. It showed the green banner, but the status went back to the original value, even with the controller turned off. Not sure how that happened :D Probably just a Lens caching error. But changing the CustomResource via kubectl worked.
As I said, this is not really operator-friendly because you are touching areas you are not supposed to touch, but it seems to be the only way to fix the issue here. A function in the operator to re-evaluate the blocking state, or a flag to override it for scenarios like this, would be helpful.
Thank you for sharing how you solved this issue. If you had to implement a solution for this, what would you suggest in detail?
For example, you mentioned: "A function in the operator to evaluate the blocking state or a flag to override the state for scenarios like this would be helpful."
I am happy to implement a long-term solution and would be interested in how we could achieve it, and in whether there is a way to reproduce this issue consistently so that I could write a test.
The problem is I don't really know the inner workings of the operator, so I cannot make any good suggestions without potentially breaking other parts.
The error here was that two replicas failed and one was still running. So the logical step would be to promote the running replica to primary. But the controller couldn't do this because the CustomResource of the cluster had the blocking status active. As mentioned, we even tried to use the manual failover to the working instance, but this was not possible because of the blocking status.
The controller said in the logs that a manual fix is required because the target replica, which was written into the status and should be promoted to primary, wasn't there anymore. One way to fix this could be to let spec.failover.promotePod override the blocking operation: if you manually adjust the spec and set a new replica there, that should take precedence over the blocking state.
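Concretely, this is the kind of change we made in the spec (the pod name is just an example for the surviving replica):

spec:
  failover:
    promotePod: postgresql-13-0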
Hope that helps.
Thank you for your suggestion. I can make a code change so that spec.failover.promotePod will cancel the override mode and the provided Pod id will be promoted.
You mentioned that the error happened because two replicas failed while one was still running. Is there an easy way to reproduce this?
No, unfortunately not. Originally we had 3 replicas. One pod was gone, one replica was still running but was not promoted to primary due to the issue, and one other was in a restart loop with the following logs:
This was the CustomResource:
apiVersion: kubegres.reactive-tech.io/v1
kind: Kubegres
metadata:
  creationTimestamp: '2024-03-06T14:40:37Z'
  generation: 2
  name: postgresql
  namespace: postgres-staging
  resourceVersion: '279992767'
  uid: 92036869-ee16-4671-9c22-708207321c19
  selfLink: >-
    /apis/kubegres.reactive-tech.io/v1/namespaces/postgres-staging/kubegres/postgresql
status:
  blockingOperation:
    operationId: Primary DB count spec enforcement
    statefulSetOperation:
      instanceIndex: 11
      name: postgresql-11
    statefulSetSpecUpdateOperation: {}
    stepId: Failing over by promoting a Replica DB as a Primary DB
    timeOutEpocInSeconds: 1744432178
  enforcedReplicas: 6
  lastCreatedInstanceIndex: 13
  previousBlockingOperation:
    hasTimedOut: true
    operationId: Primary DB count spec enforcement
    statefulSetOperation:
      instanceIndex: 11
      name: postgresql-11
    statefulSetSpecUpdateOperation: {}
    stepId: >-
      Waiting few seconds before failing over by promoting a Replica DB as a
      Primary DB
    timeOutEpocInSeconds: 1744431877
spec:
  backup:
    pvcName: postgres-backups
    schedule: 0 */3 * * *
    volumeMount: /var/lib/backup
  customConfig: base-kubegres-config
  database:
    size: 50Gi
    storageClassName: hcloud-volumes
    volumeMount: /var/lib/postgresql/data
  env:
    - name: POSTGRES_PASSWORD
      valueFrom:
        secretKeyRef:
          key: superUserPassword
          name: postgres-secret
    - name: POSTGRES_REPLICATION_PASSWORD
      valueFrom:
        secretKeyRef:
          key: replicationUserPassword
          name: postgres-secret
  failover: {}
  image: postgres:14.7
  port: 5432
  probe:
    livenessProbe:
      exec:
        command:
          - sh
          - '-c'
          - exec pg_isready -U postgres -h $POD_IP
      failureThreshold: 10
      initialDelaySeconds: 60
      periodSeconds: 20
      successThreshold: 1
      timeoutSeconds: 15
    readinessProbe:
      exec:
        command:
          - sh
          - '-c'
          - exec pg_isready -U postgres -h $POD_IP
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 3
  replicas: 3
  resources: {}
  scheduler:
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - postgresql
              topologyKey: kubernetes.io/hostname
            weight: 100
  volume: {}
I think the problem was that the controller tried to promote postgresql-11 to primary while it was in a restart loop. That brings up another question: why did the operator choose this pod to be the primary while it was in a NotReady state? Probably the pod was healthy at the time of the selection and got picked, but then went into the restart loop because of some other error.
This could be fixed with additional readiness checks on the controller side. For example, if the operator cannot promote the pod to primary and is running into the timeout (300s), let the controller check the pod's Ready state again; if it is not Ready, select another Ready pod to promote. This could prevent the blocking state.
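Expressed as kubectl for illustration (inside the operator this would of course be a check on the Pod's status conditions), the kind of re-check meant here, with the pod name as an example:

kubectl get pod postgresql-11-0 -n postgres-staging \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'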
I hope my feedback is helpful to you :)
Thank you for sharing this. It is very helpful.
I am going to start working on a fix for this issue this Friday and could perhaps have it ready by the end of April.
The most challenging part is being able to consistently reproduce this issue, so that I can write an automated test which reproduces it and then write a fix which makes the test pass.
We solved the problem in a non-operator-friendly way, but it's working.
That is largely the approach I have had to take, which is why I made this ticket: I was hoping for a way to work with the operator instead of against it. The recovery options I know of are:
- There's the way you mentioned
- Backup and restore is also an option, but we did not have backups at the time, and there's a decent amount that goes into it: teaching people, getting the required software, deciding where to store the backups, etc.
- I can also log onto our NFS and wipe out the bad timeline files so the pod starts, but with bad data. As stated, I know I have a good replica, but once the bad primary is restored Kubegres takes over handling the cluster again, so I then delete the StatefulSets that have bad data and Kubegres will promote the good replica to primary and create two new replicas. It won't do that while any pods have problems, though, which is why I have to bring the old primary up even if it has bad data.
- I can swap the PVs backing the databases behind the scenes, spin up a new cluster, and re-attach the good PV to it.
So far option 3 has been the easiest. Most of the time the scenario is that an SA forgot to put the cluster into maintenance mode, so we don't really know what went wrong, just that the cluster is not happy. We do know we have a good replica with good data that we want to use, but Kubegres just won't promote it until everything is happy and dandy again.
If I had to personally recommend an approach, I'd probably add a forcePromote option to the promotePod hook: either a completely new key, or making promotePod a JSON object that takes podName and force keys. It really doesn't matter to me how.
But basically a way to tell it to forget about the current blocker and just promote a pod to be the primary. It can then either delete the bad ones or let the user know (via documentation or logs, I guess) to delete the bad ones. Then, from what I have seen, Kubegres will spin up new good replicas connected to the primary it just promoted.
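To illustrate, the two shapes I have in mind could look something like this; neither exists in Kubegres today, both are purely hypothetical:

# Option A: a separate flag next to the existing promotePod (hypothetical)
spec:
  failover:
    promotePod: postgresql-2-0
    forcePromote: true

# Option B: promotePod becomes an object (hypothetical)
spec:
  failover:
    promotePod:
      podName: postgresql-2-0
      force: true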
Any word on this?