adding failoverDelay cluster parameter to wait for a given number of seconds before triggering a failover

Open francoispqt opened this issue 3 years ago • 7 comments

This PR adds a feature to add a delay before triggering a failover. It introduces a new failoverDelay parameter to the Cluster spec.

The approach is to check the delay when running updateTargetPrimaryFromPods in the cluster reconciler and to continue the reconciliation loop as usual if the primary is unhealthy but failoverDelay has not yet elapsed. To track the delay, I added a currentPrimaryFailingSince property to the cluster status; it holds the timestamp of the first time the primary was considered unhealthy. As soon as the current primary is considered healthy again in a reconciliation loop, this timestamp is unset.

I'm not entirely sure of all the impacts of continuing the reconciliation loop while the primary is unhealthy, but in the manual tests I've done it seems fine. I've added some basic unit tests to confirm the behaviour and to confirm this is not a breaking change.

#507

francoispqt avatar Jul 29 '22 14:07 francoispqt

@francoispqt some tests aren't passing; can you please fix them so we can run the E2E tests?

sxd avatar Jul 29 '22 22:07 sxd

I've pushed changes to fix linting errors

francoispqt avatar Aug 01 '22 15:08 francoispqt

I've squashed commits. All tests that are allowed to run are passing now.

francoispqt avatar Aug 03 '22 08:08 francoispqt

Any update on this please?

Thanks :pray:

francoispqt avatar Aug 09 '22 09:08 francoispqt

@francoispqt we were talking about this PR in the community meeting, and we were wondering whether you have finished working on it and whether you would agree to join the next community meeting to talk about it?

sxd avatar Aug 17 '22 14:08 sxd

@sxd I'd be happy to talk about it during the next community meeting, where can I find the schedule for these meetings?

I will add an e2e test to make the PR complete. I'd like your point of view on one thing: if a loop that sets CurrentPrimaryFailingSince does not otherwise requeue after a certain time, should we requeue after the time left on the failover delay? Doing so would make sure a reconciliation loop triggers when the delay passes, so that the failover effectively starts.

francoispqt avatar Aug 23 '22 13:08 francoispqt

@francoispqt You can find the schedule here: https://github.com/cloudnative-pg/cloudnative-pg/blob/main/CONTRIBUTING.md#meetings I may not be there for that meeting (I'll try), but I'm pretty sure that @leonardoce and @phisco will want to hear from you. And yes, we should requeue to make sure the trigger happens after the delay has elapsed.

sxd avatar Aug 24 '22 19:08 sxd

@francoispqt we talked about this PR again in the community meeting and we would like to talk with you; is that possible? I wanted to sort this out before the 1.18 release.

sxd avatar Oct 19 '22 14:10 sxd

@francoispqt any news?

gbartolini avatar Nov 23 '22 09:11 gbartolini

work will continue on #1366 (easier to run tests)

armru avatar Jan 27 '23 15:01 armru

Closed by df820b7bd3d665ac0a9bf95fd12a653ef54fbe4f

mnencia avatar Feb 06 '23 17:02 mnencia

:exclamation: By default, the pull request is configured to backport to all release branches.

  • To stop backporting this PR, remove the label: backport-requested :arrow_backward: or add the label 'do not backport'
  • To stop backporting this PR to a certain release branch, remove the specific branch label: release-x.y

github-actions[bot] avatar Feb 06 '23 17:02 github-actions[bot]