dragonfly-operator

Operator takes no decision when connection is lost

Open SoGooDFR opened this issue 1 year ago • 2 comments

Hi,

On 3 August, the GCP zone europe-west9-a went down. Our master was in this zone and the other replicas were in europe-west9-b and europe-west9-c. So we have 3 pods, which is enough for a "quorum", but dragonfly-operator took no decision about switching the master.

Timeline:

  • 13:29 GCP europe-west9-a goes down (Node status: Unreachable), taking the master pod with it
  • 13:29 dragonfly-operator logs this message in a loop: Master pod is not ready yet, will requeue
  • ...
  • 13:36 Manual action: delete the master pod

I find it hard to believe that the operator takes no decision in this situation. The connection had been lost for several minutes, so shouldn't another healthy replica be promoted to master?

SoGooDFR, Aug 06 '24 09:08

Hi @SoGooDFR, sorry for the incident. This should not happen, as we patched a fix for this in v1.1.3. Which version are you using?

Abhra303, Aug 08 '24 07:08

From the log message you shared, it seems you are using >=v1.1.3. Currently, we fail over when the master tries to restart, but in your case the node itself went down, so the failover unfortunately wasn't triggered. We need to strengthen our health check and failover logic so that this cannot happen again. I will fix it asap. Again, sorry for the incident.
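For illustration only, a strengthened check of this kind could look like the sketch below. It is not the operator's actual code: `PodState`, `master_is_lost`, `pick_new_master`, the pod names and the 30-second grace period are all hypothetical, chosen just to show a failover decision that also reacts to an unreachable node instead of waiting for a restart.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PodState:
    """Hypothetical snapshot of one Dragonfly pod, as a controller might see it."""
    name: str
    role: str                      # "master" or "replica"
    ready: bool                    # pod readiness
    node_reachable: bool           # False when the node is reported Unreachable
    unhealthy_seconds: float = 0.0 # how long the pod has been unhealthy


def master_is_lost(master: PodState, grace_seconds: float) -> bool:
    """The master counts as lost once it has been unhealthy for the whole
    grace period, whether the pod is not ready or its node is unreachable."""
    healthy = master.ready and master.node_reachable
    return (not healthy) and master.unhealthy_seconds >= grace_seconds


def pick_new_master(pods: List[PodState], grace_seconds: float = 30.0) -> Optional[str]:
    """Return the name of the replica to promote, or None if no failover is needed
    (or possible)."""
    master = next((p for p in pods if p.role == "master"), None)
    if master is not None and not master_is_lost(master, grace_seconds):
        return None  # master is healthy, or the outage is still within the grace period
    candidates = [p for p in pods
                  if p.role == "replica" and p.ready and p.node_reachable]
    if not candidates:
        return None  # nothing healthy to promote
    # Deterministic choice so repeated reconcile loops agree on the same pod.
    return min(candidates, key=lambda p: p.name).name


# The incident above: master's node unreachable from 13:29 to 13:36 (420 s),
# both replicas healthy in the other zones. Pod names are made up.
incident = [
    PodState("dragonfly-0", "master", ready=False, node_reachable=False,
             unhealthy_seconds=420),
    PodState("dragonfly-1", "replica", ready=True, node_reachable=True),
    PodState("dragonfly-2", "replica", ready=True, node_reachable=True),
]
print(pick_new_master(incident))  # prints "dragonfly-1"
```

The grace period is there so that a short network blip does not trigger a promotion; its value is a trade-off between recovery time and unnecessary failovers.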

Abhra303, Aug 08 '24 07:08

This seems to be the same issue as https://github.com/dragonflydb/dragonfly-operator/issues/227?

ashotland, Aug 18 '24 12:08

Hey @Abhra303, any update on this?

We also ran into this while evaluating Dragonfly and are now blocked from using it.

If you believe there will not be a fix soon, we will look for something else.

Thanks

orenhecht, Aug 21 '24 11:08

Hi @orenhecht, we will release a patch next week.

Abhra303, Aug 21 '24 13:08