
Data Loss when Instances Crash / Restart

Open kostasrim opened this issue 6 months ago • 10 comments

Describe the bug
DragonflyDB suffers data loss after cluster failover when using the RDB backup format.

To Reproduce
Steps to reproduce the behavior:

1. Set up a 3-instance DragonflyDB in Kubernetes on t4g instances with the RDB backup format (a sketch of such a manifest follows below).
2. Insert some data.
3. Delete all 3 DragonflyDB instances.
4. Wait for the cluster to report as healthy.
5. Observe that data loss has happened.

Expected behavior
Backups happen as expected and data loss does not happen.
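For reference, a minimal sketch of the kind of manifest involved. Field names follow the operator's `Dragonfly` CRD as I understand it; the name, image tag, cron schedule, and sizes are illustrative, not our exact config:

```yaml
# Illustrative Dragonfly resource: 3 pods with periodic RDB snapshots,
# each persisted to the pod's own PVC. All values are examples.
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: ds2-dragonfly-dev
spec:
  replicas: 3                              # 1 master + 2 replicas
  image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.31.0
  snapshot:
    cron: "*/5 * * * *"                    # RDB snapshot every 5 minutes
    persistentVolumeClaimSpec:             # one volume per pod for dumps
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```

Deleting all three pods at once (e.g. via the instance's label selector) and waiting for the operator to bring them back reproduces the issue for us.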

Screenshots

(screenshot attached)

Datadog logs attached below: extract-2025-06-23T18_29_51.608Z.csv

Environment (please complete the following information):

- OS: AWS Amazon Linux
- Kernel: Linux ds2-dragonfly-dev-0 6.1.134-152.225.amzn2023.aarch64
- Containerized?: Kubernetes EKS 1.33
- Dragonfly Version: 1.31.0

kostasrim avatar Jun 24 '25 08:06 kostasrim

Moved it here from the dragonfly issue: https://github.com/dragonflydb/dragonfly/issues/5352

@jzkinetic FYI

kostasrim avatar Jun 24 '25 08:06 kostasrim

@jzkinetic

When you say cluster - do you mean a master with 2 replicas? It seems that a node loaded the snapshot, but then the operator randomly assigned roles, that node became a replica, and it flushed the data.

Am I correct, @jzkinetic?

romange avatar Jun 24 '25 09:06 romange

> @jzkinetic
>
> When you say cluster - do you mean a master with 2 replicas? It seems that a node loaded the snapshot, but then the operator randomly assigned roles, that node became a replica, and it flushed the data.
>
> Am I correct, @jzkinetic?

Yes, by cluster I meant 1 master & 2 replicas. As for the role assignment, I am not sure. I had thought it was a problem with DragonflyDB dropping data, not with the Operator, so I haven't checked there yet.
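If it helps, next time this happens I can capture the role of each pod before and after the restart. My understanding (from the operator docs, not verified) is that the operator tracks roles with pod labels, roughly like this:

```yaml
# Illustrative pod metadata; the "app"/"role" label keys are my assumption
# from the operator docs. "role" would flip to "replica" if the operator
# demoted this pod after the restart.
metadata:
  name: ds2-dragonfly-dev-0
  labels:
    app: ds2-dragonfly-dev
    role: master
```

If the pod that loaded the full snapshot comes back labeled as a replica, that would support the demote-and-flush theory.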

jzkinetic avatar Jun 24 '25 18:06 jzkinetic

@romange The issue was observed to happen again.

(screenshot attached)

Here are the logs from this data loss event: extract-2025-06-26T12_15_32.460Z.csv

jzkinetic avatar Jun 26 '25 12:06 jzkinetic

The time is 18:00 on the 25th, but you attached logs from the 26th. In any case, it looks like an operator issue - when the pods are recreated, the pod that was master before becomes a replica and loses the data it loads. We are currently overloaded with other stuff, so I am not sure when we will be able to fix this issue.

We welcome contributions from our users, by the way.

romange avatar Jun 26 '25 12:06 romange

> The time is 18:00 on the 25th, but you attached logs from the 26th.

I believe this is because Datadog exports logs in UTC, but when browsing the charts it's set to my time zone.

Do you know of a viable workaround? Right now, this prevents us from using DragonflyDB for any long-term persistent data.

jzkinetic avatar Jun 26 '25 20:06 jzkinetic

Hi @jzkinetic, to make sure I understand: we have 3 nodes, A (master), B (replica), and C (replica). Only A is configured to take snapshots. We kill all nodes simultaneously, after which they are recreated as: B (master) with no data, A (replica) replicating the empty B, and C (replica).

Is that right?

ashotland avatar Jun 26 '25 21:06 ashotland

That's not my understanding. With the Dragonfly Operator, I think all dragonfly pods are configured to take snapshots, even if they're replicas. When all nodes are killed and restarted, some data loss is expected, but not everything: DragonflyDB doesn't currently support AOF, so when the cluster is rebooted it starts back up from the last stored snapshot, and everything that was written between the last snapshot and the restart is lost.

However, this isn't happening every time. Sometimes it decides to wipe everything and start from scratch.
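For context, the snapshot section of our spec is roughly the sketch below (values illustrative). As far as I can tell the operator applies it to every pod, which is why I'd expect each pod to come back with a dump no older than the cron interval:

```yaml
# Rough sketch of our snapshot config (illustrative values); the operator
# appears to apply it to all pods, replicas included.
snapshot:
  cron: "*/5 * * * *"            # worst-case expected loss: ~5 minutes of writes
  persistentVolumeClaimSpec:     # each pod keeps its dumps on its own PVC
    accessModes: ["ReadWriteOnce"]
    resources:
      requests:
        storage: 10Gi
```

A full wipe would then mean the pod elected as master came up with an empty or unreadable dump, and the replicas flushed their data to match it.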

jzkinetic avatar Jun 26 '25 21:06 jzkinetic

@jzkinetic - I've noticed that in the attached log file, all the 'Load finished' log lines mention 10 keys:

e.g. I20250626 00:42:59.792482 46 server_family.cc:1153] Load finished, num keys read: 10

Are you expecting only 10 keys, and of what total size?

ashotland avatar Jun 29 '25 13:06 ashotland

We do have only 10 keys, but the keys are several hundred megabytes in size each. They are a combination of Hashes and Sorted Sets.

jzkinetic avatar Jun 30 '25 10:06 jzkinetic