etcd data integrity guarantees on power loss of machine with SSD with volatile write cache and -o nobarrier

Hi,

I know this is a strange question :) The first answer would be: don't do what I'm asking. Disabling write barriers is not recommended on disks with a volatile write cache, e.g. DRAM without battery backup.

However, in theory there might be cases, especially in the world of distributed systems, where we can tolerate one broken node because we can detect the damage and then catch up or repair. In theory.
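To make the setup concrete, here is a minimal sketch for spotting mounts that run without write barriers. It assumes the Linux `/proc/mounts` line layout (device, mount point, filesystem type, options) and the `nobarrier` / `barrier=0` option spellings. The function name and the sample paths are made up for illustration. It says nothing about the drive's own cache settings.

```python
def mounts_without_barriers(mounts_text):
    """Return (mount_point, fs_type) pairs whose options disable write barriers.

    mounts_text is expected to look like the contents of /proc/mounts.
    """
    risky = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        opts = options.split(",")
        if "nobarrier" in opts or "barrier=0" in opts:
            risky.append((mount_point, fs_type))
    return risky


# Hypothetical sample input; on a real machine the text would come from /proc/mounts.
sample = (
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
    "/dev/sdb1 /var/lib/etcd ext4 rw,relatime,nobarrier 0 0\n"
)
print(mounts_without_barriers(sample))
```

An empty result only means no barrier-disabling option was found in the text that was passed in.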

Let me clarify etcd details.

How will etcd cluster behave in case of sudden power loss of one node with SSD disk which has only DRAM volatile (no battery) write cache?

What may happen in the worst case and what in the best case?

Probably in the best case, the cluster can repair this node: it catches up quickly after detecting the revision ID difference.

Probably in the worst case, the cluster will not notice the data loss on that node, if after restart the node keeps the same revision ID but the actual data is somehow broken. It may be detected later during a read operation, and then... since there is a consensus algorithm involved, should everything be fine even in the worst case?

Aug 25 '22 17:08 b10s

It's really going to depend on what happens with the other nodes in the cluster. If it's a general datacenter outage and all nodes shut down without time to flush, that's a different situation than just one node losing power. Or: was the nobarrier node the leader at the time? Unreliable fsync (which this is a special case of) is one way to get an unrecoverable cluster.

The real issue here -- if there is one for etcd work -- is how soon do we detect the corruption? Because if we can figure out that one node is corrupted before the other nodes are, there's a chance to recover. Realistically, though, not all outcomes of corruption are going to be easily detectable.
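As a rough illustration of what "detecting" could look like, here is a sketch that compares per-member KV hashes taken at the same revision and flags the members that disagree with the rest. The input format and the function name are assumptions for this example. In practice the values might be collected with something like `etcdctl endpoint hashkv`, but this is not how etcd's own corruption checks are implemented.

```python
from collections import Counter


def find_suspect_members(reports):
    """reports maps endpoint -> (revision, kv_hash).

    Hashes are only compared between members reporting the same revision.
    Returns a sorted list of endpoints whose hash disagrees with the
    strict-majority hash at their revision. If no hash has a strict
    majority at a revision, every member at that revision is flagged.
    """
    by_revision = {}
    for endpoint, (revision, kv_hash) in reports.items():
        by_revision.setdefault(revision, []).append((endpoint, kv_hash))

    suspects = []
    for entries in by_revision.values():
        counts = Counter(kv_hash for _, kv_hash in entries)
        majority_hash, majority_count = counts.most_common(1)[0]
        if majority_count * 2 <= len(entries):
            # No strict majority: we cannot tell which copy is the good one.
            suspects.extend(endpoint for endpoint, _ in entries)
        else:
            suspects.extend(e for e, h in entries if h != majority_hash)
    return sorted(suspects)


# Hypothetical values for a three-member cluster.
reports = {
    "node1": (100, 0xABC),
    "node2": (100, 0xABC),
    "node3": (100, 0xDEF),
}
print(find_suspect_members(reports))
```

A check like this only catches corruption that changes the hash at a revision all members have reached. A node that silently lost its most recent writes and is simply behind would not be flagged here.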

Aug 25 '22 18:08 jberkus

How will etcd cluster behave in case of sudden power loss of one node with SSD disk which has only DRAM volatile (no battery) write cache?

I think this falls into the category of "Avoiding persistent storage writes" (section 11.7.3 of the Raft paper). Also see the discussion in raft-dev.

My understanding is the following:

If you have stable storage: when a majority or all of the nodes go down, you lose only availability. When the nodes come back, they can form a quorum and there will be no data loss.
In your example the storage is only 'partially' stable. So when a majority or all of the nodes go down at the same time, you lose both availability and durability. When the nodes come back, some data might be missing.
In both cases, if a majority of the nodes is operational at all times, you have durability and availability.
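The reasoning above rests on quorum intersection, which can be sketched as a toy model. This is a simplification, not etcd's recovery logic: it treats a committed entry as safe only if every possible quorum still contains at least one node that holds it after the crash. The node names and helper functions are hypothetical.

```python
from itertools import combinations


def quorum_size(n):
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1


def entry_survives(cluster, holders_after_crash):
    """True if every possible quorum includes a node that still has the entry.

    cluster: list of node names.
    holders_after_crash: nodes whose copy of the entry survived the power loss.
    """
    holders = set(holders_after_crash)
    q = quorum_size(len(cluster))
    return all(holders & set(group) for group in combinations(cluster, q))


cluster = ["a", "b", "c"]

# Entry was on all three nodes; "b" lost it from its volatile cache.
print(entry_survives(cluster, ["a", "c"]))

# Entry was only on "a" and "b" (a bare majority); "b" lost it.
print(entry_survives(cluster, ["a"]))

# All nodes lost power at once and none kept the entry.
print(entry_survives(cluster, []))
```

In this model the second case is the interesting one: "b" and "c" could form a quorum that does not contain the entry, so whether it is kept depends on "a" staying available to replicate it.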

Also, during our discussion at the meeting I mentioned Byzantine failure. I don't think it's applicable here. I was thinking of a situation where a Raft follower ends up writing data that's not in the AppendEntries request from the leader. But in your scenario we just have a cache that can disappear.

Sep 05 '22 19:09 lavacat

Closing as part of the migration of issues labeled as questions to GitHub Discussions. GitHub Discussions makes it easier for the whole community to provide answers.

If you think your question is still relevant, feel free to ask at https://github.com/etcd-io/etcd/discussions

Sep 28 '22 08:09 serathius
