[Question] What causes Raft WAL to get corrupted?

Open wyrobnik opened this issue 3 years ago • 2 comments

What happened?

We have been running a Raft cluster using etcd's Raft implementation, starting from the raftexample. Occasionally, one node fails to restart because of a corrupted WAL file. During WAL replay, wal.ReadAll() [0] returns the error io.ErrUnexpectedEOF.

0: https://github.com/etcd-io/etcd/blob/main/server/storage/wal/wal.go#L437

What did you expect to happen?

The WAL should not end up in a corrupted state, or there should be an error indicating when and how the WAL got corrupted.

How can we reproduce it (as minimally and precisely as possible)?

Unknown.

Anything else we need to know?

I have been able to fix the issue, thanks to wal.Repair() [1] introduced in #2597. I'm curious whether this is a known issue (could not find any other issues) and whether it is known how a node can get into this state to begin with.

1: https://github.com/etcd-io/etcd/blob/main/server/storage/wal/repair.go#L30

Etcd version (please run commands below)

$ etcd --version
v3.5.1

Etcd configuration (command line flags or environment variables)

N/A

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

N/A

Relevant log output

I0606 22:09:42.652467    1 raft.go:371] Replaying WAL of member 1
I0606 22:09:48.592134    1 raft.go:433] Replaying Loading WAL at term 140929 and index 2071012367
E0606 22:09:49.035122    1 raft.go:435] w.ReadAll(): unexpected EOF

wyrobnik avatar Jun 06 '22 23:06 wyrobnik

@wyrobnik This happens because of a torn write on the disk, that is, a record that was only partially persisted before the process or machine stopped. Do you see this happening quite often, or was it a one-off occurrence?

agargi avatar Jun 08 '22 21:06 agargi

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 21 '22 02:09 stale[bot]

Closing as part of the migration of issues labeled as questions to GitHub Discussions. GitHub Discussions makes it easier for the whole community to provide answers.

If you think your question is still relevant, feel free to ask at https://github.com/etcd-io/etcd/discussions

serathius avatar Sep 28 '22 08:09 serathius