snap.broken files filled up the etcd disk space after a few restarts
Bug report criteria
- [x] This bug report is not security related, security issues should be disclosed privately via [email protected].
- [x] This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
- [x] You have read the etcd bug reporting guidelines.
- [x] Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.
What happened?
We're running etcd v3.5 in Kubernetes Pods, and the attached disks have limited size. The etcd cluster has 5 members, each running in a separate Pod on a different Kubernetes node. We restarted the Pods a few times. After that, several .snap.broken files had been created; they filled up the disk, and the etcd services can no longer start.
$ etcd --version
etcd Version: 3.5.21
Git SHA: a17edfd
Go Version: go1.23.7
Go OS/Arch: linux/amd64
$ du -sh *
3.2G snap
800M wal
$ cd snap/
$ du -sh *
482M 0000000000003c97-0000000005ff12da.snap
487M 0000000000003c97-000000000600997b.snap
483M 0000000000003c97-0000000006022053.snap
489M 0000000000003c9f-000000000603a784.snap
211M 0000000000003c9f-000000000603a784.snap.broken
483M 0000000000003ca0-0000000006052e25.snap
490M 0000000000003ca0-000000000606b4c6.snap.broken
59M 00000000000003a1-000000000606b4f9.snap.broken
68K db
What did you expect to happen?
If I understand the etcd source code that creates the snap files correctly (server/etcdserver/api/snap/snapshotter.go), it can leave incomplete snap files under some conditions. For example, if the process is stopped while a new snap file is being written, the partial file is left behind. The next time the etcd process starts, it cannot load the partial file and renames it to a .snap.broken file. The .snap.broken files are skipped from then on and are never purged.
In enterprise-class software, critical files are usually created with the tmp+rename approach. That is, the content is first written to a temporary file on the same filesystem, and the file creation is then committed by renaming the temporary file to the destination name. This way an incomplete file is never loaded, and a leftover temporary file can simply be discarded automatically on the next start. I believe this would largely eliminate the .snap.broken files.
I'm not sure whether this should be considered an enhancement request or a bug fix. Either way, it causes trouble when the disk space for etcd is small, especially when the etcd server is containerized.
Hopefully this approach can be used in both v3.5 and newer versions.
How can we reproduce it (as minimally and precisely as possible)?
Forcibly restart the etcd members several times.
Anything else we need to know?
No response
Etcd version (please run commands below)
$ etcd --version
etcd Version: 3.5.21
Git SHA: a17edfd
Go Version: go1.23.7
Go OS/Arch: linux/amd64
$ etcdctl version
etcdctl version: 3.5.21
Etcd configuration (command line flags or environment variables)
These are the command-line flags for the first etcd member. There are 5 members in the etcd cluster.
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
Relevant log output
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Any comment for this issue?
I found the following code segment that seems to be linked to the reported behavior.
The root cause is the non-atomic snapshot write in save() combined with the one-way rename logic in loadSnap(), leading to unbounded accumulation of .snap.broken files.
Details:
- Function Snapshotter.save in server/etcdserver/api/snap/snapshotter.go, lines 75-104: Writes directly to the final .snap filename via pioutil.WriteAndSyncFile. A crash or kill during the write leaves a partial/corrupt .snap file on disk.
- The function loadSnap in the same file, lines 140-158: On any Read error (including one caused by a partial file), renames the unreadable .snap to .snap.broken. No code ever removes or prunes .snap.broken files, so each broken snapshot accumulates indefinitely.
Recommended fix:
spath := filepath.Join(s.dir, fname)
tmpPath := spath + ".tmp"
// Write to a temp file in the same directory and fsync it
err = pioutil.WriteAndSyncFile(tmpPath, d, 0666)
if err != nil {
	s.lg.Warn("failed to write a snap temp file", zap.String("path", tmpPath), zap.Error(err))
	// Best-effort cleanup of the partial temp file
	os.Remove(tmpPath)
	return err
}
// Atomically rename to the final snapshot name
if err = os.Rename(tmpPath, spath); err != nil {
	s.lg.Warn("failed to rename snap file", zap.String("tmp-path", tmpPath), zap.String("path", spath), zap.Error(err))
	// Do not leave the temp file behind if the rename fails
	os.Remove(tmpPath)
	return err
}
Please advise if further clarification is needed.