
snap.broken files filled up the etcd disk space after a few restarts

Open stanleyu opened this issue 3 months ago • 3 comments

Bug report criteria

What happened?

We're running etcd v3.5 in Kubernetes Pods, and the attached disks have limited size. The etcd cluster has 5 members, each running in a separate Pod on a different Kubernetes node. We restarted the Pods a few times. After that, several .snap.broken files were created which filled up the disk space, so the etcd services could not start anymore.

$ etcd --version
etcd Version: 3.5.21
Git SHA: a17edfd
Go Version: go1.23.7
Go OS/Arch: linux/amd64

$ du -sh *
3.2G    snap
800M    wal

$ cd snap/
$ du -sh *
482M    0000000000003c97-0000000005ff12da.snap
487M    0000000000003c97-000000000600997b.snap
483M    0000000000003c97-0000000006022053.snap
489M    0000000000003c9f-000000000603a784.snap
211M    0000000000003c9f-000000000603a784.snap.broken
483M    0000000000003ca0-0000000006052e25.snap
490M    0000000000003ca0-000000000606b4c6.snap.broken
59M     00000000000003a1-000000000606b4f9.snap.broken
68K     db

What did you expect to happen?

If I understand the etcd source code that creates the snap files correctly (server/etcdserver/api/snap/snapshotter.go), it can leave incomplete snap files behind under some conditions. For example, if the process is stopped while a new snap file is being written, the file is left behind incomplete. The next time the etcd process starts, it cannot load the partial file, so it isolates it as a .snap.broken file. The .snap.broken files are skipped from then on and are never purged.
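To illustrate the isolation step, here is a simplified, self-contained sketch; the real logic lives in loadSnap in snapshotter.go and uses etcd's own snapshot Read and logging helpers, so the names below are only illustrative:

package main

import (
	"fmt"
	"os"
)

// isolateBroken mimics what etcd does when a snapshot file cannot be parsed:
// the file is renamed to "<name>.broken" and left on disk. Since later loads
// skip *.broken files and no purge routine covers them, every interrupted
// snapshot write permanently consumes disk space.
func isolateBroken(fpath string) error {
	brokenPath := fpath + ".broken"
	if err := os.Rename(fpath, brokenPath); err != nil {
		return fmt.Errorf("failed to isolate broken snapshot %s: %w", fpath, err)
	}
	fmt.Printf("isolated broken snapshot: %s -> %s\n", fpath, brokenPath)
	return nil
}

func main() {
	// Hypothetical path for a partially written snapshot left by a crash.
	_ = isolateBroken("/etcd-data/member/snap/0000000000003c9f-000000000603a784.snap")
}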

In enterprise-class software, critical files are usually created with the tmp+rename approach: first dump the file content into a temporary file on the same filesystem, then commit the file creation by renaming it to the destination file name. This way, an incomplete file is never loaded at all, and a leftover temporary file can simply be discarded automatically the next time. I believe this would largely avoid the .snap.broken files.
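For illustration, a minimal generic sketch of the tmp+rename pattern in Go (not etcd's code; the helper name writeFileAtomic is made up here):

package main

import (
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file next to the destination,
// fsyncs it, and then renames it to the final path. A crash mid-write leaves
// only a *.tmp file behind, which can be safely discarded; the final path is
// either absent or complete, never partial.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	tmpPath := path + ".tmp"
	f, err := os.OpenFile(tmpPath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
	if err != nil {
		return err
	}
	if _, err = f.Write(data); err != nil {
		f.Close()
		os.Remove(tmpPath)
		return err
	}
	if err = f.Sync(); err != nil {
		f.Close()
		os.Remove(tmpPath)
		return err
	}
	if err = f.Close(); err != nil {
		os.Remove(tmpPath)
		return err
	}
	// Rename is atomic on POSIX filesystems when source and target are on
	// the same filesystem, which is why the temp file is created next to
	// the destination.
	return os.Rename(tmpPath, path)
}

func main() {
	_ = writeFileAtomic(filepath.Join(os.TempDir(), "example.snap"), []byte("snapshot payload"), 0o666)
}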

I'm not sure whether this should be considered an enhancement request or a bug fix. Either way, it causes trouble when the disk space available to etcd is small, especially when the etcd server is containerized.

Hopefully this approach can be used in both v3.5 and newer versions.

How can we reproduce it (as minimally and precisely as possible)?

Forcibly restart the etcd members several times.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.21
Git SHA: a17edfd
Go Version: go1.23.7
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.21

Etcd configuration (command line flags or environment variables)

These are the command-line flags for the first etcd member. There are 5 etcd members in the cluster.

/usr/bin/etcd --data-dir=/etcd-data --name=sample-mds-1 --listen-peer-urls=https://0.0.0.0:2380 --listen-client-urls=https://0.0.0.0:2379 --advertise-client-urls=https://sample-mds-1.sample-mds.labsys:2379 --initial-advertise-peer-urls=https://sample-mds-1.sample-mds.labsys:2380 --initial-cluster=sample-mds-1=https://sample-mds-1.sample-mds.labsys:2380 --initial-cluster-state=new --initial-cluster-token=sample-mds-tok --peer-cert-file=/sample-config/certificates/tls.crt --peer-key-file=/sample-config/certificates/tls.key --peer-trusted-ca-file=/sample-config/certificates/ca.crt --peer-client-cert-auth --cert-file=/sample-config/certificates/tls.crt --key-file=/sample-config/certificates/tls.key --client-cert-auth --trusted-ca-file=/sample-config/certificates/ca.crt --enable-v2

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output


stanleyu avatar Sep 26 '25 20:09 stanleyu

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 26 '25 00:11 github-actions[bot]

Any comments on this issue?

stanleyu avatar Nov 26 '25 16:11 stanleyu

I found the following code segment that seems to be linked to the reported behavior.

The root cause is the non-atomic snapshot write in save() combined with the one-way rename logic in loadSnap(), leading to unbounded accumulation of .snap.broken files.

Details:

  1. Function Snapshotter.save in server/etcdserver/api/snap/snapshotter.go, lines 75-104: writes directly to the final .snap filename via pioutil.WriteAndSyncFile. A crash or kill during the write leaves a partial/corrupt .snap file on disk.

  2. Function loadSnap in the same file, lines 140-158: on any Read error (including one caused by a partial file), it renames the unreadable .snap file to .snap.broken. No code ever removes or prunes .snap.broken files, so broken snapshots accumulate indefinitely.

Recommended fix (inside Snapshotter.save, write to a temporary file and then rename, instead of writing the final .snap file directly):

spath := filepath.Join(s.dir, fname)
tmpPath := spath + ".tmp"
// Write to temp file and fsync
err = pioutil.WriteAndSyncFile(tmpPath, d, 0666)
if err != nil {
    s.lg.Warn("failed to write a snap temp file", zap.String("path", tmpPath), zap.Error(err))
    os.Remove(tmpPath)
    return err
}
// Atomically rename to final snapshot
if err = os.Rename(tmpPath, spath); err != nil {
    s.lg.Warn("failed to rename snap file", zap.String("tmp-path", tmpPath), zap.String("path", spath), zap.Error(err))
    return err
}
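In addition, leftover *.snap.tmp files from an interrupted write could be swept on startup. A hypothetical sketch (the cleanTmpFiles helper does not exist in etcd today; it assumes only the standard library plus the zap logger already used in the package):

// cleanTmpFiles removes orphaned *.snap.tmp files left behind by an
// interrupted snapshot write. With the tmp+rename scheme above they are
// never valid snapshots, so they can be discarded, e.g. when the
// Snapshotter is constructed.
// Assumed imports: "os", "strings", "path/filepath", "go.uber.org/zap".
func cleanTmpFiles(lg *zap.Logger, dir string) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		lg.Warn("failed to list snap dir for temp-file cleanup", zap.String("dir", dir), zap.Error(err))
		return
	}
	for _, e := range entries {
		if !strings.HasSuffix(e.Name(), ".snap.tmp") {
			continue
		}
		path := filepath.Join(dir, e.Name())
		if rerr := os.Remove(path); rerr != nil {
			lg.Warn("failed to remove orphaned snap temp file", zap.String("path", path), zap.Error(rerr))
		} else {
			lg.Info("removed orphaned snap temp file", zap.String("path", path))
		}
	}
}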

Please advise if further clarification is needed.

traincheck-team avatar Dec 05 '25 14:12 traincheck-team