etcdserver: separate raft log compact from snapshot
This PR is part of #17098 (Reduce memory usage of etcd member catchup mechanism)
Purpose
Make raft log compaction independent from snapshots.
Key Updates
- Added a new function `compactRaftLog` for raft log compaction (see the sketch after this list).
- Introduced two variables, `RaftLogCompactionStep` and `DefaultRaftLogCompactionStep`, to manage how often the raft log is compacted.
- Refactored some existing tests. They previously relied on a compaction happening immediately after a snapshot to ensure a snapshot was sent to followers; this PR breaks that assumption, so I updated them to keep them working.
- Added unit and integration tests for this PR.
- [ ] Blocked by #18494
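For a rough feel of the step mechanism, here is a minimal, hypothetical sketch against the raft `MemoryStorage` API. `maybeCompact` and the constant below are illustrative stand-ins for the PR's `compactRaftLog` and `DefaultRaftLogCompactionStep`, not the actual implementation:

```go
package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
	pb "go.etcd.io/raft/v3/raftpb"
)

// defaultRaftLogCompactionStep mirrors the PR's default of 10: compact only
// after the applied index has advanced by this many entries.
const defaultRaftLogCompactionStep = 10

// maybeCompact compacts the log up to appliedIndex once a full step has been
// applied since the previous compaction, independent of any snapshot.
func maybeCompact(ms *raft.MemoryStorage, lastCompacted, appliedIndex uint64) (uint64, error) {
	if appliedIndex < lastCompacted+defaultRaftLogCompactionStep {
		return lastCompacted, nil // not enough new entries yet
	}
	if err := ms.Compact(appliedIndex); err != nil {
		return lastCompacted, err
	}
	return appliedIndex, nil
}

func main() {
	ms := raft.NewMemoryStorage()
	var ents []pb.Entry
	for i := uint64(1); i <= 25; i++ {
		ents = append(ents, pb.Entry{Index: i, Term: 1})
	}
	ms.Append(ents)

	lastCompacted := uint64(0)
	for applied := uint64(1); applied <= 25; applied++ {
		lastCompacted, _ = maybeCompact(ms, lastCompacted, applied)
	}
	first, _ := ms.FirstIndex()
	fmt.Println("first retained index:", first) // 21: everything before was compacted
}
```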
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: clement2026. Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Hi @clement2026. Thanks for your PR.
I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Benchmark Summary
Here are the benchmark results for compacting the raft log every time the applied index increases by 1, 10, 100, and 1000 (let's call this the "step").
| Branch | | Throughput |
|---|---|---|
| `main` (baseline) | read | - |
| | write | - |
| `step_1` | read | [2.86%, 17.00%] |
| | write | [3.02%, 16.75%] |
| `step_10` | read | [2.82%, 16.53%] |
| | write | [2.98%, 16.95%] |
| `step_100` | read | [-2.09%, 9.77%] |
| | write | [-1.00%, 10.93%] |
| `step_1000` | read | [-4.13%, 7.27%] |
| | write | [-4.91%, 7.65%] |
| `main` again | read | [-4.19%, 7.39%] |
| | write | [-4.41%, 9.03%] |
Throughput performs best when the step is 1 or 10. There isn’t much difference between the two. I've set the default step value to 10 in the code.
Benchmark Details
Here's how I ran the benchmark:
- Run the benchmark on `main`, `step_1`, `step_10`, `step_100`, `step_1000`, and `main` again, in that order.
- Reboot the OS between each run.
- Collect the .csv files and generate the heat maps.
step_1 vs main
Compact raft log every time applied index increases by 1.
step_10 vs main
Compact raft log every time applied index increases by 10.
step_100 vs main
Compact raft log every time applied index increases by 100.
step_1000 vs main
Compact raft log every time applied index increases by 1000.
main vs main
Re-run benchmarks on the main branch to ensure results are consistent.
step_1 vs step_10
Since the benchmark results for step_1 and step_10 are pretty close, here I compare them to see the difference:
GitHub workflow: https://github.com/clement2026/etcd-benchmark-action/actions/runs/10424407957
The archive containing .csv files and the script: etcd-benchmark-20240817-06-00-29.zip
The benchmark was run on a cloud VM with:
- 16GB RAM
- 8 vCPUs
- 150GB NVMe SSD
- Ubuntu 24.04 LTS x64
:warning: Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.
Codecov Report
Attention: Patch coverage is 39.13043% with 28 lines in your changes missing coverage. Please review.
Project coverage is 68.89%. Comparing base (`62e4433`) to head (`0003fa2`). Report is 34 commits behind head on main.
:exclamation: Current head `0003fa2` differs from pull request most recent head `0e4d412`.
Please upload reports for the commit `0e4d412` to get more accurate results.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| client/v3/kubernetes/client.go | 0.00% | 27 Missing :warning: |
| server/embed/etcd.go | 0.00% | 0 Missing and 1 partial :warning: |
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
| Files with missing lines | Coverage Δ | |
|---|---|---|
| server/config/config.go | 79.76% <ø> (ø) | |
| server/etcdserver/server.go | 81.28% <100.00%> (-0.29%) | :arrow_down: |
| server/features/etcd_features.go | 60.00% <ø> (ø) | |
| server/storage/mvcc/kvstore.go | 89.76% <100.00%> (ø) | |
| server/embed/etcd.go | 75.77% <0.00%> (-0.05%) | :arrow_down: |
| client/v3/kubernetes/client.go | 0.00% <0.00%> (ø) | |
... and 22 files with indirect coverage changes
```diff
@@            Coverage Diff             @@
##             main   #18459      +/-   ##
==========================================
+ Coverage   68.79%   68.89%   +0.10%
==========================================
  Files         420      420
  Lines       35490    35482       -8
==========================================
+ Hits        24416    24446      +30
+ Misses       9647     9613      -34
+ Partials     1427     1423       -4
```
Continue to review full report in Codecov by Sentry.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.
Powered by Codecov. Last update 62e4433...0e4d412. Read the comment docs.
/ok-to-test
/retest pull-etcd-robustness-arm64
@clement2026: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:
/test pull-etcd-build
/test pull-etcd-e2e-386
/test pull-etcd-e2e-amd64
/test pull-etcd-e2e-arm64
/test pull-etcd-govulncheck
/test pull-etcd-integration-1-cpu-amd64
/test pull-etcd-integration-2-cpu-amd64
/test pull-etcd-integration-4-cpu-amd64
/test pull-etcd-unit-test-386
/test pull-etcd-unit-test-amd64
/test pull-etcd-unit-test-arm64
/test pull-etcd-verify
The following commands are available to trigger optional jobs:
/test pull-etcd-robustness-amd64
/test pull-etcd-robustness-arm64
Use /test all to run all jobs.
In response to this:
/retest pull-etcd-robustness-arm64
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/test pull-etcd-robustness-arm64
/test pull-etcd-robustness-arm64
Reason for the failure
The robustness test failed at `panic("need non-empty snapshot")` in the raft library.
This happens when the leader has not yet created any snapshot, but `raftStorage.ents` has already been compacted and lacks the entries a slow follower needs. The slow follower therefore requests a snapshot, and the leader panics because it has no snapshot to send.
https://github.com/etcd-io/raft/blob/ba8d7e340d530ae106118f059d3da4d89f933d23/raft.go#L662C1-L689
```go
// maybeSendSnapshot fetches a snapshot from Storage, and sends it to the given
// node. Returns true iff the snapshot message has been emitted successfully.
func (r *raft) maybeSendSnapshot(to uint64, pr *tracker.Progress) bool {
	if !pr.RecentActive {
		r.logger.Debugf("ignore sending snapshot to %x since it is not recently active", to)
		return false
	}
	snapshot, err := r.raftLog.snapshot()
	if err != nil {
		if err == ErrSnapshotTemporarilyUnavailable {
			r.logger.Debugf("%x failed to send snapshot to %x because snapshot is temporarily unavailable", r.id, to)
			return false
		}
		panic(err) // TODO(bdarnell)
	}
	if IsEmptySnap(snapshot) {
		panic("need non-empty snapshot")
	}
	sindex, sterm := snapshot.Metadata.Index, snapshot.Metadata.Term
	r.logger.Debugf("%x [firstindex: %d, commit: %d] sent snapshot[index: %d, term: %d] to %x [%s]",
		r.id, r.raftLog.firstIndex(), r.raftLog.committed, sindex, sterm, to, pr)
	pr.BecomeSnapshot(sindex)
	r.logger.Debugf("%x paused sending replication messages to %x [%s]", r.id, to, pr)
	r.send(pb.Message{To: to, Type: pb.MsgSnap, Snapshot: &snapshot})
	return true
}
```
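To make the failure mode concrete, here is a minimal, self-contained illustration (my own sketch using the raft `MemoryStorage` API, not the robustness test itself) of how compacting the log before any snapshot exists leaves the leader with only an empty snapshot to fall back on:

```go
package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
	pb "go.etcd.io/raft/v3/raftpb"
)

func main() {
	ms := raft.NewMemoryStorage()
	var ents []pb.Entry
	for i := uint64(1); i <= 10; i++ {
		ents = append(ents, pb.Entry{Index: i, Term: 1})
	}
	ms.Append(ents)

	// Compact the raft log without ever creating a snapshot; with compaction
	// decoupled from snapshots, this state can occur before the first
	// snapshot exists.
	ms.Compact(5)

	// A slow follower needing entry 3 can no longer be served from the log...
	_, err := ms.Entries(3, 6, 100)
	fmt.Println(err) // "requested index is unavailable due to compaction"

	// ...and the fallback snapshot is still empty, which is exactly what
	// trips panic("need non-empty snapshot") in maybeSendSnapshot.
	snap, _ := ms.Snapshot()
	fmt.Println(raft.IsEmptySnap(snap)) // true
}
```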
Visualization
Here are some images to help us understand the situation better.
Main branch
Compaction only happens right after a snapshot is created.
This PR
Compaction can happen before a snapshot is created.
Solutions
- Modify the implementation of `func (r *raft) maybeSendSnapshot(to uint64, pr *tracker.Progress) bool` to send an empty snapshot instead of panicking. However, I'm not sure about the expected behavior or potential risks.
- Ensure a snapshot is created if it doesn't exist when etcd starts up. This might sound silly, I know 😂
I'm stumped right now. Any ideas would be awesome!
/test pull-etcd-robustness-arm64
/test pull-etcd-robustness-arm64
/test pull-etcd-robustness-arm64
Ensure a snapshot is created if it doesn’t exist when etcd starts up. This might sound silly, I know
Sounds reasonable. We are separating the mechanisms for raft log compaction and snapshots, meaning that we now need to guarantee that a snapshot is always available. Creating an empty snapshot at etcd start sounds like a good first iteration.
Sounds reasonable. We are separating the mechanisms for raft log compaction and snapshots, meaning that we now need to guarantee that a snapshot is always available. Creating an empty snapshot at etcd start sounds like a good first iteration.
Thank you for the info! I'll give this a shot and open a new PR soon.
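For illustration, here is a minimal sketch of that bootstrap-snapshot idea against the raft `MemoryStorage` API; `ensureBootstrapSnapshot` is a hypothetical helper of my own, not the code the follow-up PR ends up with:

```go
package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
	pb "go.etcd.io/raft/v3/raftpb"
)

// ensureBootstrapSnapshot is a hypothetical helper: if storage holds no
// snapshot yet, install one at appliedIndex so a leader whose log is already
// compacted never hits IsEmptySnap in maybeSendSnapshot.
func ensureBootstrapSnapshot(ms *raft.MemoryStorage, appliedIndex uint64, cs pb.ConfState) error {
	snap, err := ms.Snapshot()
	if err != nil {
		return err
	}
	if !raft.IsEmptySnap(snap) {
		return nil // a snapshot already exists; nothing to do
	}
	// In etcd the data would reflect the bbolt db state; it is left empty
	// here purely for illustration.
	_, err = ms.CreateSnapshot(appliedIndex, &cs, nil)
	return err
}

func main() {
	ms := raft.NewMemoryStorage()
	ms.Append([]pb.Entry{{Index: 1, Term: 1}})
	if err := ensureBootstrapSnapshot(ms, 1, pb.ConfState{Voters: []uint64{1}}); err != nil {
		panic(err)
	}
	snap, _ := ms.Snapshot()
	fmt.Println("bootstrap snapshot at index:", snap.Metadata.Index) // 1
}
```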
@clement2026: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-etcd-robustness-arm64 | 0e4d4128185ee6246fc2241bdacdcdec2a543a5a | link | false | /test pull-etcd-robustness-arm64 |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Compaction can happen before a snapshot is created.
Leader sending a snapshot means that the leader sends the whole bbolt db snapshot. It doesn't care what the follower's current applied index is.
@ahrtr the image isn't loading; it's not set to public: https://private-user-images.githubusercontent.com/27894831/358932866-df2dbdb7-2165-492c-93b8-907e09651fd4.png
Leader sending a snapshot means that the leader sends the whole bbolt db snapshot. It doesn't care what the follower's current applied index is.
Though the leader doesn't care about the follower, the leader itself can panic before sending out the snapshot.
Before sending a snapshot to a follower, the leader runs maybeSendSnapshot() and may panic here:
https://github.com/etcd-io/raft/blob/ba8d7e340d530ae106118f059d3da4d89f933d23/raft.go#L678-L680
```go
if IsEmptySnap(snapshot) {
	panic("need non-empty snapshot")
}
```
All the leader actually wants is a bbolt db snapshot, not a raft log snapshot. But it seems I can't work around maybeSendSnapshot(), as it's part of the raft mechanism.
Though the leader doesn't care about the follower, the leader itself can panic before sending out the snapshot.
Before sending a snapshot to a follower, the leader runs `maybeSendSnapshot()` and may panic here: https://github.com/etcd-io/raft/blob/ba8d7e340d530ae106118f059d3da4d89f933d23/raft.go#L678-L680
`if IsEmptySnap(snapshot) { panic("need non-empty snapshot") }`
It's concerning that we're mixing a potential bug with an enhancement. Can you raise a separate issue clearly describing how to reproduce it?
It's concerning that we're mixing a potential bug with an enhancement. Can you raise a separate issue clearly describing how to reproduce it?
Sure, I'll create the issue a bit later.
After chatting with serathius, I've realized this PR isn’t needed anymore. I'll keep working on https://github.com/etcd-io/etcd/issues/17098 in a new PR. Closing this.
