
etcdserver: separate raft log compact from snapshot

Open clement2026 opened this issue 1 year ago • 17 comments

This PR is part of #17098 (Reduce memory usage of etcd member catchup mechanism)

Purpose

Make raft log compaction independent of snapshots.

Key Updates

  • Add a new function, compactRaftLog, that performs raft log compaction (see the sketch after this list).
  • Introduce two variables, RaftLogCompactionStep and DefaultRaftLogCompactionStep, to control how often the raft log is compacted.
  • Refactor some existing tests. They previously relied on a compaction happening immediately after a snapshot to force a snapshot to be sent to followers; this PR breaks that assumption, so the tests were adjusted to keep passing.
  • Add unit and integration tests for this PR.
  • [ ] Blocked by #18494
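
A minimal sketch of the mechanism, for illustration. The names compactRaftLog and DefaultRaftLogCompactionStep follow the description above, but the signature and body are assumptions rather than the actual etcdserver change:

// DefaultRaftLogCompactionStep mirrors the default chosen from the benchmarks
// below (10); in the real change the value lives in the server configuration.
const DefaultRaftLogCompactionStep = 10

// compactRaftLog sketches the snapshot-independent compaction path: it only
// checks how far the applied index has advanced past the last compaction.
// raftStorage is the in-memory raft log (a go.etcd.io/raft/v3 MemoryStorage);
// appliedIndex, compactedIndex and step stand in for the real server fields.
func compactRaftLog(raftStorage *raft.MemoryStorage, appliedIndex, compactedIndex, step uint64) (uint64, error) {
	if appliedIndex < compactedIndex+step {
		return compactedIndex, nil // not enough progress since the last compaction
	}
	// The real change would likely keep a small tail of entries so that slow
	// followers can still be caught up from the log.
	if err := raftStorage.Compact(appliedIndex); err != nil {
		if err == raft.ErrCompacted {
			return compactedIndex, nil // already compacted past this point
		}
		return compactedIndex, err
	}
	return appliedIndex, nil
}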

clement2026 avatar Aug 17 '24 06:08 clement2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clement2026. Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Aug 17 '24 06:08 k8s-ci-robot

Hi @clement2026. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 17 '24 06:08 k8s-ci-robot

Benchmark Summary

Here are the benchmark results for compacting the raft log every time the applied index increases by 1, 10, 100, and 1000 (let's call this the "step").

Branch        Read throughput Δ     Write throughput Δ
main (base)   -                     -
step_1        [2.86%, 17.00%]       [3.02%, 16.75%]
step_10       [2.82%, 16.53%]       [2.98%, 16.95%]
step_100      [-2.09%, 9.77%]       [-1.00%, 10.93%]
step_1000     [-4.13%, 7.27%]       [-4.91%, 7.65%]
main (again)  [-4.19%, 7.39%]       [-4.41%, 9.03%]

Throughput is best when the step is 1 or 10, and there isn't much difference between the two, so I've set the default step value to 10 in the code.

clement2026 avatar Aug 17 '24 09:08 clement2026

Benchmark Details

Here's how I ran the benchmark:

  1. Run the benchmark on main, step_1, step_10, step_100, step_1000, and then main again, in that order.
  2. Reboot the OS between each run.
  3. Collect .csv files and generate the heat maps.

step_1 vs main

Compact raft log every time applied index increases by 1.

[heat map image: step_1_read]

step_10 vs main

Compact raft log every time applied index increases by 10.

[heat map image: step_10_read]

step_100 vs main

Compact raft log every time applied index increases by 100.

[heat map image: step_100_read]

step_1000 vs main

Compact raft log every time applied index increases by 1000.

[heat map image: step_1000_read]

main vs main

Re-run benchmarks on the main branch to ensure results are consistent.

[heat map image: main_2_read]

step_1 vs step_10

Since the benchmark results for step_1 and step_10 are pretty close, here I compare them to see the difference:

[heat map image: step_1-vs-step_10_read]


GitHub workflow: https://github.com/clement2026/etcd-benchmark-action/actions/runs/10424407957

The archive containing .csv files and the script: etcd-benchmark-20240817-06-00-29.zip

The benchmark was run on a cloud VM with:

  • 16GB RAM
  • 8 vCPUs
  • 150GB NVMe SSD
  • Ubuntu 24.04 LTS x64

clement2026 avatar Aug 17 '24 09:08 clement2026

:warning: Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 39.13043% with 28 lines in your changes missing coverage. Please review.

Project coverage is 68.89%. Comparing base (62e4433) to head (0003fa2). Report is 34 commits behind head on main.

:exclamation: Current head 0003fa2 differs from pull request most recent head 0e4d412

Please upload reports for the commit 0e4d412 to get more accurate results.

Files with missing lines        Patch %   Lines
client/v3/kubernetes/client.go  0.00%     27 Missing :warning:
server/embed/etcd.go            0.00%     0 Missing and 1 partial :warning:

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
Files with missing lines          Coverage Δ
server/config/config.go           79.76% <ø> (ø)
server/etcdserver/server.go       81.28% <100.00%> (-0.29%) :arrow_down:
server/features/etcd_features.go  60.00% <ø> (ø)
server/storage/mvcc/kvstore.go    89.76% <100.00%> (ø)
server/embed/etcd.go              75.77% <0.00%> (-0.05%) :arrow_down:
client/v3/kubernetes/client.go    0.00% <0.00%> (ø)

... and 22 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #18459      +/-   ##
==========================================
+ Coverage   68.79%   68.89%   +0.10%     
==========================================
  Files         420      420              
  Lines       35490    35482       -8     
==========================================
+ Hits        24416    24446      +30     
+ Misses       9647     9613      -34     
+ Partials     1427     1423       -4     

Continue to review full report in Codecov by Sentry.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 62e4433...0e4d412.

codecov-commenter avatar Aug 17 '24 13:08 codecov-commenter

/ok-to-test

jmhbnz avatar Aug 18 '24 07:08 jmhbnz

/retest pull-etcd-robustness-arm64

clement2026 avatar Aug 18 '24 09:08 clement2026

@clement2026: The /retest command does not accept any targets. The following commands are available to trigger required jobs:

  • /test pull-etcd-build
  • /test pull-etcd-e2e-386
  • /test pull-etcd-e2e-amd64
  • /test pull-etcd-e2e-arm64
  • /test pull-etcd-govulncheck
  • /test pull-etcd-integration-1-cpu-amd64
  • /test pull-etcd-integration-2-cpu-amd64
  • /test pull-etcd-integration-4-cpu-amd64
  • /test pull-etcd-unit-test-386
  • /test pull-etcd-unit-test-amd64
  • /test pull-etcd-unit-test-arm64
  • /test pull-etcd-verify

The following commands are available to trigger optional jobs:

  • /test pull-etcd-robustness-amd64
  • /test pull-etcd-robustness-arm64

Use /test all to run all jobs.

In response to this:

/retest pull-etcd-robustness-arm64

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 18 '24 09:08 k8s-ci-robot

/test pull-etcd-robustness-arm64

clement2026 avatar Aug 18 '24 09:08 clement2026

/test pull-etcd-robustness-arm64

clement2026 avatar Aug 18 '24 10:08 clement2026

Reason for the failure

The robustness test failed at panic("need non-empty snapshot") in the raft library.

This happens when the leader has not yet created any snapshot but raftStorage.ents has already been compacted and no longer contains the entries a slow follower needs. The follower therefore has to be caught up with a snapshot, and since none exists, the leader panics.

https://github.com/etcd-io/raft/blob/ba8d7e340d530ae106118f059d3da4d89f933d23/raft.go#L662C1-L689

// maybeSendSnapshot fetches a snapshot from Storage, and sends it to the given
// node. Returns true iff the snapshot message has been emitted successfully.
func (r *raft) maybeSendSnapshot(to uint64, pr *tracker.Progress) bool {
	if !pr.RecentActive {
		r.logger.Debugf("ignore sending snapshot to %x since it is not recently active", to)
		return false
	}


	snapshot, err := r.raftLog.snapshot()
	if err != nil {
		if err == ErrSnapshotTemporarilyUnavailable {
			r.logger.Debugf("%x failed to send snapshot to %x because snapshot is temporarily unavailable", r.id, to)
			return false
		}
		panic(err) // TODO(bdarnell)
	}
	if IsEmptySnap(snapshot) {
		panic("need non-empty snapshot")
	}
	sindex, sterm := snapshot.Metadata.Index, snapshot.Metadata.Term
	r.logger.Debugf("%x [firstindex: %d, commit: %d] sent snapshot[index: %d, term: %d] to %x [%s]",
		r.id, r.raftLog.firstIndex(), r.raftLog.committed, sindex, sterm, to, pr)
	pr.BecomeSnapshot(sindex)
	r.logger.Debugf("%x paused sending replication messages to %x [%s]", r.id, to, pr)


	r.send(pb.Message{To: to, Type: pb.MsgSnap, Snapshot: &snapshot})
	return true
}
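
To make the failure mode concrete, here is a minimal, assumed reproduction using the raft library's MemoryStorage directly (not the actual etcd server code path): once the log has been compacted but no snapshot has ever been stored, Snapshot() returns an empty snapshot, which is exactly what trips the panic above.

package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
	pb "go.etcd.io/raft/v3/raftpb"
)

func main() {
	ms := raft.NewMemoryStorage()

	// Entries applied by the server, indexes 1..3.
	ms.Append([]pb.Entry{{Index: 1, Term: 1}, {Index: 2, Term: 1}, {Index: 3, Term: 1}})

	// Compact the raft log without ever calling CreateSnapshot, mimicking
	// compaction that runs independently of snapshot creation.
	if err := ms.Compact(2); err != nil {
		panic(err)
	}

	// A follower that still needs index <= 2 can no longer be caught up from
	// the log, so raft falls back to maybeSendSnapshot. The storage holds no
	// snapshot, so the snapshot it returns is empty.
	snap, err := ms.Snapshot()
	fmt.Println(err, raft.IsEmptySnap(snap)) // prints: <nil> true -> "need non-empty snapshot"
}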

Visualization

Here are some images to help us understand the situation better.

Main branch

Compaction only happens right after a snapshot is created.

[diagram: compaction happens only right after a snapshot (main branch)]

This PR

Compaction can happen before a snapshot is created.

[diagram: compaction can happen before a snapshot is created (this PR)]

Solutions

  • Modify the implementation of func (r *raft) maybeSendSnapshot(to uint64, pr *tracker.Progress) bool to send an empty snapshot instead of causing a panic. However, I’m not sure about the expected behavior or potential risks.
  • Ensure a snapshot is created if it doesn't exist when etcd starts up (rough sketch below). This might sound silly, I know 😂
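
For the second option, here is a rough sketch of what a startup hook could look like. Everything except the MemoryStorage APIs is a hypothetical name for illustration, not existing etcd code:

// ensureInitialSnapshot is a hypothetical startup hook: if the raft
// MemoryStorage has never stored a snapshot, record one at the current
// applied index so that maybeSendSnapshot always has something to send.
func ensureInitialSnapshot(rs *raft.MemoryStorage, appliedIndex uint64, cs raftpb.ConfState, data []byte) error {
	snap, err := rs.Snapshot()
	if err != nil {
		return err
	}
	if !raft.IsEmptySnap(snap) {
		return nil // a snapshot already exists, nothing to do
	}
	// CreateSnapshot records a snapshot at appliedIndex inside MemoryStorage.
	// data would hold the serialized backend state (or stay empty as a first
	// iteration); appliedIndex must lie within the storage's current log range.
	_, err = rs.CreateSnapshot(appliedIndex, &cs, data)
	return err
}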

I'm stumped right now. Any ideas would be awesome!

clement2026 avatar Aug 18 '24 21:08 clement2026

/test pull-etcd-robustness-arm64

clement2026 avatar Aug 21 '24 22:08 clement2026

/test pull-etcd-robustness-arm64

clement2026 avatar Aug 22 '24 04:08 clement2026

/test pull-etcd-robustness-arm64

clement2026 avatar Aug 23 '24 07:08 clement2026

Ensure a snapshot is created if it doesn’t exist when etcd starts up. This might sound silly, I know

Sounds reasonable. We are separating the mechanisms for raft log compaction and snapshots, which means we now need to guarantee that a snapshot is always available. Creating an empty snapshot at etcd startup sounds like a good first iteration.

serathius avatar Aug 23 '24 09:08 serathius

Sounds reasonable. We are separating the mechanisms for raft log compaction and snapshots, which means we now need to guarantee that a snapshot is always available. Creating an empty snapshot at etcd startup sounds like a good first iteration.

Thank you for the info! I'll give this a shot and open a new PR soon.

clement2026 avatar Aug 23 '24 09:08 clement2026

@clement2026: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                   Commit                                    Details  Required  Rerun command
pull-etcd-robustness-arm64  0e4d4128185ee6246fc2241bdacdcdec2a543a5a  link     false     /test pull-etcd-robustness-arm64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot avatar Aug 31 '24 19:08 k8s-ci-robot

Compaction can happen before a snapshot is created.

[diagram: compaction can happen before a snapshot is created (this PR)]

The leader sending a snapshot means that it sends the whole bbolt db snapshot; it doesn't care what the follower's current applied index is.

ahrtr avatar Sep 13 '24 13:09 ahrtr

@ahrtr the image isn't loading; it's not set to public: https://private-user-images.githubusercontent.com/27894831/358932866-df2dbdb7-2165-492c-93b8-907e09651fd4.png

The leader sending a snapshot means that it sends the whole bbolt db snapshot; it doesn't care what the follower's current applied index is.

Though the leader doesn't care about the follower, the leader itself can panic before sending out the snapshot.

Before sending a snapshot to a follower, the leader runs maybeSendSnapshot() and may panic here:

https://github.com/etcd-io/raft/blob/ba8d7e340d530ae106118f059d3da4d89f933d23/raft.go#L678-L680

	if IsEmptySnap(snapshot) {
		panic("need non-empty snapshot")
	}

All the leader actually wants is a bbolt db snapshot, not a raft log snapshot. But it seems I can't work around maybeSendSnapshot(), as it's part of the raft mechanism.

clement2026 avatar Sep 13 '24 14:09 clement2026

Though the leader doesn't care about the follower, the leader itself can panic before sending out the snapshot.

Before sending a snapshot to a follower, the leader runs maybeSendSnapshot() and may panic here:

https://github.com/etcd-io/raft/blob/ba8d7e340d530ae106118f059d3da4d89f933d23/raft.go#L678-L680

	if IsEmptySnap(snapshot) {
		panic("need non-empty snapshot")
	}

It's concerning that we're mixing a potential bug with an enhancement. Can you raise a separate issue that clearly explains how to reproduce it?

ahrtr avatar Sep 13 '24 15:09 ahrtr

It's concerning that we're mixing a potential bug with an enhancement. Can you raise a separate issue that clearly explains how to reproduce it?

Sure, I'll create the issue a bit later.

clement2026 avatar Sep 13 '24 16:09 clement2026

After chatting with serathius, I've realized this PR isn’t needed anymore. I'll keep working on https://github.com/etcd-io/etcd/issues/17098 in a new PR. Closing this.

clement2026 avatar Sep 23 '24 13:09 clement2026