
Reduce memory usage of etcd member catchup mechanism

Open serathius opened this issue 1 year ago • 25 comments

What would you like to be added?

All requests made to etcd are serialized into a raft entry proto and persisted in the on-disk WAL. That is fine, but to allow slow or disconnected members to catch up, etcd also keeps the last 10,000 entries in raft.InMemoryStorage, fully loaded in memory. In some cases this can cause huge memory bloat in etcd. Imagine a sequence of large put requests (for example 1MB ConfigMaps in Kubernetes): etcd will keep all 10GB of them in memory, doing nothing with it.

This can be reproduced by running ./bin/tools/benchmark put --total=1000 --val-size=1000000 and collecting an inuse_space heap profile.

(Screenshot: inuse_space heap profile)
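For intuition, here is a minimal standalone sketch (an illustration only, not etcd code) of how MemoryStorage from go.etcd.io/raft/v3 holds on to every appended entry until Compact is called. The entry count and payload size mirror the benchmark command above, and keepLast is an arbitrary margin picked for the demo:

```go
package main

import (
	"fmt"
	"log"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

func main() {
	ms := raft.NewMemoryStorage()

	// Append 1,000 entries with ~1MB payloads, mimicking the benchmark's
	// puts. Every entry stays in memory until the storage is compacted.
	for i := uint64(1); i <= 1000; i++ {
		entry := raftpb.Entry{Term: 1, Index: i, Data: make([]byte, 1_000_000)}
		if err := ms.Append([]raftpb.Entry{entry}); err != nil {
			log.Fatal(err)
		}
	}

	first, _ := ms.FirstIndex()
	last, _ := ms.LastIndex()
	fmt.Printf("retained: %d entries (~%d MB of payload)\n", last-first+1, last-first+1)

	// Compact discards all entries before the given index, letting the Go
	// runtime reclaim their payloads. keepLast is an arbitrary demo margin.
	const keepLast = 100
	if err := ms.Compact(last - keepLast); err != nil {
		log.Fatal(err)
	}

	first, _ = ms.FirstIndex()
	fmt.Printf("after Compact: %d entries retained\n", last-first+1)
}
```

Until Compact runs, the payloads stay reachable from MemoryStorage, which is what the inuse_space profile above is showing.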

The mechanism is really dumb and could benefit from the following improvements:

  • Compact raft.InMemoryStorage after every apply instead of once per snapshot (10,000 entries is the default snapshot frequency). With the removal of v2storage we can switch to using the applied index instead of the snapshot index. Since the applied index is updated more frequently, we can execute Compact more often, possibly after every apply, assuming it's not too costly.
  • Compact raft.InMemoryStorage based on the state of the slowest member. Why keep 5,000 entries (the default catchup entries) in a 1-node cluster, or when all members are up to date? We could read the state of the slowest member and Compact based on that (a rough sketch follows this list).
  • Tune the default snapshot catchup entries (5,000 entries). The current value is based on 1ms latency and 10k throughput https://github.com/etcd-io/etcd/pull/2403/files. It would be good to revisit and tune it, for example by comparing catchup times and availability for different WAL entry sizes, DB file sizes, latencies, network throughputs, etc.
  • Change the semantics of catchup-entries from "we always store at least X entries" to "we store entries only for members that are behind by at most X entries". If a member is behind by more than X entries, it no longer makes sense to use raft entries to sync it; we will use a snapshot instead, so why keep those entries?
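To make the slowest-member idea concrete, here is a hypothetical sketch; compactToSlowestMember is an invented name, st would come from the raft node's Status(), and nothing like this exists in etcd today:

```go
package raftcompact

import "go.etcd.io/raft/v3"

// compactToSlowestMember compacts the in-memory raft log up to the lowest
// Match index across all members, so entries are retained only for peers
// that still need them. Hypothetical helper, not part of etcd.
func compactToSlowestMember(ms *raft.MemoryStorage, st raft.Status, applied uint64) error {
	// raft.Status only carries per-member Progress on the leader.
	if len(st.Progress) == 0 {
		return nil
	}
	compactTo := applied
	for _, pr := range st.Progress {
		if pr.Match < compactTo {
			compactTo = pr.Match // keep what the slowest member still needs
		}
	}
	if compactTo == 0 {
		// A member has not acknowledged anything yet; it will need a
		// snapshot anyway, so leave the log alone here.
		return nil
	}
	if err := ms.Compact(compactTo); err != nil && err != raft.ErrCompacted {
		return err
	}
	return nil
}
```

In a 1-node cluster, or when all followers are caught up, compactTo ends up at (or near) the applied index, so almost nothing extra is retained; only a genuinely slow member forces entries to be kept.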

Why is this needed?

Prevent etcd memory bloating and make memory usage more predictable.

serathius avatar Dec 11 '23 22:12 serathius

cc @ahrtr

serathius avatar Dec 12 '23 12:12 serathius

I am interested in helping. But I am not sure I know exactly what needs to be done. Could I shadow someone or get some guidance if I were to attempt this? @serathius

moficodes avatar Dec 12 '23 15:12 moficodes

There are 4 different changes proposed. Looking at the first one, Compact raft.InMemoryStorage after every apply instead of once every snapshot, at a high level what we would want to do is (a rough sketch follows the list):

  • Move the RaftLog compaction logic https://github.com/etcd-io/etcd/blob/dfdffe48f9e7c622f1b863d013754c2a824f6d35/server/etcdserver/server.go#L2097-L2125 to a separate function compactRaftLog
  • Use etcdProgress.appliedi instead of snapi to decide compacti
  • Call the function from applyAll https://github.com/etcd-io/etcd/blob/dfdffe48f9e7c622f1b863d013754c2a824f6d35/server/etcdserver/server.go#L900-L920 after triggerSnapshot
  • Benchmark the result to confirm the performance impact
  • Possibly limit the calls to compactRaftLog to once per 100 or 1,000 entries to avoid copying the raft log too frequently
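A rough sketch of what such a compactRaftLog could look like, written as a standalone function rather than a patch against server.go; catchUpEntries stands in for the server's snapshot catchup entries setting and compactEvery is the optional rate limit from the last bullet:

```go
package raftcompact

import "go.etcd.io/raft/v3"

// compactRaftLog compacts the in-memory raft log based on the applied index
// rather than the snapshot index, keeping catchUpEntries for slow followers
// and running at most once every compactEvery applied entries.
func compactRaftLog(raftStorage *raft.MemoryStorage, appliedi, catchUpEntries, compactEvery uint64) error {
	// Rate-limit compaction so the raft log is not copied on every apply.
	if compactEvery != 0 && appliedi%compactEvery != 0 {
		return nil
	}
	// Keep the last catchUpEntries so slow followers can still catch up
	// from the log instead of requiring a full snapshot.
	if appliedi <= catchUpEntries {
		return nil
	}
	compacti := appliedi - catchUpEntries
	if err := raftStorage.Compact(compacti); err != nil && err != raft.ErrCompacted {
		return err
	}
	return nil
}
```

In etcd itself this would be called from applyAll after triggerSnapshot, with appliedi taken from etcdProgress, per the steps above.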

serathius avatar Dec 13 '23 20:12 serathius

/assign

tangwz avatar Dec 25 '23 14:12 tangwz

There are 4 different changes proposed. Looking at the first one, Compact raft.InMemoryStorage after every apply instead of once every snapshot, at a high level what we would want to do is:

  • Move the RaftLog compaction logic https://github.com/etcd-io/etcd/blob/dfdffe48f9e7c622f1b863d013754c2a824f6d35/server/etcdserver/server.go#L2097-L2125 to a separate function compactRaftLog
  • Use etcdProgress.appliedi instead of snapi to decide compacti
  • Call the function from applyAll https://github.com/etcd-io/etcd/blob/dfdffe48f9e7c622f1b863d013754c2a824f6d35/server/etcdserver/server.go#L900-L920 after triggerSnapshot
  • Benchmark the result to confirm the performance impact
  • Possibly limit the calls to compactRaftLog to once per 100 or 1,000 entries to avoid copying the raft log too frequently

@serathius this snapi is already etcdProgress.appliedi, since triggerSnapshot(ep *etcdProgress) sets it from appliedi:

https://github.com/etcd-io/etcd/blob/main/server/etcdserver/server.go#L1168-L1185

tangwz avatar Jan 05 '24 14:01 tangwz

@tangwz Are you still planning to work on this?

serathius avatar Apr 03 '24 09:04 serathius

Hi @serathius , I could give this a shot, but I would like to understand the proposed changes a little better. Are all 4 changes necessary to address the problem? It seems like the first change causes etcd to compact the log more frequently, but users can already tune the max length of the log by setting SnapshotCount to something lower (see https://github.com/etcd-io/etcd/blob/557e7f09df7f71e586e64b031043d57246842138/server/etcdserver/server.go#L1191). The code appears to already compact the log right after snapshotting, as you pasted above.

It sounds like change 2 by itself would address the problem in the common case where followers are able to keep up.

Together, changes 3 and 4 sound like they would also address the problem in a different way. If we took the approach of change 3, and reduced SnapshotCatchUpEntries to be too small, does the existing code already send snapshots instead of entries to a follower who has fallen behind?

yipal avatar Apr 03 '24 22:04 yipal

Are all 4 changes necessary to address the problem?

I don't have the full context as some time has passed since I created the issue, but I still think we need all the changes: the first to make compaction more frequent, the second to improve the case where all members are healthy, the third to pick the best memory / time-to-recovery tradeoff, and the fourth to better handle the case where one member is fully down. Feel free to add your own suggestions or ask more questions. The best way to reach me is on the Kubernetes Slack https://github.com/etcd-io/etcd?tab=readme-ov-file#contact

It seems like the first change causes etcd to compact the log more frequently, but users can already tune the max length of the log by setting SnapshotCount to something lower.

The first change is not about making it lower; it's about making InMemoryStorage compaction independent of snapshots.

If we took the approach of change 3, and reduced SnapshotCatchUpEntries to be too small, does the existing code already send snapshots instead of entries to a follower who has fallen behind?

Yes
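For reference, this comes from the raft storage contract: once entries have been compacted away, Storage.Entries reports ErrCompacted for them and the leader falls back to sending a snapshot to that follower. A tiny illustration (not etcd code):

```go
package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

func main() {
	ms := raft.NewMemoryStorage()
	for i := uint64(1); i <= 100; i++ {
		if err := ms.Append([]raftpb.Entry{{Term: 1, Index: i}}); err != nil {
			panic(err)
		}
	}
	// Discard everything before index 50.
	if err := ms.Compact(50); err != nil {
		panic(err)
	}

	// A follower that still needs index 10 can no longer be served from
	// the log; the leader would have to send it a snapshot instead.
	_, err := ms.Entries(10, 20, 1<<20)
	fmt.Println(err) // "requested index is unavailable due to compaction"
}
```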

serathius avatar Apr 04 '24 08:04 serathius

/assign @clement2026

serathius avatar Jul 04 '24 12:07 serathius

@serathius: GitHub didn't allow me to assign the following users: clement2026.

Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/assign @clement2026

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 04 '24 12:07 k8s-ci-robot