
feature: add new compactor based on revision count

Open fuweid opened this issue 2 years ago • 9 comments

What would you like to be added?

Add a new compactor based on revision count, instead of a fixed time interval.

To make this happen, the mvcc store needs to export a CompactNotify function to notify the compactor that the configured number of write transactions has occurred since the previous compaction. The new compactor can then pick up the revision change and delete out-of-date data promptly, instead of waiting for a fixed interval to elapse, so the underlying bbolt db can reuse the freed pages as soon as possible.
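To make the idea concrete, here is a minimal sketch of how the hook and the new compactor mode could fit together. The interface and field names below (Notifier, RevGetter, Compactable, retention) are illustrative assumptions, not the actual code in the demo PR:

```go
package compactor

import (
	"context"
	"log"
)

// Notifier is what the mvcc store would export: CompactNotify returns a
// channel that fires once the configured number of write transactions has
// happened since the previous compaction.
type Notifier interface {
	CompactNotify() <-chan struct{}
}

// RevGetter reports the current revision of the store.
type RevGetter interface {
	Rev() int64
}

// Compactable issues the actual compaction.
type Compactable interface {
	Compact(ctx context.Context, rev int64) error
}

// RevisionCountCompactor compacts whenever it is notified, always keeping the
// most recent `retention` revisions.
type RevisionCountCompactor struct {
	ctx       context.Context
	notifier  Notifier
	rg        RevGetter
	c         Compactable
	retention int64
}

func (rc *RevisionCountCompactor) Run() {
	for {
		select {
		case <-rc.ctx.Done():
			return
		case <-rc.notifier.CompactNotify():
			// Compact everything older than the retention window.
			rev := rc.rg.Rev() - rc.retention
			if rev <= 0 {
				continue
			}
			if err := rc.c.Compact(rc.ctx, rev); err != nil {
				log.Printf("revision-count compaction failed: %v", err)
			}
		}
	}
}
```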

Why is this needed?

In a Kubernetes cluster running Argo Workflows, for instance, there will be batches of requests to create pods, followed by a lot of PATCH requests for pod status, especially when a pod has more than 3 containers. If such a burst of requests increases the db size in a short time, it is easy to exceed the max quota size, and the cluster admin then has to get involved to defrag, which may cause long downtime. So we hope ETCD can delete the out-of-date data as soon as possible and slow down the growth of the total db size.

Currently, both the revision and periodic compactors are driven by time, and a fixed interval does not cope well with unexpected bursts of update requests. The new compactor based on revision count can make the admin's life easier. For instance, say the average object size is 50 KiB and the new compactor is configured to compact every 10,000 revisions. Then ETCD effectively compacts after roughly 500 MiB of new data has been written, no matter how long it takes to accumulate those 10,000 revisions, so it handles bursts of update requests well.

There are some test results:

  • Fixed value size: 10 KiB, Update Rate: 100/s, Total key space: 3,000

benchmark put --rate=100 --total=300000 --compact-interval=0 \
  --key-space-size=3000 --key-size=256 --val-size=10240

| Compactor                      | DB Total Size | DB InUse Size |
|--------------------------------|---------------|---------------|
| Revision(5min,retention:10000) | 570 MiB       | 208 MiB       |
| Periodic(1m)                   | 232 MiB       | 165 MiB       |
| Periodic(30s)                  | 151 MiB       | 127 MiB       |
| NewRevision(retention:10000)   | 195 MiB       | 187 MiB       |
  • Random value size: [9 KiB, 11 KiB], Update Rate: 150/s, Total key space: 3,000

benchmark put --rate=150 --total=300000 --compact-interval=0 \
  --key-space-size=3000 --key-size=256 --val-size=10240 \
  --delta-val-size=1024

| Compactor                      | DB Total Size | DB InUse Size |
|--------------------------------|---------------|---------------|
| Revision(5min,retention:10000) | 718 MiB       | 554 MiB       |
| Periodic(1m)                   | 297 MiB       | 246 MiB       |
| Periodic(30s)                  | 185 MiB       | 146 MiB       |
| NewRevision(retention:10000)   | 186 MiB       | 178 MiB       |
  • Random value size: [6 KiB, 14 KiB], Update Rate: 200/s, Total key space: 3,000

benchmark put --rate=200 --total=300000 --compact-interval=0 \
  --key-space-size=3000 --key-size=256 --val-size=10240 \
  --delta-val-size=4096

| Compactor                      | DB Total Size | DB InUse Size |
|--------------------------------|---------------|---------------|
| Revision(5min,retention:10000) | 874 MiB       | 221 MiB       |
| Periodic(1m)                   | 357 MiB       | 260 MiB       |
| Periodic(30s)                  | 215 MiB       | 151 MiB       |
| NewRevision(retention:10000)   | 182 MiB       | 176 MiB       |

For burst requests we need to use a short periodic interval, otherwise the total size grows large. I think the new compactor can handle this well, and the cluster admin can easily configure it based on the payload size.

Additional Change:

Currently, the quota system only checks the DB total size. However, there could be a lot of free pages that can be reused by upcoming requests. Based on this proposal, I also want to extend the current quota system to consider the DB's InUse size.

If the InUse size is less than the max quota size, we should allow update requests. Since bbolt might still be resized when there are no contiguous free pages available, we should set a hard limit for the overflow, for example 1 GiB.

 // Quota represents an arbitrary quota against arbitrary requests. Each request
@@ -130,7 +134,17 @@ func (b *BackendQuota) Available(v interface{}) bool {
                return true
        }
        // TODO: maybe optimize Backend.Size()
-       return b.be.Size()+int64(cost) < b.maxBackendBytes
+
+       // Since the compact comes with allocatable pages, we should check the
+       // SizeInUse first. If there is no continuous pages for key/value and
+       // the boltdb continues to resize, it should not increase more than 1
+       // GiB. It's hard limitation.
+       //
+       // TODO: It should be enabled by flag.
+       if b.be.Size()+int64(cost)-b.maxBackendBytes >= maxAllowedOverflowBytes(b.maxBackendBytes) {
+               return false
+       }
+       return b.be.SizeInUse()+int64(cost) < b.maxBackendBytes
 }
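The maxAllowedOverflowBytes helper referenced above is not shown in the snippet. A minimal sketch, assuming the flat 1 GiB hard limit described earlier (the real demo PR may compute it differently, e.g. as a fraction of the quota):

```go
// maxAllowedOverflowBytes caps how far the physical bbolt file may grow beyond
// the configured quota while requests are still admitted based on SizeInUse.
// Sketch only: a flat 1 GiB cap; the quota is passed in so a proportional cap
// could be substituted later.
func maxAllowedOverflowBytes(maxBackendBytes int64) int64 {
	const hardCapBytes = int64(1) << 30 // 1 GiB
	return hardCapBytes
}
```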

It may also be possible to clear the NO_SPACE alarm automatically when compaction frees enough pages, which would reduce downtime.

Demo: https://github.com/etcd-io/etcd/pull/16427

cc @ahrtr @serathius @wenjiaswe @jmhbnz @chaochn47

fuweid avatar Aug 16 '23 15:08 fuweid

The existing CompactorModeRevision can also meet your requirement. We just need to set the same value as the revisionThreshold in your case.

However, there could be a lot of free pages that can be reused by upcoming requests. Based on this proposal, I also want to extend the current quota system to consider the DB's InUse size.

I am afraid it isn't correct. Free pages can't be reused before they are reclaimed (by defragmentation).

ahrtr avatar Aug 16 '23 18:08 ahrtr

Thanks @ahrtr for the quick comment.

The existing CompactorModeRevision can also meet your requirement. We just need to set the same value as the revisionThreshold in your case.

The fact is that it does not. That compactor only checks every 5 minutes, and only compacts if enough revisions have accumulated since the last run. That's why I made this proposal.

https://github.com/etcd-io/etcd/blob/0d89fa73362f2dbce5ba843e3ac10b16c20d7ab9/server/etcdserver/api/v3compactor/revision.go#L61-L78
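In simplified, paraphrased form (not a verbatim copy; currentRev, compact, retention, and prev stand in for the fields used in the linked code), that loop looks roughly like this, which is why a burst of writes inside the 5-minute window is not compacted until the next tick:

```go
// Paraphrase of the existing revision compactor's loop: it only wakes up on a
// fixed 5-minute tick, then checks whether there is anything new to compact.
for {
	select {
	case <-ctx.Done():
		return
	case <-clock.After(5 * time.Minute):
	}

	rev := currentRev() - retention // keep the most recent `retention` revisions
	if rev <= 0 || rev == prev {
		continue // nothing new enough to compact yet
	}
	if err := compact(rev); err == nil {
		prev = rev
	}
}
```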

I am afraid it isn't correct. Free pages can't be reused before they are reclaimed (by defragmentation).

If I understand correctly, defragmentation replays all the key/values into a new db file. The defrag function just walks all the buckets and copies every key into the new file; it doesn't delete anything, whereas the compactor does. Defrag is used to reduce the total db size when there are a lot of free pages.

https://github.com/etcd-io/etcd/blob/0d89fa73362f2dbce5ba843e3ac10b16c20d7ab9/server/storage/backend/backend.go#L563-L620
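Conceptually (this is not etcd's actual defrag code, just an illustration with the go.etcd.io/bbolt API, assuming flat buckets as etcd uses), defragmentation is a bucket-by-bucket copy into a fresh file, so only free pages disappear and no revisions are removed:

```go
package defragsketch

import bolt "go.etcd.io/bbolt"

// copyAll copies every bucket and key/value pair from src into dst. This is
// what makes the defragmented file smaller: free pages are simply not copied,
// while all live data is preserved.
func copyAll(src, dst *bolt.DB) error {
	return src.View(func(srcTx *bolt.Tx) error {
		return dst.Update(func(dstTx *bolt.Tx) error {
			return srcTx.ForEach(func(name []byte, b *bolt.Bucket) error {
				dstB, err := dstTx.CreateBucketIfNotExists(name)
				if err != nil {
					return err
				}
				return b.ForEach(func(k, v []byte) error {
					return dstB.Put(k, v)
				})
			})
		})
	})
}
```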

For example, set up a single ETCD server from scratch with the auto compactor disabled.

  • step 1: use the following command to ingest 10,000 revisions.
benchmark put --rate=100 --total=10000 --compact-interval=0 --key-space-size=3000 --key-size=256 --val-size=10240

We get a total size of 136 MB and an InUse size of 123 MB:

etcdctl endpoint status dc1e6cd4c757f755 -w table
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        |    VERSION    | STORAGE VERSION | DB SIZE | DB SIZE IN USE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | dc1e6cd4c757f755 | 3.6.0-alpha.0 |           3.6.0 |  136 MB |         123 MB |      true |      false |         2 |      10004 |              10004 |        |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+

  • step 2: use defrag

There are only a few MB of free pages, so the total size only drops to 123 MB.

 etcdctl defrag dc1e6cd4c757f755 -w table
Finished defragmenting etcd member[127.0.0.1:2379]. took 2.995811265s

etcdctl endpoint status dc1e6cd4c757f755 -w table
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        |    VERSION    | STORAGE VERSION | DB SIZE | DB SIZE IN USE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | dc1e6cd4c757f755 | 3.6.0-alpha.0 |           3.6.0 |  123 MB |         123 MB |      true |      false |         2 |      10004 |              10004 |        |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+

That's expected, because defrag doesn't delete anything; it just replays all the key/values.

  • step 3: use compact
etcdctl get foo -w json
{"header":{"cluster_id":5358838169441251993,"member_id":15861234598778763093,"revision":10001,"raft_term":2}}


etcdctl compact 10001
compacted revision 10001

 etcdctl endpoint status dc1e6cd4c757f755 -w table
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        |    VERSION    | STORAGE VERSION | DB SIZE | DB SIZE IN USE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | dc1e6cd4c757f755 | 3.6.0-alpha.0 |           3.6.0 |  124 MB |          36 MB |      true |      false |         2 |      10005 |              10005 |        |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+


After compaction, there are about 88 MB of free pages, and those pages can actually be reused.

  • step 4: ingest 1,000 new revisions

The InUse size increases from 36 MB to 48 MB while the total size stays at 124 MB. NOTE: if there were not enough contiguous free pages, the total size would grow instead.

benchmark put --rate=100 --total=1000 --compact-interval=0 --key-space-size=3000 --key-size=128 --val-size=10240

etcdctl endpoint status dc1e6cd4c757f755 -w table
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        |    VERSION    | STORAGE VERSION | DB SIZE | DB SIZE IN USE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | dc1e6cd4c757f755 | 3.6.0-alpha.0 |           3.6.0 |  124 MB |          48 MB |      true |      false |         2 |      11005 |              11005 |        |
+----------------+------------------+---------------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+


The compactor deletes out-of-date revisions, and the freed pages can be reused. As described in my first comment, a compactor that deletes out-of-date revisions promptly can slow down the growth rate of the total size.

Hope this comment makes it clear. :)

fuweid avatar Aug 17 '23 06:08 fuweid

Not super convinced, but overall I don't understand the need for revision-based compaction, so I might be wrong.

There is a tradeoff between predictability of the compaction window and storage size, but I don't know if anyone would prefer storage size, especially for Kubernetes. If there is a burst of changes, you especially don't want to compact too fast because it risks clients not being able to catch up. Fast compaction leads to unsynced clients that need to download the whole state again. That doesn't seem like a good change from a scalability standpoint.

cc @mborsz @wojtek-t as for opinion about revision based compaction.

serathius avatar Aug 17 '23 10:08 serathius

@fuweid sorry for not making it clear.

  • One of the key points is not to compact the most recent revisions too quickly, just as @serathius mentioned above.
  • Yes, it's correct that the free pages in the bbolt db can be reused.
  • With regard to the performance issue (e.g. OOM) caused by burst traffic, there are already a couple of related discussions (unfortunately we have not had time to take care of them so far. @Geetasg)
  • I am not worried about the db size as long as the free pages can be reused later. FYI: https://github.com/etcd-io/bbolt/issues/422

ahrtr avatar Aug 17 '23 11:08 ahrtr

Thanks for the review @serathius @ahrtr

The pain point (at least to me) is that bursty traffic is unpredictable and there is no watchable alarm API (if I understand correctly) to notify the ETCD operator or admin to compact in time and defrag when the DB size exceeds the quota in a short period. The ETCD cluster then runs in read-only mode, which is effectively an availability issue.

So I am seeking a way to compact old revisions in time; it's a kind of revision garbage collector. The solution is to introduce a compactor based on revision count, which I have been thinking about and which is doable. It can be configured according to the cluster's object size.

you especially don't want to compact too fast because it risks clients not being able to catch up.

Yes. For that issue, we can consider introducing catch-up-revisions to keep new revisions and avoid too many relist calls. Does that make sense to you?

With regard to the performance issue (e.g. OOM) caused by burst traffic, there are already a couple of related discussion

Yes. Bursts of list-all-of-a-huge-dataset traffic are a nightmare.

I am not worry about the db size as long as the free pages can be reused later. FYI https://github.com/etcd-io/bbolt/issues/422

Yes. But that requires us to GC the old revisions in time so the pages are reclaimed. Otherwise the total db size exceeds the quota, ETCD raises the NO_SPACE alarm and goes read-only.

Based on that, maybe the quota availability checker should consider the InUseSize first and only then the total size, since after compaction there will be free pages that can be reused.

Looking forward to your feedback.

fuweid avatar Aug 17 '23 13:08 fuweid

The pain point (at least to me) is that bursty traffic is unpredictable and there is no watchable alarm API (if I understand correctly) to notify the ETCD operator or admin to compact in time and defrag when the DB size exceeds the quota in a short period. The ETCD cluster then runs in read-only mode, which is effectively an availability issue.

Not sure I understand your motivation. It's normal to overprovision disk. What kind of burst are you talking about? Are you expecting the data size to double in 5 minutes?

serathius avatar Aug 18 '23 08:08 serathius

What kind of burst are you talking about? Are you expecting the data size to double in 5 minutes?

The burst is about too many PUT requests to kube-apiserver. I was facing issues where Argo Workflows submitted cronjobs and the ETCD total db size increased by 2-3 GiB within 5 minutes. There were about 3,000 ~ 6,000 pods, each with one init-container and two containers, and each pod caused at least 5 PUT requests to ETCD. Most of the pods were short-lived, but the workflow jobs kept creating/deleting/re-creating them for hours. If compaction doesn't happen in time, bbolt expands very fast and exceeds the quota.

Not sure I understand your motivation. It's normal to overprovision disk.

Sorry for the unclear comment. My motivation is to have ETCD clear old revisions as soon as possible so it can reuse the freed pages, which reduces the downtime caused by the NO_SPACE alarm.

Currently, once the ETCD server detects that the bbolt db size exceeds the max quota, it raises the NO_SPACE alarm and goes into read-only mode. Even if the compactor reclaims space with enough contiguous pages to serve most upcoming requests, the server still denies them. The server needs a defragmentation to bring the db size below the max quota, followed by disarming the NO_SPACE alarm.

However, there is no API, like Watch for key/value changes, to notify the operator or admin about the NO_SPACE alarm; the operator component has to poll the Alarm list. To reduce the downtime caused by NO_SPACE, there are only two options with existing ETCD releases:

  • Shorten the compaction interval
  • Shorten the interval for polling the NO_SPACE alarm

So I make this proposal to compact old revisions in time and keep the DB total size smaller than the max quota size.

Besides this proposal, maybe we can consider using the In-Use size, instead of the bbolt total db size, as the current usage: ETCD would go into read-only mode only when the In-Use size exceeds the quota, because the total db size doesn't represent the real usage. The quota availability checker would deny an update request when In-Use-Size + cost > max-quota-size. That way the ETCD server can return to normal from NO_SPACE after compaction, even when burst requests flood the server.

fuweid avatar Aug 18 '23 10:08 fuweid

We're also seeing those bursty workloads more and more often, e.g. with Argo and batch-style operations.

So I am seeking a way to compact old revisions in time; it's a kind of revision garbage collector.

I wonder if we can make the MVCC store and bbolt more "generational". What if we were to shard bbolt by revision range? I think this would also help with the fragmentation and the slowness we observe when writing into an almost-full bbolt, as new writes would always go to a new bbolt database.

Alternatively, have one "permgen"-style bbolt file where we keep historical revisions that don't change, while new writes go into a "newgen". We would occasionally copy older revisions into the permgen and then recycle the newgen once its fragmentation gets too big.

Admittedly, we would lose the out-of-the-box transaction support of a single bbolt file, which is really neat.

tjungblu avatar May 30 '24 09:05 tjungblu

+1

lance5890 avatar Jun 18 '24 05:06 lance5890

This feature looks pretty useful from a cluster administrator's perspective.

We can keep the steady-state 5-minute event window for watchers, but when the db size / db size in use gets close to the quota, say 80%, enable the revision-based compactor, since its trigger is pretty responsive.
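A tiny sketch of that trigger condition (the function name and the 80% threshold are just the example from this comment, not existing etcd code):

```go
// shouldEnableRevisionCompactor reports whether the backend is close enough to
// quota (e.g. 80%) that the more responsive revision-count compactor should
// take over from the periodic one.
func shouldEnableRevisionCompactor(dbSizeInUse, quotaBytes int64) bool {
	const threshold = 0.8
	return float64(dbSizeInUse) >= threshold*float64(quotaBytes)
}
```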

Also supportive of changing the NO_SPACE alarm to fire based on db size in use.

ref. AWS posted a blog about managing db size, with quite a few examples / use cases showing why this feature could be useful for etcd admins: https://aws.amazon.com/blogs/containers/managing-etcd-database-size-on-amazon-eks-clusters/

chaochn47 avatar Mar 05 '25 18:03 chaochn47

Is it possible to provide multiple modes for compaction? We don't have to change the default behavior of interval-based compaction, but this feature seems to be desirable for some important customers. How about adding the feature behind a feature gate? The whole idea of feature gates is to make it easier to experiment with new features.

On the other hand, if we want to proceed with this new feature, can we get more contribution commitment from AWS or other customers, so that the extra work and maintenance effort does not result in a net increase of workload for existing maintainers and OSS contributors?

siyuanfoundation avatar Mar 05 '25 19:03 siyuanfoundation

Is it possible to provide multiple modes for compaction? We don't have to change the default behavior of interval-based compaction, but this feature seems to be desirable for some important customers. How about adding the feature behind a feature gate? The whole idea of feature gates is to make it easier to experiment with new features.

I was thinking we could go with adding a new mode and a feature gate for it.

I can raise it again in the community meeting once we release v3.6.

fuweid avatar Mar 06 '25 01:03 fuweid

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 08 '25 00:06 github-actions[bot]

/cc

hwdef avatar Jun 09 '25 03:06 hwdef

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 09 '25 00:08 github-actions[bot]

Closed due to no bandwidth. Feel free to reopen it if you want to carry it forward.

fuweid avatar Aug 28 '25 15:08 fuweid

Hi @fuweid, I am quite interested in this idea. I have one concern.

to introduce catch-up-revisions to keep new revisions and avoid too many relist calls.

By catch-up-revisions, did you mean picking an appropriate value for the revision count? Or did you mean recording the slowest watcher's minRev as catch-up-revisions and forcing the compaction revision to stay below it? If so, the slowest watcher may keep compaction from happening in time, leading to a burst of db size.

I'm maintaining a huge etcd cluster. Its in-use data is about 50 GB and its total data is near 100 GB. Less than 50% page usage with such a big db size is a headache for me. I plan to test this new compactor to see if it could make better use of pages.

BTW, I am not supportive of using the in-use db size for the disk quota, because admins should be able to limit the total db size. A big db size causes big trouble for maintaining a cluster and may have negative effects on bbolt performance. For example, when adding a new member to a big cluster, the new member may not be able to catch up with the leader after downloading a big snapshot db.

I'm also looking at other ways to keep the db small, e.g. compressing values before writing them to boltdb.

silentred avatar Nov 28 '25 04:11 silentred