Proposal: stability enhancement during overload conditions
Problem Scenario
When an etcd server receives a large burst of write requests and its apply rate cannot keep up (for example, when the Kubernetes API server writes a large number of Event objects, creates a large number of Pods in bulk, or an unexpected mass restart of Pods occurs), the server may return `ErrTooManyRequests` because of the logic below. This is understood as a protective mechanism. However, the current strategy can lead to catastrophic consequences.
```go
func (s *EtcdServer) processInternalRaftRequestOnce(ctx context.Context, r pb.InternalRaftRequest) (*apply2.Result, error) {
	ai := s.getAppliedIndex()
	ci := s.getCommittedIndex()
	if ci > ai+maxGapBetweenApplyAndCommitIndex {
		return nil, errors.ErrTooManyRequests
	}
	// ...
}
```
The Kubernetes API server binds the storage of Event objects to Leases so that the associated Events are automatically cleaned up once the Lease expires. Lease expiration is handled by the etcd leader, which polls the Leases and issues LeaseRevoke requests. If the etcd cluster is in the aforementioned protective state at that time, the LeaseRevoke requests may never get a chance to be executed (because Txn requests far outnumber LeaseRevoke requests), leading to a surge in the number of keys, further degrading the apply rate, and ultimately making etcd completely unavailable.
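For context, this is roughly how a key's lifetime gets bound to a lease on the etcd side (a minimal `clientv3` sketch; the endpoint, TTL, and key are placeholder values): when the lease expires, the leader's LeaseRevoke deletes the attached keys, so if LeaseRevoke cannot be applied, expired keys simply pile up.

```go
package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Grant a lease with a 60s TTL (placeholder; the API server uses its own TTL for Events).
	lease, err := cli.Grant(context.TODO(), 60)
	if err != nil {
		panic(err)
	}

	// Attach the key to the lease. When the lease expires, the leader's LeaseRevoke
	// deletes the key; if LeaseRevoke is starved, the key outlives its TTL.
	_, err = cli.Put(context.TODO(), "/registry/events/default/example", "event-data",
		clientv3.WithLease(lease.ID))
	if err != nil {
		panic(err)
	}
}
```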
The core issue is that the current protection logic does not differentiate between user requests and internal system requests, so it rejects them indiscriminately. When internal system requests cannot be executed, the system state may deteriorate further, eventually causing a system crash. `LeaseRevoke` is just one such internal system request; `Compact` requests have a similar issue.
Proposal
- As discussed in #18175, a configurable `maxGapBetweenApplyAndCommitIndex` would be more flexible.
- Within the protection logic, reserve some queue space specifically for critical requests, so that essential system requests (`LeaseRevoke`, `Compact`) still have a chance to be executed under system pressure, preventing system crashes. Demo code below.
```go
func (s *EtcdServer) processInternalRaftRequestOnce(ctx context.Context, r pb.InternalRaftRequest) (*apply2.Result, error) {
	ai := s.getAppliedIndex()
	ci := s.getCommittedIndex()
	if isTooLargeGap(ai, ci, &r) {
		return nil, errors.ErrTooManyRequests
	}
	// ...
}

func isTooLargeGap(ai, ci uint64, r *pb.InternalRaftRequest) bool {
	isCriticalReq := r != nil && (r.Compact != nil || r.LeaseRevoke != nil)
	// Normal requests are rejected once the gap exceeds the limit.
	if !isCriticalReq {
		return ci > ai+maxGapBetweenApplyAndCommitIndex
	}
	// System-critical requests get a separate buffer of 500 extra entries.
	return ci > ai+maxGapBetweenApplyAndCommitIndex+500
}
```
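To illustrate the intended behavior, here is a minimal table-driven test sketch against `isTooLargeGap`; it assumes the current default `maxGapBetweenApplyAndCommitIndex` of 5000 and the 500-entry buffer from the demo above.

```go
import (
	"testing"

	pb "go.etcd.io/etcd/api/v3/etcdserverpb"
)

func TestIsTooLargeGap(t *testing.T) {
	// Assumes maxGapBetweenApplyAndCommitIndex = 5000 (the current default)
	// and the 500-entry buffer for critical requests from the demo above.
	tests := []struct {
		name     string
		ai, ci   uint64
		req      *pb.InternalRaftRequest
		rejected bool
	}{
		{"normal request at the limit", 0, 5000, &pb.InternalRaftRequest{Put: &pb.PutRequest{}}, false},
		{"normal request over the limit", 0, 5001, &pb.InternalRaftRequest{Put: &pb.PutRequest{}}, true},
		{"lease revoke within the reserved buffer", 0, 5400, &pb.InternalRaftRequest{LeaseRevoke: &pb.LeaseRevokeRequest{}}, false},
		{"lease revoke over the reserved buffer", 0, 5501, &pb.InternalRaftRequest{LeaseRevoke: &pb.LeaseRevokeRequest{}}, true},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := isTooLargeGap(tt.ai, tt.ci, tt.req); got != tt.rejected {
				t.Errorf("isTooLargeGap(%d, %d) = %v, want %v", tt.ai, tt.ci, got, tt.rejected)
			}
		})
	}
}
```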
reading through #18175, maybe we can simply lift the limit by default? 5000 seems very smallish for the amount of throughput etcd can handle nowadays, especially when writing many small key/values.
> reading through #18175, maybe we can simply lift the limit by default? 5000 seems very smallish for the amount of throughput etcd can handle nowadays, especially when writing many small key/values.
Thanks for your reply. It seems difficult to determine a proper value for all kinds of workloads. A larger value would give the etcd server more time to consume the committed log before rejecting all incoming requests. However, if the etcd server cannot speed up applying (for example, while API servers are listing all objects), it will eventually crash because LeaseRevoke never gets a chance to run.
@ahrtr @serathius any suggestions?
> 2. Within the protection logic, reserve some queue space specifically for critical requests, so that essential system requests (`LeaseRevoke`, `Compact`) still have a chance to be executed under system pressure, preventing system crashes. Demo code below.

It seems a valid point to me, also low-hanging fruit. `LeaseRevoke` should be fine, but not sure about `Compact`. Note that etcdserver can't process any other requests during compaction.
> > Within the protection logic, reserve some queue space specifically for critical requests, so that essential system requests (`LeaseRevoke`, `Compact`) still have a chance to be executed under system pressure, preventing system crashes. Demo code below.
>
> It seems a valid point to me, also low-hanging fruit. `LeaseRevoke` should be fine, but not sure about `Compact`. Note that etcdserver can't process any other requests during compaction.
Thanks @ahrtr. I am not sure about `Compact` either. Imagine the etcd server misses several compactions; the outcomes may be:
1. the db size grows quickly, which is not harmful;
2. the treeIndex keeps many obsolete keyIndex entries, which may cause OOM if the situation lasts for a long time.
@ahrtr Should I submit a PR for this? If so, I guess it should be put behind a feature gate.
And what do you think about a configurable `maxGapBetweenApplyAndCommitIndex`?
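For illustration only, here is a self-contained sketch of what a configurable limit could look like; the type, field name, and default below are assumptions for discussion, not existing etcd configuration.

```go
// Hypothetical sketch of a configurable gap limit; the names and wiring are
// assumptions for discussion, not existing etcd configuration.
const defaultMaxGapBetweenApplyAndCommitIndex = 5000

type serverConfig struct {
	// MaxGapBetweenApplyAndCommitIndex of 0 means "use the default".
	MaxGapBetweenApplyAndCommitIndex uint64
}

func (c serverConfig) maxGap() uint64 {
	if c.MaxGapBetweenApplyAndCommitIndex > 0 {
		return c.MaxGapBetweenApplyAndCommitIndex
	}
	return defaultMaxGapBetweenApplyAndCommitIndex
}
```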
Pls bring both topics to our community meeting if possible.
One more comment on making an exception for requests like `LeaseRevoke`: if there is already a big gap between the applied index and the commit index, any following raft requests won't be processed/applied until all the queued requests have been processed. But it might still make sense to make an exception for such requests. Yes, it should be under a feature gate.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.