Proposal: stability enhancement during overload conditions
Problem Scenario
When an etcd server receives a large burst of write requests and its apply rate cannot keep up (for example, when the Kubernetes API server writes a large number of Event objects, creates a large number of Pods in bulk, or an unexpected mass restart of Pods occurs), the server may return `ErrTooManyRequests` because of the logic below. This is understood as a protective mechanism. However, the current strategy can lead to catastrophic consequences.
```go
func (s *EtcdServer) processInternalRaftRequestOnce(ctx context.Context, r pb.InternalRaftRequest) (*apply2.Result, error) {
	ai := s.getAppliedIndex()
	ci := s.getCommittedIndex()
	if ci > ai+maxGapBetweenApplyAndCommitIndex {
		return nil, errors.ErrTooManyRequests
	}
	// ...
}
```
The Kubernetes API server binds the storage of Event objects to Leases so that the associated Events are automatically cleaned up once the Lease expires. Lease expiration is handled by the etcd leader, which polls the Leases and issues LeaseRevoke requests. If the etcd cluster is in the aforementioned protective state at that time, the LeaseRevoke requests may never get a chance to be executed (because Txn requests far outnumber LeaseRevoke requests), leading to a surge in the number of keys, further degrading the apply rate, and ultimately making etcd completely unavailable.
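For context, this is roughly how a key's lifetime gets bound to a lease on the etcd side (a minimal `clientv3` sketch; the endpoint, TTL, and key are placeholder values): when the lease expires, the leader's LeaseRevoke deletes the attached keys, so if LeaseRevoke cannot be applied, expired keys simply pile up.

```go
package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Grant a lease with a 60s TTL (placeholder; the API server uses its own TTL for Events).
	lease, err := cli.Grant(context.TODO(), 60)
	if err != nil {
		panic(err)
	}

	// Attach the key to the lease. When the lease expires, the leader's LeaseRevoke
	// deletes the key; if LeaseRevoke is starved, the key outlives its TTL.
	_, err = cli.Put(context.TODO(), "/registry/events/default/example", "event-data",
		clientv3.WithLease(lease.ID))
	if err != nil {
		panic(err)
	}
}
```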
The core issue is that the current protection logic does not differentiate between user requests and internal system requests, so it rejects them indiscriminately. When internal system requests cannot be executed, the system state may deteriorate further, eventually causing a system crash. `LeaseRevoke` is just one such internal system request; `Compact` requests have a similar issue.
Proposal
- As discussed in #18175, a configurable `maxGapBetweenApplyAndCommitIndex` would be more flexible.
- Within the protection logic, reserve some queue space specifically for critical requests, so that essential system requests (`LeaseRevoke`, `Compact`) still have a chance to be executed under system pressure, preventing system crashes. Demo code below.
```go
func (s *EtcdServer) processInternalRaftRequestOnce(ctx context.Context, r pb.InternalRaftRequest) (*apply2.Result, error) {
	ai := s.getAppliedIndex()
	ci := s.getCommittedIndex()
	if isTooLargeGap(ai, ci, &r) {
		return nil, errors.ErrTooManyRequests
	}
	// ...
}

func isTooLargeGap(ai, ci uint64, r *pb.InternalRaftRequest) bool {
	isCriticalReq := r != nil && (r.Compact != nil || r.LeaseRevoke != nil)
	// Normal requests are rejected once the gap exceeds the limit.
	if !isCriticalReq {
		return ci > ai+maxGapBetweenApplyAndCommitIndex
	}
	// System-critical requests get a separate buffer of 500 extra entries.
	return ci > ai+maxGapBetweenApplyAndCommitIndex+500
}
```
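To illustrate the intended behavior, here is a minimal table-driven test sketch against `isTooLargeGap`; it assumes the current default `maxGapBetweenApplyAndCommitIndex` of 5000 and the 500-entry buffer from the demo above.

```go
import (
	"testing"

	pb "go.etcd.io/etcd/api/v3/etcdserverpb"
)

func TestIsTooLargeGap(t *testing.T) {
	// Assumes maxGapBetweenApplyAndCommitIndex = 5000 (the current default)
	// and the 500-entry buffer for critical requests from the demo above.
	tests := []struct {
		name     string
		ai, ci   uint64
		req      *pb.InternalRaftRequest
		rejected bool
	}{
		{"normal request at the limit", 0, 5000, &pb.InternalRaftRequest{Put: &pb.PutRequest{}}, false},
		{"normal request over the limit", 0, 5001, &pb.InternalRaftRequest{Put: &pb.PutRequest{}}, true},
		{"lease revoke within the reserved buffer", 0, 5400, &pb.InternalRaftRequest{LeaseRevoke: &pb.LeaseRevokeRequest{}}, false},
		{"lease revoke over the reserved buffer", 0, 5501, &pb.InternalRaftRequest{LeaseRevoke: &pb.LeaseRevokeRequest{}}, true},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := isTooLargeGap(tt.ai, tt.ci, tt.req); got != tt.rejected {
				t.Errorf("isTooLargeGap(%d, %d) = %v, want %v", tt.ai, tt.ci, got, tt.rejected)
			}
		})
	}
}
```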
reading through #18175, maybe we can simply lift the limit by default? 5000 seems very smallish for the amount of throughput etcd can handle nowadays, especially when writing many small key/values.
> reading through #18175, maybe we can simply lift the limit by default? 5000 seems very smallish for the amount of throughput etcd can handle nowadays, especially when writing many small key/values.
Thanks for your reply. It seems difficult to determine a proper value for all kinds of workloads. A larger value would give the etcd server more time to consume the committed log before rejecting all incoming requests. However, if the etcd server cannot speed up applying (for example, while API servers are listing all objects), it will eventually crash because LeaseRevoke never gets a chance to run.
@ahrtr @serathius any suggestions?
> 2. Within the protection logic, reserve some queue space specifically for critical requests, so that essential system requests (`LeaseRevoke`, `Compact`) still have a chance to be executed under system pressure, preventing system crashes. Demo code below.

It seems a valid point to me, also low-hanging fruit. `LeaseRevoke` should be fine, but not sure about `Compact`. Note that etcdserver can't process any other requests during compaction.
> > Within the protection logic, reserve some queue space specifically for critical requests, so that essential system requests (`LeaseRevoke`, `Compact`) still have a chance to be executed under system pressure, preventing system crashes. Demo code below.
>
> It seems a valid point to me, also low-hanging fruit. `LeaseRevoke` should be fine, but not sure about `Compact`. Note that etcdserver can't process any other requests during compaction.
Thanks @ahrtr. I am not sure about `Compact` either. Imagine the etcd server misses several compactions; the outcomes may be:
1. the db size grows quickly, which is not harmful;
2. the treeIndex keeps many obsolete keyIndex entries, which may cause OOM if the situation lasts for a long time.
@ahrtr Should I submit a PR for this? If so, I guess it should be put behind a feature gate.
And what do you think about a configurable `maxGapBetweenApplyAndCommitIndex`?
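For illustration only, here is a self-contained sketch of what a configurable limit could look like; the type, field name, and default below are assumptions for discussion, not existing etcd configuration.

```go
// Hypothetical sketch of a configurable gap limit; the names and wiring are
// assumptions for discussion, not existing etcd configuration.
const defaultMaxGapBetweenApplyAndCommitIndex = 5000

type serverConfig struct {
	// MaxGapBetweenApplyAndCommitIndex of 0 means "use the default".
	MaxGapBetweenApplyAndCommitIndex uint64
}

func (c serverConfig) maxGap() uint64 {
	if c.MaxGapBetweenApplyAndCommitIndex > 0 {
		return c.MaxGapBetweenApplyAndCommitIndex
	}
	return defaultMaxGapBetweenApplyAndCommitIndex
}
```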
Pls bring both topics to our community meeting if possible.
One more comment on making an exception for requests like `LeaseRevoke`: if there is already a big gap between the applied index and the commit index, any following raft requests won't be processed/applied until all the queued requests have been processed. But it might still make sense to make an exception for such requests. Yes, it should be under a feature gate.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.