Preserve needSync Flag on Region Cache Reload Failure to Reduce Error Spikes During Store Issues

Open HaoW30 opened this issue 6 months ago • 0 comments

Issue Description

Currently, when a region entry in the TiDB region cache is marked as needSync and the attempt to fetch updated region information from PD fails, the client falls back to returning the stale region information. It neither retries the reload nor preserves the sync intent. Relevant code: https://github.com/tikv/client-go/blob/e84f1a780fa63c25b76b8813eee4d587904b8221/internal/locate/region_cache.go#L1541-L1551

This fallback mechanism works well in many scenarios: it preserves availability and allows some queries to proceed during transient PD issues.

However, it can cause higher error rates under certain conditions, especially for workloads with:

Strict max_execution_time settings.
TiKV store issues (e.g., EBS latency spikes or partial inaccessibility).
Many regions marked as needSync.

In these cases, requests may use stale region information, hit TiKV RPC errors, and then time out or fail without triggering region cache invalidation. For an example, see: https://github.com/tikv/client-go/blob/e84f1a780fa63c25b76b8813eee4d587904b8221/internal/locate/replica_selector.go#L488-L492

If the region cache entry was already marked as needSync and the reload failed, returning the old region info without setting the needSync flag again means subsequent requests keep using the stale region info. The result is unnecessarily elevated error rates and latency.

Proposed Enhancement

If a region cache entry is already marked as needSync, and the reload fails, the cache should retain or re-set the needSync flag. This ensures that subsequent requests are aware the region info is stale.

Impact

This change maintains the benefit of availability during PD glitches, while also improving stability and staleness visibility when store-level issues exist.

Reduces prolonged exposure to stale region data during store issues.
Improves query reliability under constrained timeout conditions.
Helps avoid unnecessary latency spikes and error accumulation.
Keeps existing behavior intact for transient PD unavailability.

May 29 '25 17:05 HaoW30