
Inconsistent store pointers between the region-cache and store-cache cause stale regions to become inaccessible.

Open AndreMouche opened this issue 1 week ago • 1 comments

I believe this is a bug related to inconsistent state between region-cache and store-cache when a TiKV store updates its address or labels.

https://github.com/tikv/client-go/blob/01758810e8419b784c0b652ad32ef03664df50bd/internal/locate/store_cache.go#L494-L521

From the above code, when the address or labels of a TiKV instance are updated, a new store object is created and replaces the old one in store-cache. We can confirm this from the log:

 store address or labels changed, add new store and mark old store deleted...

However, the region-cache is not updated to point to the new store. For any region in region-cache whose leader is on this TiKV, the region keeps referencing the old store object, so the status it sees never changes and the leader stays unavailable: https://github.com/tikv/client-go/blob/01758810e8419b784c0b652ad32ef03664df50bd/internal/locate/region_request.go#L804-L811

When accessing new regions that were not previously cached, the new store pointer is used and the leader may become available.
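To illustrate the pointer mismatch described above, here is a minimal, self-contained sketch. It is not client-go's actual code: `Store`, `Region`, `StoreCache`, `ReplaceOnChange`, and `UpdateInPlace` are hypothetical names, and the addresses are made up. It only shows how a cached region can keep pointing at a store object that the store-cache has already replaced, and how an in-place update would avoid that.

```go
package main

import "fmt"

// Store is a hypothetical, simplified stand-in for a cached store object.
type Store struct {
	ID      uint64
	Addr    string
	Deleted bool
}

// Region holds a pointer to the store of its leader, as a cached region would.
type Region struct {
	ID     uint64
	Leader *Store
}

// StoreCache maps a store ID to the current store object.
type StoreCache struct {
	stores map[uint64]*Store
}

// ReplaceOnChange mimics the behaviour described in this issue: a new object
// is created and the old one is marked deleted. Regions are not touched.
func (c *StoreCache) ReplaceOnChange(id uint64, newAddr string) *Store {
	if old := c.stores[id]; old != nil {
		old.Deleted = true
	}
	s := &Store{ID: id, Addr: newAddr}
	c.stores[id] = s
	return s
}

// UpdateInPlace is the alternative asked about: keep the same object and
// mutate it, so every existing pointer observes the new address.
func (c *StoreCache) UpdateInPlace(id uint64, newAddr string) *Store {
	s := c.stores[id]
	s.Addr = newAddr
	return s
}

// newFixture builds one store and one region whose leader is on that store.
func newFixture() (*StoreCache, *Region) {
	s := &Store{ID: 1, Addr: "10.0.0.1:20160"}
	cache := &StoreCache{stores: map[uint64]*Store{1: s}}
	return cache, &Region{ID: 100, Leader: s}
}

// staleAfterReplace reports whether the region still points at the old,
// deleted store object after the store-cache replaced it.
func staleAfterReplace() bool {
	cache, region := newFixture()
	cache.ReplaceOnChange(1, "10.0.0.2:20160")
	return region.Leader != cache.stores[1] && region.Leader.Deleted
}

// staleAfterInPlace reports whether the pointers diverge after an in-place update.
func staleAfterInPlace() bool {
	cache, region := newFixture()
	cache.UpdateInPlace(1, "10.0.0.2:20160")
	return region.Leader != cache.stores[1]
}

func main() {
	fmt.Println("stale after replace: ", staleAfterReplace())
	fmt.Println("stale after in-place:", staleAfterInPlace())
}
```

The sketch ignores concurrency. In a real client the store object is read from many goroutines, so an in-place update would presumably need atomic fields or a lock, which may be one reason a new object is created instead.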

We do have a related issue, https://github.com/tikv/client-go/issues/1401, and a related fix, https://github.com/tikv/client-go/pull/1402. However, the fix only stops the health check for the old store object; it still does not replace the store pointer in region-cache.

Here is my question: why do we not reuse the old store object directly instead of creating a new one?

Workaround: restart the TiDB instance

AndreMouche avatar Dec 15 '25 20:12 AndreMouche

Here is my question: why do we not reuse the old store object directly instead of creating a new one? https://github.com/tikv/client-go/blob/01758810e8419b784c0b652ad32ef03664df50bd/internal/locate/store_cache.go#L494-L521

AndreMouche avatar Dec 15 '25 20:12 AndreMouche