client-rust

"Connection refused" caused by stopping the tikv process when the TiKV store is marked as offline

Open BryceCao opened this issue 6 months ago • 6 comments

We mark a TiKV store as offline by calling the PD API to delete the store, which is the normal operation for shrinking TiKV nodes. Then we kill the tikv process to simulate hardware damage. In a client that was started before the shrinking operation, a scan API call fails with: gRPC api error: status: Unavailable, message: "error trying to connect: tcp connect error: Connection refused (os error 111)", details: [], metadata: MetadataMap { headers: {} }. But in a newly started client, the same scan API call returns the correct result. Likewise, when we start the tikv process again, the old client returns the correct result too.

BryceCao avatar Jun 06 '25 07:06 BryceCao

What's the version of TiKV & client-rust?

Which scan are you using, Txn or Raw? Please show the code for how client-rust is used.

pingyu avatar Jun 06 '25 08:06 pingyu

What's the version of TiKV & client-rust?

Which scan are you using, Txn or Raw? Please show the code for how client-rust is used.

TiKV v8.0.1 and the latest Rust client code. We use the Raw scan, calling the Rust client through a C++ bridge, like:

client_tikv = new tikv_client::RawKVClient(pd_vect);
auto kv_pairs = client_tikv->scan(start_marker, end_marker, max_to_get + 1, kTimeoutMs);

BryceCao avatar Jun 06 '25 09:06 BryceCao

I found that directly killing a TiKV process can also trigger this phenomenon. I suspect it is related to the leaders on the killed TiKV node — the client may have incorrectly accessed the killed TiKV instance.

BryceCao avatar Jun 06 '25 09:06 BryceCao

I found that directly killing a TiKV process can also trigger this phenomenon. I suspect it is related to the leaders on the killed TiKV node — the client may have incorrectly accessed the killed TiKV instance.

Some start_marker scan can return normally. I suspect that this bug is triggered only when the Region being accessed has its leader located on the TiKV instance that was killed.

BryceCao avatar Jun 06 '25 09:06 BryceCao
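The hypothesis above can be sketched as a toy model (purely illustrative, and not client-rust's actual internals — all names here are hypothetical): a client that caches the region's leader store keeps hitting the dead store until the cached entry is refreshed from PD, while a freshly started client fetches the current leader and succeeds immediately.

```rust
use std::collections::HashMap;

// Whether a simulated store is reachable.
#[derive(Clone, Copy, PartialEq, Debug)]
enum StoreState {
    Up,
    Down,
}

// A toy cluster: store id -> state.
struct Cluster {
    stores: HashMap<u64, StoreState>,
}

impl Cluster {
    // Sending to a killed store yields the transport error from the issue.
    fn send(&self, store_id: u64) -> Result<&'static str, &'static str> {
        match self.stores.get(&store_id) {
            Some(StoreState::Up) => Ok("scan result"),
            _ => Err("status: Unavailable, tcp connect error: Connection refused (os error 111)"),
        }
    }
}

// The old client's cached view of where the region leader lives.
struct RegionCache {
    leader_store: u64,
}

impl RegionCache {
    // Simulate invalidating the stale entry and asking PD for the new leader.
    fn refresh_from_pd(&mut self, new_leader: u64) {
        self.leader_store = new_leader;
    }
}

// A scan that, on a transport error, drops the stale cache entry,
// refetches the leader from PD, and retries once.
fn scan_with_retry(
    cluster: &Cluster,
    cache: &mut RegionCache,
    pd_leader: u64,
) -> Result<&'static str, &'static str> {
    match cluster.send(cache.leader_store) {
        Ok(r) => Ok(r),
        Err(_) => {
            cache.refresh_from_pd(pd_leader);
            cluster.send(cache.leader_store)
        }
    }
}
```

In this model, a scan without the refresh step reproduces the reported behavior: the old client fails until either its cache is refreshed or the dead store comes back up.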

It seems that scan_inner is not handling the gRPC error properly. Similar to #419.

pingyu avatar Jun 06 '25 09:06 pingyu

It seems that scan_inner is not handling the gRPC error properly. Similar to #419.

But scan_inner will definitely call the single_shard_handler function, and single_shard_handler already contains an is_grpc_error check.

BryceCao avatar Jun 09 '25 08:06 BryceCao
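For an error like this to be recovered from, detecting a gRPC error is not enough by itself: the connection-level failure also has to invalidate the cached region/store and trigger a retry, rather than being surfaced to the caller. A minimal classification sketch (an assumption for illustration, not client-rust's actual logic):

```rust
// Hypothetical gRPC status classification. `Unavailable` — the code seen
// in this issue's "Connection refused" error — indicates a transport-level
// failure, so the cached region should be invalidated and the request
// retried against the new leader. Codes like `InvalidArgument` reflect a
// caller error and should be returned as-is.
#[derive(Debug, PartialEq)]
enum GrpcCode {
    Unavailable,
    DeadlineExceeded,
    InvalidArgument,
    NotFound,
}

fn should_invalidate_and_retry(code: &GrpcCode) -> bool {
    matches!(code, GrpcCode::Unavailable | GrpcCode::DeadlineExceeded)
}
```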

@pingyu, I have submitted a PR (https://github.com/tikv/client-rust/pull/495); please review it. In my testing, it resolves this issue.

BryceCao avatar Jul 03 '25 06:07 BryceCao