client-rust

"Connection refused" caused by stopping the tikv process when the TiKV store is marked as offline

Open BryceCao opened this issue 6 months ago • 6 comments

We mark a TiKV store as offline by calling the PD API to delete the store, which is the normal operation for shrinking TiKV nodes. Then we kill the tikv process to simulate hardware damage. In a client that was started before the shrinking operation, a scan API call fails with: gRPC api error: status: Unavailable, message: "error trying to connect: tcp connect error: Connection refused (os error 111)", details: [], metadata: MetadataMap { headers: {} }. But in a newly started client, the same scan API call returns the correct result. Likewise, when we start the tikv process again, the old client returns the correct result too.

BryceCao avatar Jun 06 '25 07:06 BryceCao

What's the version of TiKV & client-rust?

Which scan are you using, Txn or Raw? Please show the code for how client-rust is used.

pingyu avatar Jun 06 '25 08:06 pingyu

What's the version of TiKV & client-rust?

Which scan are you using, Txn or Raw? Please show the code for how client-rust is used.

TiKV v8.0.1 and the latest Rust client code. We use the Raw scan, calling the Rust client through a C++ bridge, like:

client_tikv = new tikv_client::RawKVClient(pd_vect);
auto kv_pairs = client_tikv->scan(start_marker, end_marker, max_to_get + 1, kTimeoutMs);

BryceCao avatar Jun 06 '25 09:06 BryceCao

I found that directly killing a TiKV process can also trigger this phenomenon. I suspect it is related to the leaders on the killed TiKV node — the client may have incorrectly accessed the killed TiKV instance.

BryceCao avatar Jun 06 '25 09:06 BryceCao

I found that directly killing a TiKV process can also trigger this phenomenon. I suspect it is related to the leaders on the killed TiKV node — the client may have incorrectly accessed the killed TiKV instance.

Some start_marker scan can return normally. I suspect that this bug is triggered only when the Region being accessed has its leader located on the TiKV instance that was killed.

BryceCao avatar Jun 06 '25 09:06 BryceCao
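The hypothesis above can be sketched as a toy model (purely illustrative, and not client-rust's actual internals — all names here are hypothetical): a client that caches the region's leader store keeps hitting the dead store until the cached entry is refreshed from PD, while a freshly started client fetches the current leader and succeeds immediately.

```rust
use std::collections::HashMap;

// Whether a simulated store is reachable.
#[derive(Clone, Copy, PartialEq, Debug)]
enum StoreState {
    Up,
    Down,
}

// A toy cluster: store id -> state.
struct Cluster {
    stores: HashMap<u64, StoreState>,
}

impl Cluster {
    // Sending to a killed store yields the transport error from the issue.
    fn send(&self, store_id: u64) -> Result<&'static str, &'static str> {
        match self.stores.get(&store_id) {
            Some(StoreState::Up) => Ok("scan result"),
            _ => Err("status: Unavailable, tcp connect error: Connection refused (os error 111)"),
        }
    }
}

// The old client's cached view of where the region leader lives.
struct RegionCache {
    leader_store: u64,
}

impl RegionCache {
    // Simulate invalidating the stale entry and asking PD for the new leader.
    fn refresh_from_pd(&mut self, new_leader: u64) {
        self.leader_store = new_leader;
    }
}

// A scan that, on a transport error, drops the stale cache entry,
// refetches the leader from PD, and retries once.
fn scan_with_retry(
    cluster: &Cluster,
    cache: &mut RegionCache,
    pd_leader: u64,
) -> Result<&'static str, &'static str> {
    match cluster.send(cache.leader_store) {
        Ok(r) => Ok(r),
        Err(_) => {
            cache.refresh_from_pd(pd_leader);
            cluster.send(cache.leader_store)
        }
    }
}
```

In this model, a scan without the refresh step reproduces the reported behavior: the old client fails until either its cache is refreshed or the dead store comes back up.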

It seems that scan_inner is not handling the gRPC error properly. Similar to #419.

pingyu avatar Jun 06 '25 09:06 pingyu

It seems that scan_inner is not handling the gRPC error properly. Similar to #419.

But scan_inner will definitely call the single_shard_handler function, and single_shard_handler already contains an is_grpc_error check.

BryceCao avatar Jun 09 '25 08:06 BryceCao
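For an error like this to be recovered from, detecting a gRPC error is not enough by itself: the connection-level failure also has to invalidate the cached region/store and trigger a retry, rather than being surfaced to the caller. A minimal classification sketch (an assumption for illustration, not client-rust's actual logic):

```rust
// Hypothetical gRPC status classification. `Unavailable` — the code seen
// in this issue's "Connection refused" error — indicates a transport-level
// failure, so the cached region should be invalidated and the request
// retried against the new leader. Codes like `InvalidArgument` reflect a
// caller error and should be returned as-is.
#[derive(Debug, PartialEq)]
enum GrpcCode {
    Unavailable,
    DeadlineExceeded,
    InvalidArgument,
    NotFound,
}

fn should_invalidate_and_retry(code: &GrpcCode) -> bool {
    matches!(code, GrpcCode::Unavailable | GrpcCode::DeadlineExceeded)
}
```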

@pingyu, I have submitted a PR (https://github.com/tikv/client-rust/pull/495); please review it. In my testing, it resolves this issue.

BryceCao avatar Jul 03 '25 06:07 BryceCao