etcd
When I killed my java etcd program, I found that etcd would stop working and report the following error. What is the reason for this?
What happened?
{"level":"warn","ts":"2022-08-05T14:20:13.984+0800","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-08-05T14:20:10.778+0800","time spent":"3.20534961s","remote":"10.3.71.106:49706","response type":"/v3lockpb.Lock/Lock","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}}
{"level":"fatal","ts":"2022-08-05T14:31:47.331+0800","caller":"backend/batch_tx.go:152","msg":"failed to find a bucket","bucket-name":"key","stack":"go.etcd.io/etcd/server/v3/mvcc/backend.(*batchTx).unsafePut\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/batch_tx.go:155\ngo.etcd.io/etcd/server/v3/mvcc/backend.(*batchTx).UnsafeSeqPut\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/batch_tx.go:146\ngo.etcd.io/etcd/server/v3/mvcc/backend.(*batchTxBuffered).UnsafeSeqPut\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/batch_tx.go:368\ngo.etcd.io/etcd/server/v3/mvcc.(*storeTxnWrite).delete\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/kvstore_txn.go:279\ngo.etcd.io/etcd/server/v3/mvcc.(*storeTxnWrite).deleteRange\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/kvstore_txn.go:257\ngo.etcd.io/etcd/server/v3/mvcc.(*storeTxnWrite).DeleteRange\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/kvstore_txn.go:102\ngo.etcd.io/etcd/server/v3/mvcc.(*metricsTxnWrite).DeleteRange\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/metrics_txn.go:46\ngo.etcd.io/etcd/server/v3/lease.(*lessor).Revoke\n\t/go/src...........
What did you expect to happen?
etcd should keep working when the jetcd client disconnects unexpectedly.
How can we reproduce it (as minimally and precisely as possible)?
Connect to etcd through the Java client API (I use jetcd), write and read key-value pairs in etcd, and then manually kill the Java client while it is running.
Anything else we need to know?
No response
Etcd version (please run commands below)
v3.5.4
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
Relevant log output
No response
Hi @luomengY,
I would suggest you provide additional details in order to facilitate the investigation of this issue. In particular, anybody who would like to look into this problem, would probably need to try and replicate it first.
Would you be able to share some more information about the program you run on the client side, please? A minimal Java program that replicates the issue would be ideal, so that people can clearly understand what you're doing. Are you running this client program on the same machine as the etcd server?
Also, it would be beneficial to provide some information about your etcd server setup, as requested in the template.
Finally, I feel that some insight about how/when exactly you kill the client process would help.
Thanks.
Please help me take a look at this problem.
I will try to replicate this issue on my side, @luomengY
Just to be sure I understand what's happening in your scenario, can you confirm that you:
- start an etcd server on a single node (with which parameters ?)
- run the junit test cases above, which will execute lockTest1toMaster; this will create a lock in etcd with no expiration and then sleep
- while lockTest1toMaster is sleeping, kill the junit test
If I am not mistaken, you're explaining that at this moment, your etcd server raises a fatal error ("failed to find a bucket", "bucket-name":"key") and is unable to serve any more requests. Is that correct?
I did another experiment. When my Java client runs on a server outside the etcd cluster, this problem does not occur. Does anything need special attention when the Java client and etcd run on the same server?
I also found that when my Java client program is on the same server as etcd, every time I stop and restart the Java client, it reports an error: "range failed to find revision pair".
I have not been able to replicate this issue on my side.
Here are the steps I followed:
- start an etcd server on a linux box with one node, using release 3.5
- start lockTest1toMaster: the logs show that this one has the lock
- start lockTest2toStandby: the logs show that this one is waiting for the lock
- start lockTest3toStandby: the logs show that this one is waiting for the lock
Then I interrupt lockTest1toMaster, and the logs show that lockTest2toStandby gets the lock. Then I interrupt lockTest2toStandby, and the logs show that lockTest3toStandby gets the lock.
I don't get the same errors that you had, and my etcd server is still functional (able to put and get keys) after these actions.
Also, I don't see why the server would behave differently depending on whether the client runs on the same box or not.
Are your Java client program and etcd on the same server? This error occurs when etcd and the client program are running on the same server.
Yes, this was all on the same server.
OK, I have found the cause: the metadata of my etcd became inconsistent after it was stopped and started.
Would you mind explaining your findings a bit more, @luomengY, so that people facing the same error get some clues about it?
Is there anything to investigate with regard to the metadata inconsistency you found?
The problem arose because our platform performs a master-slave switchover and keeps a data directory that is synchronized between the master and the slave. We also put the etcd data in this directory, which amounts to synchronizing the etcd data manually. As a result, a master-slave switchover can change the etcd metadata on the primary and the secondary.
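For anyone with a similar setup, one possible guard is a pre-start check that refuses to launch etcd when its data directory sits inside a directory that something else synchronizes. The following is only a minimal sketch under assumptions: the class name `EtcdDataDirCheck`, the method `isInsideSyncedDir`, and both paths are hypothetical and are not part of etcd or jetcd. It compares normalized paths only and does not resolve symlinks.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-start check; not part of etcd or jetcd.
final class EtcdDataDirCheck {

    // Returns true when dataDir is the same as, or nested under, syncedDir.
    // Paths are made absolute and normalized; symlinks are NOT resolved.
    static boolean isInsideSyncedDir(Path dataDir, Path syncedDir) {
        Path data = dataDir.toAbsolutePath().normalize();
        Path synced = syncedDir.toAbsolutePath().normalize();
        // Path.startsWith compares whole name elements, so
        // /data/ha-sync2 is not treated as being under /data/ha-sync.
        return data.startsWith(synced);
    }

    public static void main(String[] args) {
        // Both paths are assumptions for illustration only.
        Path syncedDir = Paths.get("/data/ha-sync");
        Path dataDir = Paths.get("/data/ha-sync/etcd/default.etcd");

        if (isInsideSyncedDir(dataDir, syncedDir)) {
            System.err.println("etcd data dir " + dataDir
                    + " is inside the synchronized directory " + syncedDir);
            System.exit(1);
        }
        System.out.println("etcd data dir is outside the synchronized directory");
    }
}
```

With the example paths above, the check fails and the program exits with status 1; pointing the data directory at a location outside the synchronized directory makes it pass.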
Thanks for the update @luomengY
I can indeed see how this issue is related to your particular setup, and not etcd itself. I would suggest we close this issue then, as no further action is required.
ok
Close this issue per the discussion