When I killed my java etcd program, I found that etcd would stop working and report the following error. What is the reason for this?

Open luomengY opened this issue 3 years ago • 10 comments

What happened?

{"level":"warn","ts":"2022-08-05T14:20:13.984+0800","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-08-05T14:20:10.778+0800","time spent":"3.20534961s","remote":"10.3.71.106:49706","response type":"/v3lockpb.Lock/Lock","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}}

{"level":"fatal","ts":"2022-08-05T14:31:47.331+0800","caller":"backend/batch_tx.go:152","msg":"failed to find a bucket","bucket-name":"key","stack":"go.etcd.io/etcd/server/v3/mvcc/backend.(*batchTx).unsafePut\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/batch_tx.go:155\ngo.etcd.io/etcd/server/v3/mvcc/backend.(*batchTx).UnsafeSeqPut\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/batch_tx.go:146\ngo.etcd.io/etcd/server/v3/mvcc/backend.(*batchTxBuffered).UnsafeSeqPut\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/backend/batch_tx.go:368\ngo.etcd.io/etcd/server/v3/mvcc.(*storeTxnWrite).delete\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/kvstore_txn.go:279\ngo.etcd.io/etcd/server/v3/mvcc.(*storeTxnWrite).deleteRange\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/kvstore_txn.go:257\ngo.etcd.io/etcd/server/v3/mvcc.(*storeTxnWrite).DeleteRange\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/kvstore_txn.go:102\ngo.etcd.io/etcd/server/v3/mvcc.(*metricsTxnWrite).DeleteRange\n\t/go/src/go.etcd.io/etcd/release/etcd/server/mvcc/metrics_txn.go:46\ngo.etcd.io/etcd/server/v3/lease.(*lessor).Revoke\n\t/go/src...........
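Both entries above are structured JSON logs, so the fatal one can be picked out mechanically when triaging a larger log file. The following is only a minimal sketch, assuming one JSON object per line; the helper name `fatal_entries` is hypothetical (not part of etcd), and the sample lines are abbreviated copies of the two entries above. Lines that do not parse as JSON, such as truncated ones, are skipped.

```python
import json


def fatal_entries(lines):
    """Yield the parsed entry for every log line whose level is "fatal"."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Truncated or malformed lines are ignored rather than repaired.
            continue
        if isinstance(entry, dict) and entry.get("level") == "fatal":
            yield entry


# Abbreviated versions of the two log entries quoted in this issue.
sample = [
    '{"level":"warn","ts":"2022-08-05T14:20:13.984+0800","msg":"request stats"}',
    '{"level":"fatal","ts":"2022-08-05T14:31:47.331+0800",'
    '"caller":"backend/batch_tx.go:152","msg":"failed to find a bucket",'
    '"bucket-name":"key"}',
]

for entry in fatal_entries(sample):
    print(entry["ts"], entry["caller"], entry["msg"], entry.get("bucket-name"))
```

With the sample above, only the `failed to find a bucket` entry is reported; the `request stats` warning is filtered out.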

What did you expect to happen?

etcd should keep working when the jetcd client disconnects unexpectedly.

How can we reproduce it (as minimally and precisely as possible)?

Connect to etcd through the Java etcd client API, store and read key-value pairs in etcd, and manually kill the Java client while it is running. The Java client I use is jetcd.

Anything else we need to know?

No response

Etcd version (please run commands below)

v3.5.4

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

luomengY avatar Aug 05 '22 07:08 luomengY

Hi @luomengY,

I would suggest you provide additional details to facilitate the investigation of this issue. In particular, anybody who would like to look into this problem would probably need to try to replicate it first.

Would you be able to share some more information about the program you run on the client side, please? A minimal Java program that reproduces the issue would be ideal, so that people can clearly understand what you're doing. Are you running this client program on the same machine as the etcd server?

Also, it would be beneficial to provide some information about your etcd server setup, as requested in the template.

Finally, I feel that some insight about how/when exactly you kill the client process would help.

Thanks.

jbml avatar Aug 05 '22 12:08 jbml

Please help me take a look at this problem.

luomengY avatar Aug 07 '22 15:08 luomengY

I will try to replicate this issue on my side, @luomengY

Just to be sure I understand what's happening in your scenario, can you confirm that you:

  1. start an etcd server on a single node (with which parameters?)
  2. run the junit test cases above, which will execute lockTest1toMaster; this will create a lock in etcd with no expiration and then sleep
  3. while lockTest1toMaster is sleeping, kill the junit test

If I am not mistaken, you're explaining that at this moment, your etcd server raises a fatal error ("msg":"failed to find a bucket","bucket-name":"key") and is unable to serve any more requests. Is that correct?

jbml avatar Aug 08 '22 02:08 jbml

I did another experiment. When my Java client runs on a server outside the etcd cluster, this problem does not occur. When the Java client and etcd are on the same server, is there anything special I need to pay attention to?

luomengY avatar Aug 08 '22 06:08 luomengY

I also found that when my Java client program is on the same server as etcd, an error is reported every time I stop and restart the Java client: "range failed to find revision pair".

luomengY avatar Aug 09 '22 01:08 luomengY

I have not been able to replicate this issue on my side.

Here are the steps I followed:

  • start an etcd server on a linux box with one node, using release 3.5
  • start lockTest1toMaster: the logs show that this one has the lock
  • start lockTest2toStandby: the logs show that this one is waiting for the lock
  • start lockTest3toStandby: the logs show that this one is waiting for the lock

Then I interrupt lockTest1toMaster, and the logs show that lockTest2toStandby gets the lock. Then I interrupt lockTest2toStandby, and the logs show that lockTest3toStandby gets the lock.

I don't get the same errors that you had, and my etcd server is still functional (able to put and get keys) after these actions.

Also, I don't see why the server would behave differently depending on whether the client runs on the same box or not.

jbml avatar Aug 09 '22 07:08 jbml

Are your Java client program and etcd on the same server? This error occurs when etcd and the client program are running on the same server.

luomengY avatar Aug 09 '22 14:08 luomengY

Yes, this was all on the same server.

jbml avatar Aug 10 '22 02:08 jbml

OK, I have found the cause: the metadata of my etcd became inconsistent after starting and stopping.

luomengY avatar Aug 10 '22 03:08 luomengY

Would you mind explaining your findings in a bit more detail, @luomengY, so that people facing the same error get some clues about it?

Is there anything to investigate with regard to the metadata inconsistency you found?

jbml avatar Aug 10 '22 06:08 jbml

The problem I encountered occurs because our platform has a master-slave switchover, with a data directory that is synchronized between the master and the slave. We also put the etcd data in this directory, which amounts to synchronizing the etcd data manually. As a result, a master-slave switchover may change the etcd metadata on the primary and the secondary.
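One way to notice this kind of out-of-band change is to fingerprint the data directory when etcd stops and compare it again before the next start. The following is only a rough sketch under stated assumptions: etcd is not running while the files are read, and the helper names `fingerprint` and `changed_files` as well as the demo file layout are hypothetical, not etcd tooling. It detects that files changed; it does not make a file-synchronized data directory safe to use.

```python
import hashlib
import os
import tempfile


def fingerprint(data_dir):
    """Return {relative path: SHA-256 hex digest} for every file under data_dir."""
    digests = {}
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = os.path.join(root, name)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large database files are not loaded at once.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            digests[os.path.relpath(path, data_dir)] = digest.hexdigest()
    return digests


def changed_files(before, after):
    """List paths that were added, removed, or modified between two fingerprints."""
    paths = before.keys() | after.keys()
    return sorted(p for p in paths if before.get(p) != after.get(p))


# Demo on a throwaway directory standing in for the etcd data directory.
with tempfile.TemporaryDirectory() as demo_dir:
    db_path = os.path.join(demo_dir, "db")
    with open(db_path, "wb") as handle:
        handle.write(b"state written by the local etcd")
    at_stop = fingerprint(demo_dir)

    # Simulate the platform's file sync overwriting the data while etcd is stopped.
    with open(db_path, "wb") as handle:
        handle.write(b"state copied from the other node")
    print(changed_files(at_stop, fingerprint(demo_dir)))
```

A non-empty result before startup would indicate that something other than etcd modified the data directory since the last shutdown.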

luomengY avatar Aug 13 '22 11:08 luomengY

Thanks for the update @luomengY

I can indeed see how this issue is related to your particular setup, and not etcd itself. I would suggest we close this issue then, as no further action is required.

jbml avatar Aug 13 '22 13:08 jbml

ok

---Original--- From: "Jeremy @.> Date: Sat, Aug 13, 2022 21:17 PM To: @.>; Cc: @.@.>; Subject: Re: [etcd-io/etcd] When I killed my java etcd program, I found that etcd would stop working and report the following error. What is the reason for this? (Issue #14314)

luomengY avatar Aug 14 '22 02:08 luomengY

Closing this issue per the discussion.

ahrtr avatar Aug 22 '22 01:08 ahrtr

Are your Java client program and etcd on the same server? This error occurs when etcd and the client program are running on the same server.

luomengY avatar Oct 11 '22 07:10 luomengY