Examine multi-thread correctness in ZooPropstore.get()

EdColeman opened this issue 3 years ago

From a PR comment on the ZooPropStore.get() method - this can be reviewed as a follow-on to the single node prop store changes.

@keith-turner said:

Tracing through the code, I think the following sequence of events could happen. Does this seem correct? Want to make sure I am following the code correctly.

1. Thread 1 calls checkZkConnection() and does not block because the connection is currently ok.
2. Thread 2 sees a connection loss event in the watcher, clears the ready state, and queues up an event to clear the cache.
3. Thread 3 clears the cache (running in an executor on behalf of the event queued by thread 2).
4. Thread 1 executes cache.get()... because the cache was cleared, a loader is executed in a thread pool.
5. Thread 4 executes the code to load data from ZK... the connection is still lost, so what happens here? This thread is executing on behalf of the cache.get() initiated by thread 1.

I am not sure exactly what happens in step 5 above, does the load from ZK retry when it sees a connection lost exception? If it does then the checkZkConnection() is redundant. If it does not then the checkZkConnection() does not always prevent problems in the case of connection loss, and maybe it should retry in order to eliminate the race condition.
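Roughly, the sketch below models that interleaving. It is a simplified, hypothetical stand-in (the latch, class, and method names are made up) and not the real ZooPropStore, ReadyMonitor, or Caffeine loader code.

import java.util.concurrent.CountDownLatch;

// Hypothetical model of the five steps above; not the real ZooPropStore code.
class GetRaceSketch {
    // An open latch means "ZooKeeper connection ok". On a connection-loss
    // event the watcher thread would swap in a closed latch (step 2).
    private volatile CountDownLatch zkReady = openLatch();

    private static CountDownLatch openLatch() {
        CountDownLatch latch = new CountDownLatch(1);
        latch.countDown();
        return latch;
    }

    Object get(Object propKey) throws InterruptedException {
        zkReady.await(); // step 1: does not block, connection currently looks ok
        // step 2: watcher thread sees connection loss, closes zkReady, queues a cache clear
        // step 3: executor thread clears the cache
        // step 4: the entry is gone, so the cache schedules a load in a thread pool
        // step 5: the loader thread reads from ZooKeeper while possibly still disconnected
        return loadViaCache(propKey);
    }

    private Object loadViaCache(Object propKey) {
        return new Object(); // stands in for cache.get(key) backed by a ZooKeeper read
    }
}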

EdColeman avatar May 10 '22 21:05 EdColeman

I think the test most relevant to this issue is ZooBasedConfigIT.expireTest. It may be possible to derive a new test based on this test that shuts down ZooKeeper during the load/reload operation.

dlmarion avatar Jun 23 '22 16:06 dlmarion

I wrote a test that closes the connection on the ZooKeeper server-side using the client sessionId. If what I did is correct, then I don't believe that the ZooPropStore is getting connection events. I'm going to push a branch and draft PR for this shortly.

dlmarion avatar Jul 01 '22 15:07 dlmarion

If what I did is correct,

It wasn't. I was closing a different client ZK session. I believe that I have a test now that implements the order of operations from the original description. I had to move one line of code in ZooPropStore.get so that the watcher is registered before the call to checkZKConnection. Here is the test output:

2022-07-01T18:38:54,281 [impl.PropStoreMetrics] DEBUG: Creating PropStore metrics
2022-07-01T18:38:54,317 [metrics.MetricsUtil] INFO : Metric producer PropStoreMetrics initialize
2022-07-01T18:38:54,319 [impl.ZooPropStore] DEBUG: Have a ZooKeeper connection and found instance node: 2a0af30d-5918-425d-b94e-78b66a466ccd
2022-07-01T18:38:54,326 [impl.TestZooPropStore] INFO : super.checkZkConnection() called, calling runCloseConnectionTask
2022-07-01T18:38:54,329 [server.ZooKeeperServer] INFO : Closing session 0x10001954cc20002
2022-07-01T18:38:54,433 [impl.PropStoreWatcher] DEBUG: ZooKeeper disconnected event received
2022-07-01T18:38:54,435 [impl.ZooPropStore] TRACE: connectionEvent, clearing cache
2022-07-01T18:38:55,945 [server.ZooKeeperServer] INFO : Invalid session 0x10001954cc20002 for client /127.0.0.1:56280, probably expired
2022-07-01T18:38:55,946 [zookeeper.ZooSession] DEBUG: Session expired; WatchedEvent state:Expired type:None path:null
2022-07-01T18:38:55,947 [impl.PropStoreWatcher] INFO : ZooKeeper connection closed event received
2022-07-01T18:38:55,947 [impl.ZooPropStore] TRACE: connectionEvent, clearing cache
2022-07-01T18:38:56,327 [impl.TestZooPropStore] INFO : runCloseConnectionTask called, waiting for connection event to release latch
2022-07-01T18:38:56,327 [impl.TestZooPropStore] INFO : latch released
2022-07-01T18:38:56,327 [impl.PropCacheCaffeineImpl] TRACE: Called get() for SystemPropKey{InstanceId=2a0af30d-5918-425d-b94e-78b66a466ccd'}
2022-07-01T18:38:56,335 [impl.ZooPropLoader] TRACE: load called for SystemPropKey{InstanceId=2a0af30d-5918-425d-b94e-78b66a466ccd'}
2022-07-01T18:38:56,335 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session to 127.0.0.1:39327
2022-07-01T18:38:56,335 [zookeeper.ZooSession] DEBUG: Connecting to 127.0.0.1:39327 with timeout 5000 with auth
2022-07-01T18:38:56,437 [server.ZooKeeperServer] INFO : got auth packet /127.0.0.1:56282
2022-07-01T18:38:56,437 [server.ZooKeeperServer] INFO : Session 0x10001954cc20003: auth success for scheme digest and address /127.0.0.1:56282
2022-07-01T18:38:56,441 [impl.PropStoreMetrics] TRACE: Load time: 106
2022-07-01T18:38:56,441 [impl.PropStoreMetrics] TRACE: Load count: 0 time:0.0

Note that the connection event is received twice, once for disconnect and once for close, thus the cache is cleared twice. The connection to ZooKeeper is re-established so that the ZooPropLoader can load the properties.
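Conceptually, the double clear looks something like the watcher sketch below, which treats both the transient disconnect and the terminal close/expire states as connection events. This is a simplified illustration, not the real PropStoreWatcher, and the helper methods are hypothetical.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// Simplified watcher: each connection-related state change clears the cache,
// so a disconnect followed by a close clears it twice.
class ConnectionWatcherSketch implements Watcher {
    @Override
    public void process(WatchedEvent event) {
        switch (event.getState()) {
            case Disconnected: // transient; the client may reconnect
            case Expired:      // terminal; the session is gone
            case Closed:       // terminal; the client closed the session
                clearCacheAndBlockReaders();
                break;
            case SyncConnected: // (re)connected; release waiting readers
                releaseReaders();
                break;
            default:
                break;
        }
    }

    private void clearCacheAndBlockReaders() { /* clear the prop cache, close the ready barrier */ }

    private void releaseReaders() { /* open the ready barrier */ }
}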

dlmarion avatar Jul 01 '22 18:07 dlmarion

I am not sure exactly what happens in step 5 above, does the load from ZK retry when it sees a connection lost exception?

It appears that the ZooKeeper client will re-establish the connection.

If it does then the checkZkConnection() is redundant.

That may be. We could easily modify this test to comment out that method call and see if the result is the same.

dlmarion avatar Jul 01 '22 18:07 dlmarion

The PropStoreWatcher will clear the cache when the connection is disconnected or closed (fires twice) and then the ZooPropLoader calls ZooReaderWriter.getData which is in a retry loop. The retry loop may return null after exhausting its retries, so I think the ReadyMonitor is an attempt to prevent that (although there is a race condition) by not calling ZooReaderWriter.getData until the PropStoreWatcher gets an event that says the connection is re-established.
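The shape of that load path, as I understand it, is roughly the sketch below: wait at the ready barrier, then read with a bounded number of retries that can end in null. The names are hypothetical stand-ins for ZooReaderWriter.getData and the ReadyMonitor, not the real code.

import java.util.concurrent.TimeUnit;

// Hypothetical load path: barrier first, then a bounded retry loop.
class LoaderSketch {
    private static final int MAX_RETRIES = 5;

    byte[] load(String path) throws InterruptedException {
        waitUntilZkReady(); // barrier: blocks while ZK is known to be down
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            byte[] data = tryGetDataFromZk(path);
            if (data != null) {
                return data;
            }
            TimeUnit.MILLISECONDS.sleep(250); // back off before retrying
        }
        return null; // retries exhausted; the caller must treat this as an error
    }

    private void waitUntilZkReady() { /* ReadyMonitor-style wait, released by the watcher */ }

    private byte[] tryGetDataFromZk(String path) {
        return null; // stands in for a ZooKeeper read that fails on connection loss
    }
}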

dlmarion avatar Jul 05 '22 15:07 dlmarion

I am not sure that clearing twice is an issue. The state transitions of ZooKeeper seemed to indicate that you can get a disconnect and then reconnect without a close. Close is a terminal state.

As far as the ready monitor itself - it is meant to block without holding locks - it functions as a barrier so that when the connection is known to be disconnected, everyone is not banging on ZooKeeper with retries. There is an issue that code could get past the barrier and then the ZooKeeper connection is lost, and I don't think that can ever be prevented. It should be a small window, but it will always be there. The issue would then be: if code gets past the barrier and the connection is lost, is it handling that correctly?

If the code proceeds past the barrier, calls get(), and the connection fails, the retry will eventually return null. On the null, that should trigger an error / exception.
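In other words, something along the lines of the sketch below: if the bounded retries come back null, fail loudly instead of proceeding. The names here are illustrative, not the real ZooPropLoader code.

// Hypothetical handling of a null result from the bounded retry loop.
class NullHandlingSketch {
    byte[] readProps(String propPath) {
        byte[] propBytes = getDataWithRetries(propPath);
        if (propBytes == null) {
            // retries exhausted, likely still disconnected; surface an error
            throw new IllegalStateException("failed to read " + propPath + " from ZooKeeper after retries");
        }
        return propBytes;
    }

    private byte[] getDataWithRetries(String propPath) {
        return null; // stands in for ZooReaderWriter.getData's bounded retry loop
    }
}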

Currently I think that we do not handle connection loss as cleanly as we could - if a cluster tries to do a rolling ZK restart, there are a lot of tservers that drop offline. This does not fully fix that, but it should get things closer, and then the other problem areas can be addressed separately.

EdColeman avatar Jul 05 '22 18:07 EdColeman

There is an issue that code could get past the barrier and then the ZooKeeper connection is lost, and I don't think that can ever be prevented. It should be a small window, but it will always be there. The issue would then be: if code gets past the barrier and the connection is lost, is it handling that correctly? If the code proceeds past the barrier, calls get(), and the connection fails, the retry will eventually return null. On the null, that should trigger an error / exception.

If we used a different RetryFactory, one that used infiniteRetries instead of the default one in ZooReader, then in this case it would not return null, and the exception would never be triggered.

dlmarion avatar Jul 05 '22 18:07 dlmarion

With infinite retries we'd need to handle a closed state and just quit. Not an issue, just something that needs to be addressed.
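A rough sketch of that combination is below: retry indefinitely, but still quit when a terminal closed/expired state is observed. The class and method names are hypothetical, not the real ZooReader retry or prop store shutdown code.

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical infinite-retry read that escapes when the store is closed.
class InfiniteRetrySketch {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    byte[] getData(String path) throws InterruptedException {
        while (true) {
            if (closed.get()) {
                // terminal state: the session/store is closed, so stop retrying
                throw new IllegalStateException("prop store closed while reading " + path);
            }
            byte[] data = tryGetDataFromZk(path);
            if (data != null) {
                return data;
            }
            TimeUnit.MILLISECONDS.sleep(250); // back off, then retry again
        }
    }

    void close() {
        closed.set(true); // called when the watcher sees a terminal close/expire event
    }

    private byte[] tryGetDataFromZk(String path) {
        return null; // stands in for a ZooKeeper read that fails while disconnected
    }
}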

EdColeman avatar Jul 05 '22 18:07 EdColeman