dcache icon indicating copy to clipboard operation
dcache copied to clipboard

Unable to reconnect to ZooKeeper service, session expired

Open QiulanHuang opened this issue 1 year ago • 5 comments
trafficstars

Dear all,

Recently, we noticed some pools failed to reconnect to ZooKeeper Service complaining the session expired. It needs to restart the pool to fix it. The error log is listed below.

03 May 2024 10:11:46 (System) [] Session 0x10041c40b758d24 for server dczoo01.usatlas.bnl.gov/10.42.34.241:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionExpiredException: Unable to reconnect to ZooKeeper service, session 0x10041c40b758d24 has expired
        at org.apache.zookeeper.ClientCnxn$SendThread.onConnected(ClientCnxn.java:1432)
        at org.apache.zookeeper.ClientCnxnSocket.readConnectResult(ClientCnxnSocket.java:154)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:86)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1283)
03 May 2024 10:11:46 (System) [] Invalid config event received: {server.2=dczoo03.usatlas.bnl.gov:2888:3888:participant, server.1=dczoo02.usatlas.bnl.gov:2888:3888:participant, server.0=dczoo01.usatlas.bnl.gov:2888:3888:participant, version=0}

We are using dCache 9.2.17. The issue also happened in the old version 9.2.6, btw. It's not clear the root cause. I didn't noticed the network problem between the pools and the Zookeeper servers.

Regards, Qiulan

QiulanHuang avatar May 03 '24 15:05 QiulanHuang

Hi Qiulan.

Which ZooKeeper version are you using? You have a cluster of 3 standalone ones, right? The line Invalid config event received indicates that you ZooKeeper configuration is wrong, please check it. Since 6.2, dCache requires the associated ZooKeeper servers to be of version 3.5 or above, but with ZooKeeper 3.5.0 the configuration format has changed. Previously, the zoo.conf files could contain a separate line stating the client port, clientPort=2181, which now needs to be at the end of the line: server.1=os-zk-node01:2888:3888;2181. As far as I know, newer versions were backward compatible and this was just a warning, but maybe this has changed.

As to the SessionExpiredException: Unable to reconnect to ZooKeeper service: is your zookeeper cluster stable, is there anything logged on the zk side?

Lea

lemora avatar May 07 '24 13:05 lemora

Hello Lea,

Thank you for your reply.

We are using ZooKeeper 3.6.3. The logs on ZK side shows the session expired too.

2024-05-03 10:11:46,254 [myid:0] - INFO  [QuorumPeer[myid=0](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):ZooKeeperServer@1059] - Invalid session 0x10041c40b758d24 for client /10.42.38.120:43812, probably expired

Thanks, Qiulan

QiulanHuang avatar May 07 '24 14:05 QiulanHuang

Yes we have a cluster of three nodes and it is stable

cfgamboa avatar May 08 '24 13:05 cfgamboa

@QiulanHuang Does all dcache components run 9.2.17?

kofemann avatar Jun 25 '24 14:06 kofemann

Hello @kofemann We are running in mixed mode now. Most of the nodes are running 9.2.17 and around 10 new added pool nodes are running 9.2.20.

Thanks, Qiulan

QiulanHuang avatar Jul 08 '24 20:07 QiulanHuang

Hello

The goal is to simulate a host network disturbance and the pool's state.

dc234 is used to simulate this, there is a test pool enabled.

dcache pool ls
POOL    DOMAIN         META SIZE    FREE    PATH                      
dc234_1 dc234oneDomain db   117714G 134108G /data0/dcache_pool_1/pool 

Pool restarted

22 Aug 2024 10:30:07 (System) [] Invalid config event received: {server.2=dczoo03.usatlas.bnl.gov:2888:3888:participant, server.1=dczoo02.usatlas.bnl.gov:2888:3888:participant, server.0=dczoo01.usatlas.bnl.gov:2888:3888:participant, version=0}
22 Aug 2024 10:30:07 (System) [] Invalid config event received: {server.2=dczoo03.usatlas.bnl.gov:2888:3888:participant, server.1=dczoo02.usatlas.bnl.gov:2888:3888:participant, server.0=dczoo01.usatlas.bnl.gov:2888:3888:participant, version=0}
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
22 Aug 2024 10:30:11 (dc234_1) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/data0/dcache_pool_1/pool/data;meta=/data0/dcache_pool_1/pool/meta]]].
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to enabled

State of the network bond0 active peers

[root@dc234 ~]# ifstat 
#kernel
Interface        RX Pkts/Rate    TX Pkts/Rate    RX Data/Rate    TX Data/Rate  
                 RX Errs/Drop    TX Errs/Drop    RX Over/Rate    TX Coll/Rate  
lo               546421K 0       546421K 0         1980M 0         1980M 0      
                       0 0             0 0             0 0             0 0      
Internal25G00      2405M 0         4114M 0         3974M 0       882022K 0      
                       0 3319K         0 0             0 0             0 0      
Internal25G01      3832M 0         2440M 0         3848M 0         3711M 0      
                       0 2472K         0 0             0 0             0 0      
bond0              1018M 0         2847M 0         2631M 0         3394M 0      
                       0 10735K        0 0             0 0             0 0    

simulating partial network degration by stopping a peer from bond

[root@dc234 ~]# ip link set dev Internal25G00 down
[root@dc234 ~]# tail -100f /var/log/messages|grep bond
Aug 22 10:35:34 dc234 kernel: bond0: link status definitely down for interface Internal25G00, disabling it


[root@dc234 ~]# ifstat
#kernel
Interface        RX Pkts/Rate    TX Pkts/Rate    RX Data/Rate    TX Data/Rate  
                 RX Errs/Drop    TX Errs/Drop    RX Over/Rate    TX Coll/Rate  
lo                    60 0            60 0          5128 0          5128 0      
                       0 0             0 0             0 0             0 0      
Internal25G01       3563 0          4925 0        278760 0         3034K 0      
                       0 0             0 0             0 0             0 0      
bond0               7136 0          5629 0        534351 0         3209K 0      
                       0 0             0 0             0 0             0 0 

checking state of pool

[root@dc234 ~]# tail -f /var/log/dcache/dc234oneDomain.log
22 Aug 2024 10:30:07 (System) [] Invalid config event received: {server.2=dczoo03.usatlas.bnl.gov:2888:3888:participant, server.1=dczoo02.usatlas.bnl.gov:2888:3888:participant, server.0=dczoo01.usatlas.bnl.gov:2888:3888:participant, version=0}
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
22 Aug 2024 10:30:11 (dc234_1) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/data0/dcache_pool_1/pool/data;meta=/data0/dcache_pool_1/pool/meta]]].
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to enabled
22 Aug 2024 10:35:58 (System) [] Client session timed out, have not heard from server in 26680ms for session id 0x10041c40b759013
22 Aug 2024 10:35:58 (System) [] Session 0x10041c40b759013 for server dczoo02.usatlas.bnl.gov/10.42.34.242:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 26680ms for session id 0x10041c40b759013
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1251)
22 Aug 2024 10:35:58 (System) [] Invalid config event received: {server.2=dczoo03.usatlas.bnl.gov:2888:3888:participant, server.1=dczoo02.usatlas.bnl.gov:2888:3888:participant, server.0=dczoo01.usatlas.bnl.gov:2888:3888:participant, version=0}

Relevant error message is

22 Aug 2024 10:35:58 (System) [] Client session timed out, have not heard from server in 26680ms for session id 0x10041c40b759013
22 Aug 2024 10:35:58 (System) [] Session 0x10041c40b759013 for server dczoo02.usatlas.bnl.gov/10.42.34.242:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 26680ms for session id 0x10041c40b759013
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1251)

2.1. Pool appears operational without restarting it

[dccore01] (dc234_1@dc234oneDomain) admin > pp get file 0000384776BCD2D349CB8CCF1ADF11952C87 dc262_6
Transfer Initiated
[dccore01] (dc234_1@dc234oneDomain) admin > rep ls
0000384776BCD2D349CB8CCF1ADF11952C87 <C-------X--L(0)[0]> 4967844 si={cgamboa:USERS}
re-enable the interface to the bond
[root@dc234 ~]# ip link set dev Internal25G00 up
[root@dc234 ~]# ifstat
#kernel
Interface        RX Pkts/Rate    TX Pkts/Rate    RX Data/Rate    TX Data/Rate  
                 RX Errs/Drop    TX Errs/Drop    RX Over/Rate    TX Coll/Rate  
lo                     0 0             0 0             0 0             0 0      
                       0 0             0 0             0 0             0 0      
Internal25G00      2405M 0         4114M 0         3974M 0       882199K 0      
                       0 3319K         0 0             0 0             0 0      
Internal25G01       8005 0          4302 0         5612K 0         3538K 0      
                       0 0             0 0             0 0             0 0      
bond0              2405M 0      18446744073529M 0      18446744073394M 0       885738K 0      
                       0 3319K         0 0             0 0             0 0      
simulating the other interface failure
22 Aug 2024 10:43:19 (System) [] Client session timed out, have not heard from server in 26681ms for session id 0x10041c40b759013
22 Aug 2024 10:43:19 (System) [] Session 0x10041c40b759013 for server dczoo03.usatlas.bnl.gov/10.42.34.243:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 26681ms for session id 0x10041c40b759013
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1251)

The pool remains accessible

[dccore01] (dc234_1@dc234oneDomain) admin > rep ls
0000384776BCD2D349CB8CCF1ADF11952C87 <C-------X--L(0)[0]> 4967844 si={c[gamboa](https://git.racf.bnl.gov/gitea/gamboa):USERS}
gamboa
gamboa commented [4 weeks ago](https://git.racf.bnl.gov/gitea/SDCC/dcache/issues/2#issuecomment-1639)
Poster
Collaborator
Automatic restart of pool due to the zookeeper event

22 Aug 2024 10:43:19 (System) [] Session 0x10041c40b759013 for server dczoo03.usatlas.bnl.gov/10.42.34.243:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 26681ms for session id 0x10041c40b759013
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1251)

So for network instabilities the systems appears to be handling accordingly the failover Will it be simple for dcache to start cataloging this type of errors?

cfgamboa avatar Sep 18 '24 13:09 cfgamboa

@cfgamboa thanks for the detailed update

Will it be simple for dcache to start cataloging this type of errors?

I don't understand the question. What is the expected behavior?

kofemann avatar Sep 18 '24 17:09 kofemann

For my simulation the expected behavior is that pool remains operational without having to be restarted when network degradation occurs. The error observed on the pool might be able to be classified? This could help when other similar errors occur like the one reported by initially in the ticket.

cfgamboa avatar Sep 18 '24 17:09 cfgamboa