Unable to reconnect to ZooKeeper service, session expired
Dear all,
Recently we noticed that some pools failed to reconnect to the ZooKeeper service, complaining that the session had expired. The pool has to be restarted to fix it. The error log is listed below.
03 May 2024 10:11:46 (System) [] Session 0x10041c40b758d24 for server dczoo01.usatlas.bnl.gov/10.42.34.241:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionExpiredException: Unable to reconnect to ZooKeeper service, session 0x10041c40b758d24 has expired
at org.apache.zookeeper.ClientCnxn$SendThread.onConnected(ClientCnxn.java:1432)
at org.apache.zookeeper.ClientCnxnSocket.readConnectResult(ClientCnxnSocket.java:154)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:86)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1283)
03 May 2024 10:11:46 (System) [] Invalid config event received: {server.2=dczoo03.usatlas.bnl.gov:2888:3888:participant, server.1=dczoo02.usatlas.bnl.gov:2888:3888:participant, server.0=dczoo01.usatlas.bnl.gov:2888:3888:participant, version=0}
We are using dCache 9.2.17. The issue also happened with the older version 9.2.6, by the way. The root cause is not clear; I did not notice any network problem between the pools and the ZooKeeper servers.
Regards, Qiulan
Hi Qiulan.
Which ZooKeeper version are you using? You have a cluster of 3 standalone ones, right?
The line Invalid config event received indicates that your ZooKeeper configuration is wrong; please check it.
Since 6.2, dCache requires the associated ZooKeeper servers to be of version 3.5 or above, but with ZooKeeper 3.5.0 the configuration format changed. Previously, the zoo.cfg files could contain a separate line stating the client port, clientPort=2181, which now needs to be appended to each server line: server.1=os-zk-node01:2888:3888;2181. As far as I know, newer versions were backward compatible and this was just a warning, but maybe this has changed.
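A minimal sketch of the two zoo.cfg styles, using the example hostname above (ports are the ZooKeeper defaults; a role suffix such as :participant may also appear, as in the config event logged above):
# pre-3.5 style: client port on its own line
clientPort=2181
server.1=os-zk-node01:2888:3888
# 3.5+ style: client port appended to each server line
server.1=os-zk-node01:2888:3888;2181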
As to the SessionExpiredException: Unable to reconnect to ZooKeeper service: is your ZooKeeper cluster stable, and is there anything logged on the ZooKeeper side?
Lea
Hello Lea,
Thank you for your reply.
We are using ZooKeeper 3.6.3. The logs on the ZooKeeper side show the session expiring too.
2024-05-03 10:11:46,254 [myid:0] - INFO [QuorumPeer[myid=0](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):ZooKeeperServer@1059] - Invalid session 0x10041c40b758d24 for client /10.42.38.120:43812, probably expired
Thanks, Qiulan
Yes, we have a cluster of three nodes, and it is stable.
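One way to double-check the quorum health is something like the following (a sketch; the installation path is only an example, and with ZooKeeper 3.6 the mntr command needs to be whitelisted via 4lw.commands.whitelist, while srvr is whitelisted by default):
# run on each of dczoo01/dczoo02/dczoo03
/opt/zookeeper/bin/zkServer.sh status             # should report one leader and two followers
echo srvr | nc dczoo01.usatlas.bnl.gov 2181       # mode, latency, outstanding requests
echo mntr | nc dczoo01.usatlas.bnl.gov 2181       # quorum metrics, e.g. zk_followers on the leader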
@QiulanHuang Do all dCache components run 9.2.17?
Hello @kofemann, we are running in mixed mode now. Most of the nodes are running 9.2.17, and around 10 newly added pool nodes are running 9.2.20.
Thanks, Qiulan
Hello
The goal is to simulate a host network disturbance and observe the pool's state.
The host dc234 is used for this simulation; it has a test pool enabled.
dcache pool ls
POOL DOMAIN META SIZE FREE PATH
dc234_1 dc234oneDomain db 117714G 134108G /data0/dcache_pool_1/pool
Pool restarted
22 Aug 2024 10:30:07 (System) [] Invalid config event received: {server.2=dczoo03.usatlas.bnl.gov:2888:3888:participant, server.1=dczoo02.usatlas.bnl.gov:2888:3888:participant, server.0=dczoo01.usatlas.bnl.gov:2888:3888:participant, version=0}
22 Aug 2024 10:30:07 (System) [] Invalid config event received: {server.2=dczoo03.usatlas.bnl.gov:2888:3888:participant, server.1=dczoo02.usatlas.bnl.gov:2888:3888:participant, server.0=dczoo01.usatlas.bnl.gov:2888:3888:participant, version=0}
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
22 Aug 2024 10:30:11 (dc234_1) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/data0/dcache_pool_1/pool/data;meta=/data0/dcache_pool_1/pool/meta]]].
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to enabled
State of the network, bond0 active peers:
[root@dc234 ~]# ifstat
#kernel
Interface RX Pkts/Rate TX Pkts/Rate RX Data/Rate TX Data/Rate
RX Errs/Drop TX Errs/Drop RX Over/Rate TX Coll/Rate
lo 546421K 0 546421K 0 1980M 0 1980M 0
0 0 0 0 0 0 0 0
Internal25G00 2405M 0 4114M 0 3974M 0 882022K 0
0 3319K 0 0 0 0 0 0
Internal25G01 3832M 0 2440M 0 3848M 0 3711M 0
0 2472K 0 0 0 0 0 0
bond0 1018M 0 2847M 0 2631M 0 3394M 0
0 10735K 0 0 0 0 0 0
Simulating partial network degradation by stopping one peer of the bond:
[root@dc234 ~]# ip link set dev Internal25G00 down
[root@dc234 ~]# tail -100f /var/log/messages|grep bond
Aug 22 10:35:34 dc234 kernel: bond0: link status definitely down for interface Internal25G00, disabling it
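The bonding driver's own view can be checked as well (abridged sketch of the expected output; the exact fields depend on the bonding mode):
[root@dc234 ~]# cat /proc/net/bonding/bond0
Slave Interface: Internal25G00
MII Status: down
...
Slave Interface: Internal25G01
MII Status: up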
[root@dc234 ~]# ifstat
#kernel
Interface RX Pkts/Rate TX Pkts/Rate RX Data/Rate TX Data/Rate
RX Errs/Drop TX Errs/Drop RX Over/Rate TX Coll/Rate
lo 60 0 60 0 5128 0 5128 0
0 0 0 0 0 0 0 0
Internal25G01 3563 0 4925 0 278760 0 3034K 0
0 0 0 0 0 0 0 0
bond0 7136 0 5629 0 534351 0 3209K 0
0 0 0 0 0 0 0 0
Checking the state of the pool:
[root@dc234 ~]# tail -f /var/log/dcache/dc234oneDomain.log
22 Aug 2024 10:30:07 (System) [] Invalid config event received: {server.2=dczoo03.usatlas.bnl.gov:2888:3888:participant, server.1=dczoo02.usatlas.bnl.gov:2888:3888:participant, server.0=dczoo01.usatlas.bnl.gov:2888:3888:participant, version=0}
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
22 Aug 2024 10:30:11 (dc234_1) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/data0/dcache_pool_1/pool/data;meta=/data0/dcache_pool_1/pool/meta]]].
22 Aug 2024 10:30:11 (dc234_1) [] Pool mode changed to enabled
22 Aug 2024 10:35:58 (System) [] Client session timed out, have not heard from server in 26680ms for session id 0x10041c40b759013
22 Aug 2024 10:35:58 (System) [] Session 0x10041c40b759013 for server dczoo02.usatlas.bnl.gov/10.42.34.242:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 26680ms for session id 0x10041c40b759013
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1251)
22 Aug 2024 10:35:58 (System) [] Invalid config event received: {server.2=dczoo03.usatlas.bnl.gov:2888:3888:participant, server.1=dczoo02.usatlas.bnl.gov:2888:3888:participant, server.0=dczoo01.usatlas.bnl.gov:2888:3888:participant, version=0}
The relevant error message is:
22 Aug 2024 10:35:58 (System) [] Client session timed out, have not heard from server in 26680ms for session id 0x10041c40b759013
22 Aug 2024 10:35:58 (System) [] Session 0x10041c40b759013 for server dczoo02.usatlas.bnl.gov/10.42.34.242:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 26680ms for session id 0x10041c40b759013
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1251)
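For context: the 26680 ms in this message is the ZooKeeper client's read timeout, which is roughly two thirds of the negotiated session timeout, so the session in this test was negotiated at about 40 s. The server bounds what clients may negotiate via tickTime (by default minSessionTimeout = 2 x tickTime and maxSessionTimeout = 20 x tickTime), so a sketch of the relevant zoo.cfg settings, with an illustrative value, would be:
tickTime=2000
# maxSessionTimeout defaults to 20*tickTime = 40000 ms; raising it lets clients
# survive longer network interruptions without their session expiring
maxSessionTimeout=60000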
2.1. Pool appears operational without restarting it
[dccore01] (dc234_1@dc234oneDomain) admin > pp get file 0000384776BCD2D349CB8CCF1ADF11952C87 dc262_6
Transfer Initiated
[dccore01] (dc234_1@dc234oneDomain) admin > rep ls
0000384776BCD2D349CB8CCF1ADF11952C87 <C-------X--L(0)[0]> 4967844 si={cgamboa:USERS}
Re-enabling the interface in the bond:
[root@dc234 ~]# ip link set dev Internal25G00 up
[root@dc234 ~]# ifstat
#kernel
Interface RX Pkts/Rate TX Pkts/Rate RX Data/Rate TX Data/Rate
RX Errs/Drop TX Errs/Drop RX Over/Rate TX Coll/Rate
lo 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
Internal25G00 2405M 0 4114M 0 3974M 0 882199K 0
0 3319K 0 0 0 0 0 0
Internal25G01 8005 0 4302 0 5612K 0 3538K 0
0 0 0 0 0 0 0 0
bond0 2405M 0 18446744073529M 0 18446744073394M 0 885738K 0
0 3319K 0 0 0 0 0 0
Simulating a failure of the other interface:
22 Aug 2024 10:43:19 (System) [] Client session timed out, have not heard from server in 26681ms for session id 0x10041c40b759013
22 Aug 2024 10:43:19 (System) [] Session 0x10041c40b759013 for server dczoo03.usatlas.bnl.gov/10.42.34.243:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 26681ms for session id 0x10041c40b759013
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1251)
The pool remains accessible
[dccore01] (dc234_1@dc234oneDomain) admin > rep ls
0000384776BCD2D349CB8CCF1ADF11952C87 <C-------X--L(0)[0]> 4967844 si={cgamboa:USERS}
Automatic restart of the pool due to the ZooKeeper event:
22 Aug 2024 10:43:19 (System) [] Session 0x10041c40b759013 for server dczoo03.usatlas.bnl.gov/10.42.34.243:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 26681ms for session id 0x10041c40b759013
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1251)
So, for network instabilities, the system appears to handle the failover accordingly. Would it be simple for dCache to start cataloging this type of error?
@cfgamboa thanks for the detailed update
Would it be simple for dCache to start cataloging this type of error?
I don't understand the question. What is the expected behavior?
For my simulation, the expected behavior is that the pool remains operational without having to be restarted when network degradation occurs. Could the error observed on the pool be classified? This could help when other similar errors occur, like the one reported initially in this ticket.