
routing tables inconsistent

Open calestyo opened this issue 4 years ago • 17 comments

Hey.

I've reported this several times before, at least in RT issue 9641.

A major problem (which persists up to and including 6.0.9) is that, quite often, after restarting either the whole dCache instance or parts of it, the routing tables of the core domains end up in an inconsistent state.

When this happens, the PoolManager, for example, thinks the pools are there and fine, but they are not... transfers fail with cryptic messages, and there doesn't seem to be any sort of fallback to the core domains where the routes are correct... the whole ZK setup seems to be pretty fragile while providing no real benefit (i.e. despite running three instances and three core domains, we're actually less stable).

The following is what I just got after restarting our whole dCache cluster:

# diff <( dcache_admin "\s System@core0 route" | sort | sed "s/l-.*//" ) <( dcache_admin "\s System@core1 route" | sort | sed "s/l-.*//"  )
6c6
< DOMAIN                         *@core1               
---
> DOMAIN                         *@core0                   c-core0-AAWfXEK-Vfg-AAWfXELBQBA@core1              
120d119
< DOMAIN                         *@pool_lcg-lrz-dc15_7 
319d317
< DOMAIN                         *@pool_lcg-lrz-dc64_6 
355,356c353,354
< QUEUE                        acm@*                                                   acm@core0              
< QUEUE                        acm@*                                                     *@core1              
---
> QUEUE                        acm@*                                                   acm@core1              
> QUEUE                        acm@*                                                     *@core0              
457d454
< QUEUE             lcg-lrz-dc15_7@*                                                     *@pool_lcg-lrz-dc15_7
656d652
< QUEUE             lcg-lrz-dc64_6@*                                                     *@pool_lcg-lrz-dc64_6
694c690
< QUEUE                       topo@*                                                     *@core1              
---
> QUEUE                       topo@*                                                     *@core0              
696c692
< QUEUE                       topo@*                                                  topo@core0              
---
> QUEUE                       topo@*                                                  topo@core1              

And:

# diff <( dcache_admin "\s System@core0 route" | sort | sed "s/l-.*//" ) <( dcache_admin "\s System@core2 route" | sort | sed "s/l-.*//"  )
6,7c6,7
< DOMAIN                         *@core1               
< DOMAIN                         *@core2               
---
> DOMAIN                         *@core0                   c-core0-AAWfXELYUlA-AAWfXELZmnA@core2              
> DOMAIN                         *@core1                   c-core1-AAWfXELXhzA-AAWfXELZtcg@core2              
47d46
< DOMAIN                         *@pool_lcg-lrz-dc08_4 
95d93
< DOMAIN                         *@pool_lcg-lrz-dc13_2 
110d107
< DOMAIN                         *@pool_lcg-lrz-dc14_7 
116d112
< DOMAIN                         *@pool_lcg-lrz-dc15_3 
144d139
< DOMAIN                         *@pool_lcg-lrz-dc18_1 
192d186
< DOMAIN                         *@pool_lcg-lrz-dc22_9 
246d239
< DOMAIN                         *@pool_lcg-lrz-dc28_3 
355c348,349
< QUEUE                        acm@*                                                   acm@core0              
---
> QUEUE                        acm@*                                                   acm@core2              
> QUEUE                        acm@*                                                     *@core0              
357d350
< QUEUE                        acm@*                                                     *@core2              
384d376
< QUEUE             lcg-lrz-dc08_4@*                                                     *@pool_lcg-lrz-dc08_4
432d423
< QUEUE             lcg-lrz-dc13_2@*                                                     *@pool_lcg-lrz-dc13_2
447d437
< QUEUE             lcg-lrz-dc14_7@*                                                     *@pool_lcg-lrz-dc14_7
453d442
< QUEUE             lcg-lrz-dc15_3@*                                                     *@pool_lcg-lrz-dc15_3
481d469
< QUEUE             lcg-lrz-dc18_1@*                                                     *@pool_lcg-lrz-dc18_1
529d516
< QUEUE             lcg-lrz-dc22_9@*                                                     *@pool_lcg-lrz-dc22_9
583d569
< QUEUE             lcg-lrz-dc28_3@*                                                     *@pool_lcg-lrz-dc28_3
693a680
> QUEUE                       topo@*                                                     *@core0              
695,696c682
< QUEUE                       topo@*                                                     *@core2              
< QUEUE                       topo@*                                                  topo@core0              
---
> QUEUE                       topo@*                                                  topo@core2              

dCache itself seems to never automatically recover from such a state; one really has to restart the affected domains... which is hard to understand... the domains should just retry registering themselves, and dCache should perhaps also notice that the routing tables differ.

In 2019 I wrote an Icinga check that tests this automatically, but the problem already existed longer than that. git://git.srv.scientia.net/calestyo/lmu/tools/dcache/check_compare_core_domains_routing_tables

There is most likely a similar problem in the topo map, i.e. the output of \s topo@core0 gettopomap, which can differ between core domains. However, given that the output format is ... uhm... strange and in no way sorted, I wasn't able to write a check for this (without considerable effort).
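
In case it helps, here is a rough sketch of what such a comparison might look like, using the same dcache_admin wrapper as below; it assumes (perhaps wrongly, given the unsorted output just described) that sorting lines is enough to normalize the topo map:

# hypothetical topo map comparison between core0 and core1;
# line-wise sorting may not be sufficient for nested output
diff <( dcache_admin "\s topo@core0 gettopomap" | sort ) \
     <( dcache_admin "\s topo@core1 gettopomap" | sort )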

Cheers, Chris.

calestyo avatar Feb 25 '20 02:02 calestyo

Anything new on this...? This makes dCache pretty fragile, because every time domains start up, some may not be registered correctly.

This still happens with 7.2.10... and actually seems to have gotten worse:

CRITICAL: Routing tables of core domains differ.
core0 ↔ core1:
	DOMAIN  *@srm_lcg-lrz-srm          
	QUEUE   srm_lcg-lrz-srm@*          *@srm_lcg-lrz-srm
	QUEUE   SrmManager@*               *@srm_lcg-lrz-srm
	TOPIC   LoginBrokerRequestTopic@*  *@srm_lcg-lrz-srm
	TOPIC   WatchedServiceTopic@*      *@srm_lcg-lrz-srm
	TOPIC   LoginBrokerTopic@*         *@srm_lcg-lrz-srm
	TOPIC   PoolMonitorTopic@*         *@srm_lcg-lrz-srm

between our 3 core domains (core0, core1, core2)... even after restarting all pools often enough that each of them eventually showed up correctly in the routing tables, the differences above remain.

Why do they show up in core0 and core2, but not core1?

calestyo avatar Feb 22 '22 19:02 calestyo

That SRM door was then also basically broken... it gave an endless stream of these:

Feb 22 20:43:16 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:43:16+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:43:31 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:43:31+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:43:31 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:43:31+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:43:37 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:43:37+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:43:46 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:43:46+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:44:01 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:44:01+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:44:07 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:44:07+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:44:15 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:44:15+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.
Feb 22 20:44:15 lcg-lrz-srm dcache@srm_lcg-lrz-srm[28819]: 2022-02-22 20:44:15+01:00 (srm_lcg-lrz-srm) [] Internal server error: Request to [>gPlazma@local] timed out.

and didn't recover that broken routing table entry by itself.

After restarting the SRM door, the routing tables were back to (what I'd consider) OK.

Before today’s upgrade (from 6.0.18 to 7.2.10), this had only ever happened with pools... never anything else. Since at least 7.2.10, it also happens to other components.

:-(

calestyo avatar Feb 22 '22 19:02 calestyo

I restarted ZK (one by one, so that at least two were always running)... but still one pool fell out:

Feb 23 01:59:15 lcg-lrz-dc27 dcache@pool_lcg-lrz-dc27_1[32611]: 2022-02-23 01:59:15+01:00 (c-core1-AAXYoMmquGg-AAXYoMmtb7g) [core1,7.2.10,CORE] Error while reading from tunnel: java.net.SocketException: Socket closed
Feb 23 02:00:04 lcg-lrz-dc27 dcache@pool_lcg-lrz-dc27_1[32611]: 2022-02-23 02:00:04+01:00 (System) [] Invalid config event received: {server.2=lcg-lrz-dcache1.grid.lrz.de:2182:2183:participant, server.1=lcg-lrz-dcache0.grid.lrz.de:2182:2183:participant, server.3=lcg-lrz-dcache2.grid.lrz.de:2182:2183:participant, version=0}
Feb 23 02:00:06 lcg-lrz-dc27 dcache@pool_lcg-lrz-dc27_1[32611]: 2022-02-23 02:00:06+01:00 (c-core2-AAXYoMmpGlg-AAXYoMms7tA) [core2,7.2.10,CORE] Error while reading from tunnel: java.net.SocketException: Socket closed
Feb 23 02:00:17 lcg-lrz-dc27 dcache@pool_lcg-lrz-dc27_1[32611]: 2022-02-23 02:00:17+01:00 (c-core2-AAXYpPxn8eA-AAXYpPxoAYE) [core2,7.2.10,CORE] Error while reading from tunnel: java.net.SocketException: Socket closed

and also leading to inconsistencies in the routing table:

core0 ↔ core1:
DOMAIN  *@pool_lcg-lrz-dc27_1  
QUEUE   lcg-lrz-dc27_1@*       *@pool_lcg-lrz-dc27_1

core0 ↔ core2:
DOMAIN  *@pool_lcg-lrz-dc27_1  
QUEUE   lcg-lrz-dc27_1@*       *@pool_lcg-lrz-dc27_1

btw: This is the check I'm using to compare the routing tables: https://github.com/dCache/monitoring/tree/master/monitoring-plugins/check_compare_core_domains_routing_tables

calestyo avatar Feb 23 '22 01:02 calestyo

Hi Chris.

Which ZooKeeper version are you running?

Thanks, Lea

lemora avatar Feb 23 '22 09:02 lemora

Could you verify that each of your ZK servers has the same information about the core nodes?

To do this, you can use a command like:

bin/zkCli.sh -server NODE:2181 ls /dcache/lm/cores-uri

I'm assuming you have your ZK nodes configured to use the standard port (2181). Also NODE should be replaced with the hostname of a ZK node.

The output from this command should list the core domains:

[root@prometheus zookeeper]# bin/zkCli.sh -server localhost:2181 ls /dcache/lm/cores-uri
Connecting to localhost:2181

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[dCacheDomain]
[root@prometheus zookeeper]# 

Here, there is just one core domain dCacheDomain
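
To compare what every ZK server reports, you could wrap this in a small loop (a minimal sketch; node1/node2/node3 are placeholders for your ZK hostnames, and the grep just picks out the bracketed node list):

# query each ZK server directly and print its view of the registered core domains
for zk in node1 node2 node3; do
    echo "== ${zk} =="
    bin/zkCli.sh -server "${zk}:2181" ls /dcache/lm/cores-uri 2>&1 | grep -E '^\['
done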

paulmillar avatar Feb 23 '22 09:02 paulmillar

@lemora

Which ZooKeeper version are you running?

It's 3.5.8-3, i.e. a Debianification of Tigran’s RPM.

@paulmillar

bin/zkCli.sh -server NODE:2181 ls /dcache/lm/cores-uri

When querying each of the ZKs directly, it gives on all of them:

2022-02-23 13:50:23,560 - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1394] - Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x30004fb1d67009c, negotiated timeout = 30000

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[core0, core1, core2]

(The first INFO log line differs, but that shouldn't matter.) This was right after all of dCache had been stopped and only the core domains (i.e. no pools, no doors, no SRM) were running.

Then I started the rest of dCache, after which my check reported the following differences in routing:

CRITICAL: Routing tables of core domains differ.
core0 ↔ core1:
	DOMAIN	*@pool_lcg-lrz-dc24_6	
	DOMAIN	*@pool_lcg-lrz-dc36_5	
	QUEUE 	lcg-lrz-dc24_6@*     	*@pool_lcg-lrz-dc24_6
	QUEUE 	lcg-lrz-dc36_5@*     	*@pool_lcg-lrz-dc36_5

core0 ↔ core2:
	DOMAIN	*@pool_lcg-lrz-dc24_6 	
	DOMAIN	*@pool_lcg-lrz-dc52_10	
	DOMAIN	*@pool_lcg-lrz-dc53_3 	
	QUEUE 	lcg-lrz-dc52_10@*     	*@pool_lcg-lrz-dc52_10
	QUEUE 	lcg-lrz-dc24_6@*      	*@pool_lcg-lrz-dc24_6
	QUEUE 	lcg-lrz-dc53_3@*      	*@pool_lcg-lrz-dc53_3

which then, at least in my experience, never resolve automatically, but require restarts of these pools.

However, even with the issue present, the above zkCli command still shows all three core nodes.

calestyo avatar Feb 23 '22 12:02 calestyo

You should find a .zookeeper file on the SRM node. The file should have a name like /var/log/dcache/srm_lcg-lrz-srm.zookeeper.

Could you somehow attach an extract from that file?

The extract should end at the time you noticed the problem above (core1 missing the SRM domain), and it should start at the most recent startup of that domain before you noticed the problem (i.e., from the domain start until you noticed the problem).

paulmillar avatar Feb 23 '22 18:02 paulmillar

If you mean the one from https://github.com/dCache/dcache/issues/5326#issuecomment-1048149959 ... then that's unfortunately already gone; I had cleared the files after completely restarting everything.

It now contains just:

2022-02-23 13:47:06,266 [CURATOR: STARTED] connection state now CONNECTED
2022-02-23 13:47:06,269 [CURATOR: STARTED] event: type=WATCHED, name=null, path=null, rc=3, children=null

calestyo avatar Feb 23 '22 18:02 calestyo

Here is the output from the ZooKeeper log of one of the domains that was affected by this earlier today:

# cat pool_lcg-lrz-dc24_6.log
2022-02-23 13:47:13,090 [CURATOR: STARTED] connection state now CONNECTED
2022-02-23 13:47:13,175 [CURATOR: STARTED] event: type=WATCHED, name=null, path=null, rc=3, children=null
2022-02-23 15:03:20,393 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/inotify, rc=0, children=null
2022-02-23 15:03:21,501 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core2, rc=0, children=null
2022-02-23 15:03:21,502 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core1, rc=0, children=null
2022-02-23 15:03:21,502 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core0, rc=0, children=null
2022-02-23 15:03:21,502 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri, rc=0, children=null
2022-02-23 15:03:21,534 [CURATOR: STOPPED] event: type=CLOSING, name=null, path=null, rc=0, children=null
2022-02-23 15:03:24,016 [CURATOR: STARTED] connection state now CONNECTED
2022-02-23 15:03:24,016 [CURATOR: STARTED] event: type=WATCHED, name=null, path=null, rc=3, children=null

(this already includes the restart of the domain)

calestyo avatar Feb 23 '22 20:02 calestyo

Thanks for sharing. It doesn't look like there were any problems, at least at the curator level.

paulmillar avatar Feb 24 '22 11:02 paulmillar

Is there anything else I can do for debugging?

I mean, it's rather annoying to reproduce the problem (since that means breaking transfers), but it's also pretty easy (simply restart a bunch of pools at the same time... and surely some will break)... so I should be able to do it if necessary.

calestyo avatar Feb 24 '22 14:02 calestyo

btw: Is it foreseen in the code that... whatever the domains do when they initially register... they retry this when it fails? Or is there something like a test of whether it succeeded with all core domains?

calestyo avatar Feb 24 '22 14:02 calestyo

Thanks for the info so far.

I have another question.

All domains should establish a TCP connection to all core domains. This uses port 11111 by default.

If a domain is missing from one of the core domain's routing table, is there also a corresponding absence of this messaging TCP connection from the missing domain to that core domain?

Cheers, Paul.

paulmillar avatar Feb 24 '22 15:02 paulmillar

I've repeatedly restarted some currently unused pools until the issue was triggered (needed only two restarts ^^):

# check_compare_core_domains_routing_tables core0 core1 core2
CRITICAL: Routing tables of core domains differ.
core0 ↔ core2:
	DOMAIN	*@pool_lcg-lrz-dc51_5	
	QUEUE 	lcg-lrz-dc51_5@*     	*@pool_lcg-lrz-dc51_5

In general:

grep dcache-cells-plain /etc/services 
dcache-cells-plain	11111/tcp		# dCache Cells Messaging
dcache-cells-plain	11111/udp		# dCache Cells Messaging

And specifically on that node:

tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:41548	lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain	ESTABLISHED	336185/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:51670	lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain	ESTABLISHED	336185/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:33050	lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain	ESTABLISHED	336185/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:42100	lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678165/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:52224	lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678165/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:33600	lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678165/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:42116	lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678166/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:52232	lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678166/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:33620	lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678166/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:42072	lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678167/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:52194	lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678167/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:33572	lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678167/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:42064	lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678168/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:52188	lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678168/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:33566	lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678168/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:42088	lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678169/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:52210	lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678169/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:33588	lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678169/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:42086	lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678170/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:52206	lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678170/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:33584	lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678170/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:42076	lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678171/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:52200	lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678171/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:42118	lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678172/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:52236	lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678172/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:33608	lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678172/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:42112	lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678173/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:52234	lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678173/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:33610	lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678173/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:42124	lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678174/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:52248	lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678174/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:33626	lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678174/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:42096	lcg-lrz-dcache0.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678175/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:52218	lcg-lrz-dcache1.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678175/java
tcp6	0	0	lcg-lrz-dc51.grid.lrz.de:33596	lcg-lrz-dcache2.grid.lrz.de:dcache-cells-plain	ESTABLISHED	678175/java
# dcache status | grep 678171
pool_lcg-lrz-dc51_5  running (for 11 minutes) 678171 dcache /var/log/dcache/pool_lcg-lrz-dc51_5.log  

One sees that PID 678171 only has connections to core0 and core1, which matches what my check shows (it always diffs the first specified core domain against each of the others, not the 2nd against the 3rd and so on).
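
A more compact way to list the core endpoints a given domain JVM is connected to (a sketch assuming ss from iproute2 and numeric ports; 678171 is the pool's PID from above):

# list distinct peer endpoints on the cells port (11111) for PID 678171
ss -tnp | awk -v pid=678171 '$5 ~ /:11111$/ && $0 ~ "pid="pid"," { print $5 }' | sort -u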

# cat  /var/log/dcache/pool_lcg-lrz-dc51_5.log
Feb 23 13:47:09 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: WARNING: An illegal reflective access operation has occurred
Feb 23 13:47:09 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: WARNING: Illegal reflective access by org.nustaq.serialization.FSTClazzInfo (file:/usr/share/dcache/classes/fst-2.56.jar) to field java.lang.String.value
Feb 23 13:47:09 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: WARNING: Please consider reporting this to the maintainers of org.nustaq.serialization.FSTClazzInfo
Feb 23 13:47:09 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
Feb 23 13:47:09 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: WARNING: All illegal access operations will be denied in a future release
Feb 23 13:47:19 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: 2022-02-23 13:47:19+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
Feb 23 13:47:19 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: 2022-02-23 13:47:19+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
Feb 23 13:47:19 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: 2022-02-23 13:47:19+01:00 (lcg-lrz-dc51_5) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/srv/dcache/pools/5/data;meta=/srv/dcache/pools/5/meta]]].
Feb 23 13:47:19 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: 2022-02-23 13:47:19+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to enabled
Feb 24 17:08:23 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[336180]: 2022-02-24 17:08:23+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Shutdown
Feb 24 17:09:05 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: WARNING: An illegal reflective access operation has occurred
Feb 24 17:09:05 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: WARNING: Illegal reflective access by org.nustaq.serialization.FSTClazzInfo (file:/usr/share/dcache/classes/fst-2.56.jar) to field java.lang.String.value
Feb 24 17:09:05 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: WARNING: Please consider reporting this to the maintainers of org.nustaq.serialization.FSTClazzInfo
Feb 24 17:09:05 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
Feb 24 17:09:05 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: WARNING: All illegal access operations will be denied in a future release
Feb 24 17:09:15 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: 2022-02-24 17:09:15+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
Feb 24 17:09:15 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: 2022-02-24 17:09:15+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
Feb 24 17:09:15 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: 2022-02-24 17:09:15+01:00 (lcg-lrz-dc51_5) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/srv/dcache/pools/5/data;meta=/srv/dcache/pools/5/meta]]].
Feb 24 17:09:15 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: 2022-02-24 17:09:15+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to enabled
Feb 24 17:09:39 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[672262]: 2022-02-24 17:09:39+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Shutdown
Feb 24 17:09:51 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: WARNING: An illegal reflective access operation has occurred
Feb 24 17:09:51 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: WARNING: Illegal reflective access by org.nustaq.serialization.FSTClazzInfo (file:/usr/share/dcache/classes/fst-2.56.jar) to field java.lang.String.value
Feb 24 17:09:51 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: WARNING: Please consider reporting this to the maintainers of org.nustaq.serialization.FSTClazzInfo
Feb 24 17:09:51 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
Feb 24 17:09:51 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: WARNING: All illegal access operations will be denied in a future release
Feb 24 17:10:01 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: 2022-02-24 17:10:01+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
Feb 24 17:10:01 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: 2022-02-24 17:10:01+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
Feb 24 17:10:01 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: 2022-02-24 17:10:01+01:00 (lcg-lrz-dc51_5) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/srv/dcache/pools/5/data;meta=/srv/dcache/pools/5/meta]]].
Feb 24 17:10:01 lcg-lrz-dc51 dcache@pool_lcg-lrz-dc51_5[678171]: 2022-02-24 17:10:01+01:00 (lcg-lrz-dc51_5) [] Pool mode changed to enabled

There you can see my last two restarts... and that the pool thinks it has been correctly enabled (in the PoolManager).

# cat  /var/log/dcache/zookeeper/pool_lcg-lrz-dc51_5.log
2022-02-24 17:08:23,951 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/inotify, rc=0, children=null
2022-02-24 17:08:26,020 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core1, rc=0, children=null
2022-02-24 17:08:26,020 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core0, rc=0, children=null
2022-02-24 17:08:26,021 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core2, rc=0, children=null
2022-02-24 17:08:26,021 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri, rc=0, children=null
2022-02-24 17:08:26,640 [CURATOR: STOPPED] event: type=CLOSING, name=null, path=null, rc=0, children=null
2022-02-24 17:09:04,565 [CURATOR: STARTED] connection state now CONNECTED
2022-02-24 17:09:04,566 [CURATOR: STARTED] event: type=WATCHED, name=null, path=null, rc=3, children=null
2022-02-24 17:09:39,747 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/inotify, rc=0, children=null
2022-02-24 17:09:41,811 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core0, rc=0, children=null
2022-02-24 17:09:41,811 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri, rc=0, children=null
2022-02-24 17:09:41,811 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core1, rc=0, children=null
2022-02-24 17:09:41,811 [CURATOR: STARTED] event: type=REMOVE_WATCHES, name=null, path=/dcache/lm/cores-uri/core2, rc=0, children=null
2022-02-24 17:09:42,121 [CURATOR: STOPPED] event: type=CLOSING, name=null, path=null, rc=0, children=null
2022-02-24 17:09:51,071 [CURATOR: STARTED] event: type=WATCHED, name=null, path=null, rc=3, children=null
2022-02-24 17:09:51,089 [CURATOR: STARTED] connection state now CONNECTED

Cheers, Chris.

calestyo avatar Feb 24 '22 16:02 calestyo

OK, good.

So, the most likely explanation is that the problematic domain (the pool domain, in the latest example) believes there are only two core domains. It simply doesn't connect to the third one (core2).

We just need to find out why.

paulmillar avatar Feb 25 '22 09:02 paulmillar

Anything further I can do here?

btw: It's really great that we can look into this issue... but for this one there is a decent workaround... much more pressing would be the problem reported in the other ticket as well as on [email protected], namely that space reservation reporting in the info service is broken, and thus so is the SRR data... we already got ticketed because of that.

calestyo avatar Feb 25 '22 15:02 calestyo

Since our upgrade from 8.2.11 to 9.1.2 (plus a Debian upgrade from 11 to 12 and PostgreSQL 13 -> 15), this whole issue seems to have gotten considerably worse:

Previously, only pool nodes were ever affected... and when I restarted the whole dCache cluster, perhaps ~4 of the pools were affected.

That number has doubled, and core domains and doors are now also affected and need to be restarted to get their routing fixed.

calestyo avatar Aug 03 '23 11:08 calestyo