dcache
dcache copied to clipboard
info service stuck in 9.2
Hi,
We freshly upgraded dCache to 9.2.6, everything is ok except the info service which seems stuck during startup. The service is visible from the admindoor :
c-dCacheDomain-AAYMT5uTOvg@infoDomain
c-dCacheDomain-AAYMT5uTOvg-AAYMT5uTeXg@infoDomain
info@infoDomain
lm@infoDomain
RoutingMgr@infoDomain
System@infoDomain
But I can't connect to it
[ccdcamcli06] (System@infoDomain) ageorget > \c info
(1) Cell does not exist.
Dec 12 14:15:01 ccdcamcli06 systemd[1]: Started dCache infoDomain domain.
Dec 12 14:15:27 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 14:15:27 (info) [lZU:LM admin] Discarding PingMessage because its age of 14391 ms exceeds its time to live of 900 ms.
Dec 12 14:16:50 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 14:16:50 (info) [httpd InfoGetSerialisedData] Discarding InfoGetSerialisedDataMessage because its age of 45884 ms exceeds its time to live of 3
Dec 12 14:17:01 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 14:17:01 (info) [lZU:LM admin] Discarding PingMessage because its age of 50253 ms exceeds its time to live of 900 ms.
Dec 12 14:17:38 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 14:17:38 (info) [frontend] Discarding 'xgetcellinfo' because its age of 55972 ms exceeds its time to live of 54000 ms.
Dec 12 14:19:14 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 14:19:14 (info) [frontend] Discarding 'xgetcellinfo' because its age of 92430 ms exceeds its time to live of 54000 ms.
Dec 12 14:20:26 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 14:20:26 (info) [lZU:LM admin] Discarding PingMessage because its age of 112596 ms exceeds its time to live of 900 ms.
Dec 12 14:20:34 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 14:20:34 (info) [frontend] Discarding 'xgetcellinfo' because its age of 111797 ms exceeds its time to live of 54000 ms.
Dec 12 14:22:41 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 14:22:41 (info) [frontend] Discarding 'xgetcellinfo' because its age of 178852 ms exceeds its time to live of 54000 ms.
I can connect to System@infoDomain service but I can't find any issue. Looks like the service is stuck somewhere.
[ccdcamcli06] (System@infoDomain) ageorget > ps -f
Name State Queue Q-time/ms Threads Class Additional info
c-dCacheDomain-AAYMT9az8vg-AAYMT9a0b_g A 0 0 2 LocationMgrTunnel Connected to dCacheDomain
lm A 0 0 0 LocationManager
RoutingMgr A 0 0 1 CoreRoutingManager
System A 0 0 3 SystemCell infoDomain:IOrec=18;IOexc=0;MEM=177010328
info A 605 104139 4 UniversalSpringCell
c-dCacheDomain-AAYMT9az8vg A 0 0 1 LocationManagerConnector Connected
[ccdcamcli06] (System@infoDomain) ageorget > getcellinfos
c-dCacheDomain-AAYMT9az8vg-AAYMT9a0b_gA 0 2 LocationMgrTunnel Connected to dCacheDomain
lm A 0 0 LocationManager lm
RoutingMgr A 0 1 CoreRoutingManager RoutingMgr
System A 0 3 SystemCell infoDomain:IOrec=8;IOexc=0;MEM=250689088
info A 4074 UniversalSpringCell info
I checked that all services are Up and running (doors, pools, core services), nothing looks unresponsive except info service. The service is configured in a dedicated domain and all other domains (Poolmanager, topo, httpd, frontend, billing, gplazma) on the server are ok (and zookeeper)
[infoDomain]
[infoDomain/info]
info.cell.name=info
Anything I can do to debug this and unstuck it?
Cheers, Adrien
Hello Adrien.
Does \c info@infoDomain work?
Is there anything of note in the domain's log file?
Hi Lea,
No I can't connect to it even if info service is listed in the admindoor :
[ccdcamcli06] (local) ageorget > \c info@infoDomain
(1) Cell does not exist.
I can connect to his System domain (and others like lm, RoutingMgr, etc)
[ccdcamcli06] (System@infoDomain) ageorget > \c System@infoDomain
[ccdcamcli06] (System@infoDomain) ageorget > info
CellDomainName = infoDomain
Zone = (none)
Message payload serializer = standard
I/O rcv=62;asw=61;frw=0;rpy=61;exc=0
Memory : tot=1048576000;free=937927680;used=110648320
Cells (Threads)
c-dCacheDomain-AAYMT9az8vg-AAYMT9a0b_g(c-dCacheDomain-AAYMT9az8vg-AAYMT9a0b_g-0Tunnel,)
lm()
RoutingMgr(RoutingMgr-1)
System(System-0-SendThread(ccdcamcli07:2181)System-0-EventThread,System-12,)
info(info-0XmlConduit conduit,DGA-Scheduler,pool-6-thread-1,)
c-dCacheDomain-AAYMT9az8vg(TunnelConnector-dCacheDomain)
The log file only contains this kind of message :
Dec 12 14:54:05 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 14:54:05 (info) [frontend] Discarding 'xgetcellinfo' because its age of 1042341 ms exceeds its time to live of 54000 ms.
Dec 12 14:56:12 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 14:56:12 (info) [frontend] Discarding 'xgetcellinfo' because its age of 1109417 ms exceeds its time to live of 54000 ms.
Dec 12 14:58:13 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 14:58:13 (info) [frontend] Discarding 'xgetcellinfo' because its age of 1170747 ms exceeds its time to live of 54000 ms.
Dec 12 15:00:18 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 15:00:18 (info) [frontend] Discarding 'xgetcellinfo' because its age of 1235222 ms exceeds its time to live of 54000 ms.
Dec 12 15:00:43 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 15:00:43 (info) [httpd InfoGetSerialisedData] Discarding InfoGetSerialisedDataMessage because its age of 1236848 ms exceeds its time to live of 3600 ms.
Dec 12 15:01:32 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 15:01:32 (info) [frontend] Discarding 'xgetcellinfo' because its age of 1249628 ms exceeds its time to live of 54000 ms.
Dec 12 15:02:02 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 15:02:02 (info) [httpd InfoGetSerialisedData] Discarding InfoGetSerialisedDataMessage because its age of 1258101 ms exceeds its time to live of 3600 ms.
Dec 12 15:02:51 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 15:02:51 (info) [frontend] Discarding 'xgetcellinfo' because its age of 1268955 ms exceeds its time to live of 54000 ms.
Dec 12 15:03:07 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 15:03:07 (info) [httpd InfoGetSerialisedData] Discarding InfoGetSerialisedDataMessage because its age of 1278576 ms exceeds its time to live of 3600 ms.
Dec 12 15:04:57 ccdcamcli06 dcache@infoDomain[109136]: 12 Dec 2023 15:04:57 (info) [frontend] Discarding 'xgetcellinfo' because its age of 1334417 ms exceeds its time to live of 54000 ms.
Well I managed to run info service on a pool node, with a similar configuration (CentOS7, openjdk version "11.0.21").
I tried on the 3 different headnodes and it fails. On 2 different pool nodes it works...
[ccdcamcli06] (info@infoDomain2) ageorget > info
--- info (support for CLI) ---
Overview of the info cell:
1 conduit (1 enabled)
16 data-gathering activities.
4 available serialisers.
6 state watchers.
0 pending updates to state.
Did this work before the upgrade? Did you change your layout at all? If you ensure that there is only one core domain in te system, does the problem still show up?
It was working well before the upgrade. I did not change the layout conf, the only modification made was for cleaner-disk/cleaner-hsm naming on another headnode. Only one broker core is still configured on one of the headnodes
[dCacheDomain]
dcache.broker.scheme=core
[dCacheDomain/poolmanager]
...
[infoDomain]
[infoDomain/info]
I tried on the 3 different headnodes and it fails. On 2 different pool nodes it works...
So a headnode in your case is not necessarily a core domain, is that right? And it fails on every head node, but not on other machines? How is a core domain different from other ones in your system?
We have 3 headnodes running dCache core services. The first one (with dcache.broker.scheme=core) with gPlazma, PoolManager, billing, topo, info, httpd. The 2 others are running HA critical services like gPlazma, pnfsManager, srmmanager, Poolmanager, Spacemanager, ...
As info service failed to start on the first one, I tried to set up it on the 2 others but it also fails. So I tried on 2 different pool servers (el7, java11 / el8, java17) and it works on both...
Could it be a conflict with older dCache library or something like that?
Can it be a conflict with older dCache library or something like that?
Is your dCache system not at version 9.2.6 everywhere?
This needs more investigating. For some reason the message queue to/from info is apparently overloaded for some reason, which is also why you cannot connect and have such a large backlog. What I don't yet understand is why this works on some nodes and not on others, if that is indeed reliably the case.
Could you please provide a java flight recording for when it fails? If you manage to connect to the info cell before it apparently freezes, you can start it via the admin interface.
Version 9.2.6 is installed on all servers. I already try to enable debug log from the admin but it's impossible to connect to the cell, even in the first seconds. Is there a way to add some some Java debug params in the conf to get more information from the cell startup?
You can set the log level to debug by adding the follwing to your layout file as part of the infoDomain domain configuration:
dcache.log.level.file=debug
You can start a flight recording via the System cell in the admin interface which you seem to be able to access:
jfr enable, then later, after collecting for a while: jfr disable <outFileName>.
Be aware that both log and jfr files will likely grow very quickly.
Here is the debug log and flight record from System@infoDomain cell :
Thank you. Unfortunately I was not yet able to find the cause of the problem. Could you please collect a similar kind of log from the info domain but set dcache.log.level.file=trace for more details? Let it run for ~1 minute and verify that you see the info service being stuck, so we capture relevant data.
Hi Lea,
Here the same log files with trace level enabled infoDomain_12376021140723890342.jfr.gz infoDomain.log.gz