
Migration jobs stuck and Pool appears unresponsive

Open cfgamboa opened this issue 1 year ago • 1 comments

Hello,

dCache release 9.2.17

Pool dc267_10 appears to be stuck. There are currently 400 p2p transfers to the pool; the maximum number allowed was increased to 100. The transfers come from migration jobs.

For example, the migration job for the file 0000500691EC18CD4BF787E0AA7022D2E96B, size 8196943164, appears to have an active transfer:

[dccore03] (dcdoor31_1@dcdoor31oneDomain) admin > migration info 179

Command    : migration move -storage=MCTAPE:MC -permanent -concurrency=120 -eager -replicas=1 -target=pgroup -- MCTAPE-write
State      : RUNNING
Queued     : 0
Attempts   : 292992
Targets    : dc242_10,dc240_10,dc263_10,dc269_10,dc246_10,dc267_10,dc248_10,dc265_10,dc261_10,dc254_10,dc252_10,dc237_10,dc239_10,dc256_10,dc266_10,dc241_10,dc245_10,dc268_10,dc262_10,dc249_10,dc264_10,dc260_10,dc259_10,dc236_10,dc238_10,dc270_10,dc255_10
Completed  : 128736 files; 303479319986394 bytes; 99%
Total      : 303521277057964 bytes
Concurrency: 150
Running tasks:
[439494] 0000500691EC18CD4BF787E0AA7022D2E96B: TASK.Copying -> [dc267_10@local]
[439632] 00002C75947271B947F4802282CE5286B119: TASK.Copying -> [dc267_10@local]
[439879] 0000A1D47825261B485F8857749DC1863FD4: TASK.Copying -> [dc267_10@local]
[439990] 00004C03422A41694C60852B4FA131CEFE4B: TASK.Copying -> [dc267_10@local]
[440138] 0000467AD762815E428981B424846A7A6B35: TASK.Copying -> [dc267_10@local]
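To make the pattern in the running tasks easier to see, here is a minimal Python sketch that tallies them per target pool. The task lines are copied from the `migration info 179` output above; the regular expression and the helper name `tasks_per_target` are my own assumptions about the line layout, not part of dCache.

```python
import re
from collections import Counter

# Running-task lines copied from the `migration info 179` output above.
RUNNING_TASKS = """\
[439494] 0000500691EC18CD4BF787E0AA7022D2E96B: TASK.Copying -> [dc267_10@local]
[439632] 00002C75947271B947F4802282CE5286B119: TASK.Copying -> [dc267_10@local]
[439879] 0000A1D47825261B485F8857749DC1863FD4: TASK.Copying -> [dc267_10@local]
[439990] 00004C03422A41694C60852B4FA131CEFE4B: TASK.Copying -> [dc267_10@local]
[440138] 0000467AD762815E428981B424846A7A6B35: TASK.Copying -> [dc267_10@local]
"""

# Assumed line shape: [task id] pnfsid: state -> [target pool]
TASK_RE = re.compile(r"\[(\d+)\]\s+([0-9A-F]+):\s+(\S+)\s+->\s+\[([^\]]+)\]")


def tasks_per_target(text):
    """Count running tasks per (state, target pool) pair (hypothetical helper)."""
    return Counter(
        (state, target) for _task_id, _pnfsid, state, target in TASK_RE.findall(text)
    )


counts = tasks_per_target(RUNNING_TASKS)
for (state, target), n in counts.items():
    print(f"{target}: {n} task(s) in {state}")
```

All five listed tasks are in TASK.Copying and all point at dc267_10, even though the job has 27 targets to choose from.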

But there is no activity:

[root@dc267 data]# ls -l 0000500691EC18CD4BF787E0AA7022D2E96B
-rw-r--r-- 1 root root 1939860332 May 21 23:54 0000500691EC18CD4BF787E0AA7022D2E96B
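As a rough check of how far this transfer got, the size on disk can be compared with the size reported for the file. This is only a sketch: it assumes both numbers are byte counts for the same replica, and the function name `fraction_copied` is mine.

```python
EXPECTED_SIZE = 8196943164  # size reported for the file in the migration job
SIZE_ON_DISK = 1939860332   # size shown by `ls -l` on dc267


def fraction_copied(on_disk, expected):
    """Return the fraction of the file already present on the destination."""
    return on_disk / expected


pct = 100 * fraction_copied(SIZE_ON_DISK, EXPECTED_SIZE)
remaining = EXPECTED_SIZE - SIZE_ON_DISK
print(f"{pct:.1f}% copied, {remaining} bytes remaining")
```

Under those assumptions, less than a quarter of the file has arrived, and the modification time of May 21 23:54 has not changed since.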

Also the destination pool appears to be stuck


The load on the pool server is not high; however, commands like sweeper purge do not seem to take effect.

https://dcache.sdcc.bnl.gov/usatlas/pools/list/PoolManager//dmz-pools/spaces

There is no special information from System@dc267_10-Domain.

In debug mode, the pool shows only entries of this type:

22 May 2024 14:48:18 (dc267_10) [] Reloaded EACL namespace (signing_policy) from /etc/grid-security/certificates/fd3270d3.signing_policy.
22 May 2024 14:48:18 (dc267_10) [] Reloaded EACL namespace (signing_policy) from /etc/grid-security/certificates/fdf90b95.signing_policy.
22 May 2024 14:48:18 (dc267_10) [] Reloaded EACL namespace (signing_policy) from /etc/grid-security/certificates/f30dd6ad.signing_policy.
22 May 2024 14:48:18 (dc267_10) [] Reloaded EACL namespace (signing_policy) from /etc/grid-security/certificates/9629661e.signing_policy.
22 May 2024 14:49:00 (dc267_10) [] Reloaded CRL from file:/etc/grid-security/certificates/3fb4d8a6.r0.
22 May 2024 14:49:18 (dc267_10) [] Sweeper tries to reclaim 9223372036854775807 bytes.
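One detail in the last log line: the byte count is exactly 2^63 - 1, the largest signed 64-bit integer (Long.MAX_VALUE in Java). A short Python check is below. Reading this as "reclaim as much as possible" rather than as a measured amount is an assumption on my side and is not confirmed by the log.

```python
# Value taken from the log line "Sweeper tries to reclaim ... bytes."
requested = 9223372036854775807

# Largest value of a signed 64-bit integer (Long.MAX_VALUE in Java).
INT64_MAX = 2**63 - 1

is_sentinel = requested == INT64_MAX
print(is_sentinel)
```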

I have created a dump file, dc267_10-Domain_15756098301082517203.jfr, which I can send if needed.

Please advise,

All the best, Carlos

cfgamboa, May 22 '24 19:05