
HSM flushing stuck

Open ageorget opened this issue 3 years ago • 35 comments

Hi,

Since we upgraded to dCache 6.2.35 (from 5.2), I have occasionally observed a tape flushing pool getting stuck and no longer flushing files to the HSM, for no apparent reason.

There are many files on the pool waiting to be flushed, and the next flush is scheduled, but nothing happens:


--- flush (Controller for centralising flushing) ---
   Flush interval                : 60000 ms
   Maximum classes flushing      : 1000
   Minimum flush delay on error  : 60000 ms
   Next flush                    : Mon Jan 24 12:12:47 CET 2022

> flush ls
Class                 Active   Error  Last/min  Requests    Failed
datatape:atlas@osm         1       4      1046      1352         0

--- storagehandler (Nearline storage manager) ---
 Restore Timeout  : 86400 seconds
   Store Timeout  : 14400 seconds
  Remove Timeout  : 14400 seconds
  Job Queues (active/queued)
    to store   2/0
    from store 0/0
    delete     0/0


The flush ls command shows 1 active flush, but no flush process is actually alive anymore, and the Nearline storage manager still shows 2 active store requests. In the pool pinboard, the last messages I see are:
24 Jan 2022 11:09:47 [flush] [] Flushing SCI=mctape:atlas@osm;def=false;exp=0;pend=0;maxTotal=0;waiting=1
24 Jan 2022 11:09:47 [pool-10-thread-12245] [] remove entry 0000E34B76CD96F6442EAC7C5A6BD1135AA6: file deleted before being flushed
24 Jan 2022 11:09:47 [pool-32-thread-6219] [] Flush of 0000E34B76CD96F6442EAC7C5A6BD1135AA6 failed with: CacheException(rc=10001;msg=File not found in name space during pre-flush check.).

In the pool log file, I see repeated attempts to kill 2 webdav TPC transfers (67281114 and 67281045). Flushing stopped working at 10:48:

24 Jan 2022 10:38:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:40:20 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli276@webdav-ccdcatli276Domain:AAXWULYCgzg RemoteTransferManager PoolAcceptFile 0000E4333B013CD84BEDA7069FA061F78558] Transfer failed: rejected GET: 500 Server Error
24 Jan 2022 10:40:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:40:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:41:00 (pool-atlas-tapebuffer-li308a) [] Flush of 0000E4333B013CD84BEDA7069FA061F78558 failed with: CacheException(rc=10001;msg=File not found in name space during pre-flush check.).
24 Jan 2022 10:42:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:42:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:44:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:44:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:46:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:46:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:48:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:48:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:50:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:50:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:51:56 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli276@webdav-ccdcatli276Domain:AAXWUOLi9jA RemoteTransferManager PoolAcceptFile 000071E57E4B25E845789551322EEBD3A455] Marking pool entry 000071E57E4B25E845789551322EEBD3A455 on pool-atlas-tapebuffer-li308a as BROKEN
24 Jan 2022 10:51:56 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli276@webdav-ccdcatli276Domain:AAXWUOLi9jA RemoteTransferManager PoolAcceptFile 000071E57E4B25E845789551322EEBD3A455] Transfer failed in post-processing: Checksum mismatch (expected=[1:acc4e187, 2:ea73c4856de5da08521c4c81217f4131], actual=[1:68e9b879, 2:ea73c4856de5da08521c4c81217f4131])
24 Jan 2022 10:52:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:52:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:54:39 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli223@webdav-ccdcatli223Domain:AAXWUPBbV1g RemoteTransferManager PoolAcceptFile 0000FD754E77757E4CAE817E3FB17FB22A92] Marking pool entry 0000FD754E77757E4CAE817E3FB17FB22A92 on pool-atlas-tapebuffer-li308a as BROKEN
24 Jan 2022 10:54:39 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli223@webdav-ccdcatli223Domain:AAXWUPBbV1g RemoteTransferManager PoolAcceptFile 0000FD754E77757E4CAE817E3FB17FB22A92] Transfer failed in post-processing: Checksum mismatch (expected=[1:31804d85, 2:a2338f4926e4234ff708bfccdfdb7012], actual=[1:abe6f82a, 2:a2338f4926e4234ff708bfccdfdb7012])
24 Jan 2022 10:54:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:54:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:56:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045
24 Jan 2022 10:56:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281114
24 Jan 2022 10:56:59 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli344@webdav-ccdcatli344Domain:AAXWUPiWgPA RemoteTransferManager PoolAcceptFile 000092ECA93120E64AE18D66F30C27DD2036] Marking pool entry 000092ECA93120E64AE18D66F30C27DD2036 on pool-atlas-tapebuffer-li308a as BROKEN
24 Jan 2022 10:56:59 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli344@webdav-ccdcatli344Domain:AAXWUPiWgPA RemoteTransferManager PoolAcceptFile 000092ECA93120E64AE18D66F30C27DD2036] Transfer failed in post-processing: Checksum mismatch (expected=[1:e15ab265, 2:0df6e5af8600b96da4e163fd322b32b4], actual=[1:96d43e51, 2:0df6e5af8600b96da4e163fd322b32b4])
24 Jan 2022 10:57:52 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli344@webdav-ccdcatli344Domain:AAXWUPg92aA RemoteTransferManager PoolAcceptFile 0000F53D45E25C8E455EA42537E267662CEA] Marking pool entry 0000F53D45E25C8E455EA42537E267662CEA on pool-atlas-tapebuffer-li308a as BROKEN
24 Jan 2022 10:57:52 (pool-atlas-tapebuffer-li308a) [door:webdav-ccdcatli344@webdav-ccdcatli344Domain:AAXWUPg92aA RemoteTransferManager PoolAcceptFile 0000F53D45E25C8E455EA42537E267662CEA] Transfer failed in post-processing: Checksum mismatch (expected=[1:acc4e187, 2:ea73c4856de5da08521c4c81217f4131], actual=[1:68e9b879, 2:ea73c4856de5da08521c4c81217f4131])
24 Jan 2022 10:58:43 (pool-atlas-tapebuffer-li308a) [] Trying to kill <webdav> id=67281045

> mover ls
67281114 : CANCELED : 00001FA2D5EBE8A8458584D0CAF266E860BC IoMode=[READ, WRITE, CREATE] h={org.dcache.pool.movers.RemoteHttpsDataTransferProtocol@6a87cb64} bytes=1007318600 time/sec=7200 LM=31726 si={mctape:atlas}
67281045 : CANCELED : 00003CA3E034836040D9AB387F9FB6204997 IoMode=[READ, WRITE, CREATE] h={org.dcache.pool.movers.RemoteHttpsDataTransferProtocol@6b277f19} bytes=1109713400 time/sec=7200 LM=39032 si={mctape:atlas}

Can it be related to these stuck canceled TPC transfers?

The only way to get flushing unstuck is to restart the pool. Best, Adrien

ageorget avatar Jan 24 '22 11:01 ageorget

The repeated attempts to kill those two transfers might be unrelated, but they are still a bug.

Just as extra information: once the JTM has decided to kill a mover, the mover goes into the CANCELED (sic) state. Any subsequent attempts by the JTM to kill that transfer have no effect, despite being logged.
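
As a minimal illustration of that behaviour (a sketch with invented names such as MoverSketch and JobTimeoutManagerSketch, not dCache's actual implementation): cancellation is idempotent, but the kill attempt is logged on every sweep, so the repeated "Trying to kill" lines by themselves don't imply that anything is still being retried.

import java.util.concurrent.atomic.AtomicReference;

enum MoverState { RUNNING, CANCELED, DONE }

class MoverSketch {
    private final long id;
    private final AtomicReference<MoverState> state = new AtomicReference<>(MoverState.RUNNING);

    MoverSketch(long id) { this.id = id; }

    long id() { return id; }

    // Only the first call actually transitions RUNNING -> CANCELED.
    boolean cancel() {
        return state.compareAndSet(MoverState.RUNNING, MoverState.CANCELED);
    }
}

class JobTimeoutManagerSketch {
    // Called on every sweep for a mover that exceeded its time limit.
    void kill(MoverSketch mover) {
        // The log line is emitted unconditionally, so an already-CANCELED mover
        // keeps producing "Trying to kill <webdav> id=..." entries on every sweep.
        System.out.println("Trying to kill <webdav> id=" + mover.id());
        if (!mover.cancel()) {
            // Already CANCELED: the repeated kill has no further effect.
        }
    }
}

public class KillSweepDemo {
    public static void main(String[] args) {
        MoverSketch mover = new MoverSketch(67281114L);
        JobTimeoutManagerSketch jtm = new JobTimeoutManagerSketch();
        jtm.kill(mover); // first sweep: really cancels the mover
        jtm.kill(mover); // later sweeps: only the log line repeats
    }
}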

So, the question is: why are these transfers not being cancelled correctly?

There is (perhaps) an ancillary question: why should transfers in the CANCELED state affect storing data onto tape -- assuming that is what is causing the store operations to fail.

Adrien, do you see additional log entries from subsequent scheduled flush/store operations?

paulmillar avatar Jan 25 '22 10:01 paulmillar

Hi Paul,

I couldn't find any additional log entries. What's strange is that the Nearline storage manager indicates 2 active store requests in the pool's info output even though no store requests are actually in progress (and hsm c:puts=1). Restarting the pool dropped the canceled transfers, removed these 2 files, and flushing resumed:

24 Jan 2022 12:30:25 (pool-atlas-tapebuffer-li308a) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
24 Jan 2022 12:30:25 (pool-atlas-tapebuffer-li308a) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/data/pool-atlas-tapebuffer-li308a/pool/data;meta=/data/pool-atlas-tapebuffer-li308a/pool/meta]]].
24 Jan 2022 12:30:26 (pool-atlas-tapebuffer-li308a) [] Recovering 00003CA3E034836040D9AB387F9FB6204997...
24 Jan 2022 12:30:26 (pool-atlas-tapebuffer-li308a) [] Recovering: Fetched storage info for 00003CA3E034836040D9AB387F9FB6204997 from name space.
24 Jan 2022 12:30:26 (pool-atlas-tapebuffer-li308a) [] Recovering: Removed 00003CA3E034836040D9AB387F9FB6204997 because name space entry was deleted.
24 Jan 2022 12:30:26 (pool-atlas-tapebuffer-li308a) [] Recovering 00001FA2D5EBE8A8458584D0CAF266E860BC...
24 Jan 2022 12:30:26 (pool-atlas-tapebuffer-li308a) [] Recovering: Fetched storage info for 00001FA2D5EBE8A8458584D0CAF266E860BC from name space.
24 Jan 2022 12:30:26 (pool-atlas-tapebuffer-li308a) [] Recovering: Removed 00001FA2D5EBE8A8458584D0CAF266E860BC because name space entry was deleted.
24 Jan 2022 12:30:26 (pool-atlas-tapebuffer-li308a) [] Pool mode changed to enabled

I really don't know if there's a link between these canceled transfers and the flush operations being stuck. Currently I see some CANCELED transfers on the pool, yet flushing is running as normal...

ageorget avatar Jan 25 '22 11:01 ageorget

Hi @ageorget ,

if you check those CANCELED entries, are they all for removed files?

kofemann avatar Jan 26 '22 13:01 kofemann

Hi @kofemann,

If I check these 2 canceled entries, the requests failed with

error 24 java.io.IOException: timed out while waiting for mover reply

reported by RemoteTransferManager@srmDomain.

Webdav door log:

January 24th 2022, 01:39:37.000 door:webdav-ccdcatli276@webdav-ccdcatli276Domain:AAXWSTQJ1tA RemoteTransferManager PoolAcceptFile 00001FA2D5EBE8A8458584D0CAF266E860BC,message=Creating new entry for 00001FA2D5EBE8A8458584D0CAF266E860BC

January 23rd 2022, 23:37:52.000 door:webdav-ccdcatli344@webdav-ccdcatli344Domain:AAXWR4CYFZA RemoteTransferManager PoolAcceptFile 00003CA3E034836040D9AB387F9FB6204997,message=Creating new entry for 00003CA3E034836040D9AB387F9FB6204997

doorinfo:

-[ RECORD 1 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
client         | lcg-se1.sfu.computecanada.ca
mappedgid      | 124
mappeduid      | 3327
owner          | atlagrid
path           | /pnfs/in2p3.fr/data/atlas/atlasmctape/mc16_13TeV/AOD/e8338_e7400_s3126_r10201_r10210/mc16_13TeV.700409.ShEG_2211_Wenu_01NLO_unbiased_SMT.merge.AOD.e8338_e7400_s3126_r10201_r10210_tid27555308_00/AOD.27555308._000023.pool.root.1_1642984770
queuedtime     | 46
connectiontime | 7200046
action         | request
cellname       | RemoteTransferManager@srmDomain
datestamp      | 2022-01-24 01:39:37.584+01
errorcode      | 24
errormessage   | java.io.IOException: timed out while waiting for mover reply
pnfsid         | 00001FA2D5EBE8A8458584D0CAF266E860BC
transaction    | door:RemoteTransferManager@srmDomain:1642984777584-9856732
fqan           |

-[ RECORD 1 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
client         | lcg-se1.sfu.computecanada.ca
mappedgid      | 124
mappeduid      | 3327
owner          | atlagrid
path           | /pnfs/in2p3.fr/data/atlas/atlasmctape/mc16_13TeV/AOD/e8338_e7400_s3126_r10201_r10210/mc16_13TeV.700409.ShEG_2211_Wenu_01NLO_unbiased_SMT.merge.AOD.e8338_e7400_s3126_r10201_r10210_tid27555308_00/AOD.27555308._000023.pool.root.1_1642977465
queuedtime     | 50
connectiontime | 7200051
action         | request
cellname       | RemoteTransferManager@srmDomain
datestamp      | 2022-01-23 23:37:52.047+01
errorcode      | 24
errormessage   | java.io.IOException: timed out while waiting for mover reply
pnfsid         | 00003CA3E034836040D9AB387F9FB6204997
transaction    | door:RemoteTransferManager@srmDomain:1642977472047-9841701

webdav.mover.timeout is set to 7200000. Sometimes, during heavy tape import activity and mostly when transfers come from the US where the network is often slow, the pools fail to respond to the door within the 7200 s. But that should not be a problem for dCache, right?
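
For completeness, this is roughly how that value is set on our side (an illustrative configuration excerpt; only the property name already quoted above is assumed, and the comment is just my reading of it):

# dcache.conf / layout excerpt (illustrative)
# 7200000 ms = 2 h; when a pool does not report back within this time the door
# records "timed out while waiting for mover reply" (errorcode 24), as seen above.
webdav.mover.timeout=7200000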

ageorget avatar Jan 26 '22 14:01 ageorget

Ah, ok. I was thinking those were canceled flushes...

kofemann avatar Jan 26 '22 17:01 kofemann

Hi,

Ok, I think the CANCELED transfers are not related to this issue. I had the same problem this morning on another ATLAS tape import pool.

The queue for flush still indicates 1 active request but there's no flush process in progress. No CANCELED transfers reported by the mover ls command this time.

flush ls
Class                 Active   Error  Last/min  Requests    Failed
datatape:atlas@osm         1       0      1664       763         0

mover ls
nothing

queue ls queue -l
               Class@Hsm : datatape:atlas@osm
 Expiration rest/defined : -107010 / 0   seconds
 Pending    rest/defined : 772 / 0
 Size       rest/defined : 6113343353791 / 0
 Active Store Procs.     : 1
  000009D0521F6FDE4495946B9DF8FF76F14B  osm     datatape:atlas      -                   
  0000A4B3EFD6229A498A8E17D022DBE975CC  osm     datatape:atlas      -                   
  000095FC79E6D3D44A34B63079CCCCB8A504  osm     datatape:atlas      -                   
  000081D950628F7947F59C7AA438DAFA8206  osm     datatape:atlas      -                   
...

The pool pinboard:
28 Jan 2022 10:25:14 [pool-atlas-tapebuffer-li419a-1222] [6wc:J6E admin] Flushing SCI=datatape:atlas@osm;def=false;exp=0;pend=0;maxTotal=0;waiting=766

A manual flush works:

> flush pnfsid 00006A099B706D8342DEB150BD2722639E8F
28 Jan 2022 10:29:23 [pool-32-thread-516] [6wc:J6E admin] Flushed 00006A099B706D8342DEB150BD2722639E8F to nearline storage: osm://osm/?store=cchpssdcache:/hpss5/dcache/atlas/datatape/2022/01/00006A099B706D8342DEB150BD2722639E8F&group=atlagrid&bfid=

But flushing the class does not:

flush class osm datatape:atlas
(3) java.lang.IllegalArgumentException: Is already active. from flush class

The pool log file with a thread dump: https://mycore.core-cloud.net/index.php/s/w3gXOmlasrC8m7W

I also tried to increase c:puts from 1 to 5 to check whether a single stuck flush request was blocking the pool's flushing, but nothing changed.

Tell me if you want me to check something before I restart the pool. Adrien

ageorget avatar Jan 28 '22 09:01 ageorget

Hi, the same issue has happened many times on the other dCache instance (6.2.38) during a large tape import campaign by CTA. The symptoms are the same: flushing is stuck with some supposedly active store jobs and no way to find out what is blocked or why:

--- storagehandler (Nearline storage manager) ---
 Restore Timeout  : 86400 seconds
   Store Timeout  : 14400 seconds
  Remove Timeout  : 14400 seconds
  Job Queues (active/queued)
    to store   2/0
    from store 0/0
    delete     0/0

Pool log file: https://mycore.core-cloud.net/index.php/s/A69mFQUV1OwIUHN

ageorget avatar Feb 08 '22 09:02 ageorget

@ageorget do you see those flush processes on the pool?

kofemann avatar Feb 08 '22 10:02 kofemann

No, there are no flush processes left on the pool. I didn't make any changes to our HSM script; the only change was the upgrade from 5.2 to 6.2. When a flush occasionally fails for some reason (like an Input/output error), this is handled correctly by the HSM script and dCache. Another point: CTA only uses SRM/GridFTP for its transfers, so this is not related to a particular protocol.

ageorget avatar Feb 08 '22 12:02 ageorget

In the log I can see the following records:

08 Feb 2022 02:06:37 (pool-cta-tape-li404a) [5de80647-3a9a-4f3e-ba8a-0b2a4d7cec81 000005515735DB2942C1BA5021371B6476C5] HSM script failed (script reported: 1: sh: line 0: kill: (28254) - No such process;)
08 Feb 2022 02:06:37 (pool-cta-tape-li404a) [] Flush of 000005515735DB2942C1BA5021371B6476C5 failed with: CacheException(rc=1;msg=HSM script failed (script reported: 1: sh: line 0: kill: (28254) - No such process;)).

Is it one of those files?

kofemann avatar Feb 08 '22 13:02 kofemann

I don't think so. I can't see any canceled or active mover/store requests, and when I restarted the pool, no broken files or anything else showed up:

08 Feb 2022 10:48:48 (pool-cta-tape-li404a) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
08 Feb 2022 10:48:48 (pool-cta-tape-li404a) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
08 Feb 2022 10:48:48 (pool-cta-tape-li404a) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/data/pool-cta-tape-li404a/pool/data;meta=/data/pool-cta-tape-li404a/pool/meta]]].
08 Feb 2022 10:48:51 (pool-cta-tape-li404a) [] Pool mode changed to enabled

The pool files are OK, no file is in an error state. Since the restart I have already had some failed flushes (as always), but nothing is stuck. Everything runs well:

08 Feb 2022 10:48:51 (pool-cta-tape-li404a) [] Pool mode changed to enabled
08 Feb 2022 11:06:17 (pool-cta-tape-li404a) [c3c3cb63-9888-40d7-b5d1-67bb5a01017f 0000C44963A6900C4CE28BF72997A08E2B19] HSM script failed (script reported: 1: sh: line 0: kill: (138790) - No such process;)
08 Feb 2022 11:06:17 (pool-cta-tape-li404a) [] Flush of 0000C44963A6900C4CE28BF72997A08E2B19 failed with: CacheException(rc=1;msg=HSM script failed (script reported: 1: sh: line 0: kill: (138790) - No such process;)).
08 Feb 2022 11:10:15 (pool-cta-tape-li404a) [5fb852a8-6c68-4248-b3e3-66b5622a2b0e 0000AA137E63B96D4ECA9715C006C4506C6D] HSM script failed (script reported: 1: sh: line 0: kill: (141912) - No such process;)
08 Feb 2022 11:10:15 (pool-cta-tape-li404a) [] Flush of 0000AA137E63B96D4ECA9715C006C4506C6D failed with: CacheException(rc=1;msg=HSM script failed (script reported: 1: sh: line 0: kill: (141912) - No such process;)).
08 Feb 2022 11:32:31 (pool-cta-tape-li404a) [23af9a43-9b5a-41c2-aa86-5b2a2b719d6c 000012BB532E18634477997E90E8780C6F52] HSM script failed (script reported: 1: sh: line 0: kill: (158822) - No such process;)
08 Feb 2022 11:32:31 (pool-cta-tape-li404a) [] Flush of 000012BB532E18634477997E90E8780C6F52 failed with: CacheException(rc=1;msg=HSM script failed (script reported: 1: sh: line 0: kill: (158822) - No such process;)).
--- storagehandler (Nearline storage manager) ---
 Restore Timeout  : 86400 seconds
   Store Timeout  : 14400 seconds
  Remove Timeout  : 14400 seconds
  Job Queues (active/queued)
    to store   0/0
    from store 0/0
    delete     0/0


[ccdcamcli10] (pool-cta-tape-li404a@ccdcacli404-pool-cta-tape-li404a-Domain) ageorget > flush ls
Class                 Active   Error  Last/min  Requests    Failed

ageorget avatar Feb 08 '22 14:02 ageorget

I downgraded some tape import pools to 6.2.17 to check whether the new QueueStat mechanism introduced in 6.2.18 [1] is responsible for this bug. I observed that only tape pools with many (>1000) flush requests in the queue are affected by this problem; currently that means ATLAS and CTA, which are running a lot of tape imports. This is a major issue for us, as we have to restart some stuck tape pools every day before they fill up.

[1] https://github.com/dcache/dcache/commit/d98a894414d504d3b9e4a33dff683ed85efad9b1

ageorget avatar Feb 10 '22 09:02 ageorget

Thanks for the info. I don't think that the QueueStat change affects this, but you never know... Can you produce a thread dump when a pool runs into such a situation?

kofemann avatar Feb 10 '22 09:02 kofemann

I sent you a pool log file with a thread dump here: https://github.com/dCache/dcache/issues/6426#issuecomment-1024034958

Here is a new one for an ATLAS tape import pool: https://box.in2p3.fr/index.php/s/PaHpfyXekisxCqk

--- storagehandler (Nearline storage manager) ---
  Job Queues (active/queued)
    to store   2/0

I also found this error related to QueueStat, but it doesn't seem to be linked to that bug:

08 Feb 2022 04:27:42 (pool-atlas-tapebuffer-li308a) [] Unexpected failure during state change notification
java.lang.IllegalArgumentException: null
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:121)
        at org.dcache.pool.nearline.NearlineStorageHandler$QueueStat.<init>(NearlineStorageHandler.java:150)
        at org.dcache.pool.nearline.NearlineStorageHandler$AbstractRequest.incCancelFromActive(NearlineStorageHandler.java:675)
        at org.dcache.pool.nearline.NearlineStorageHandler$AbstractRequest.cancel(NearlineStorageHandler.java:581)
        at org.dcache.pool.nearline.NearlineStorageHandler$AbstractRequestContainer.cancel(NearlineStorageHandler.java:794)
        at org.dcache.pool.nearline.NearlineStorageHandler.stateChanged(NearlineStorageHandler.java:464)
        at org.dcache.pool.repository.v5.StateChangeListeners.lambda$stateChanged$0(StateChangeListeners.java:60)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
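
If I read the trace correctly (this is only a speculative Java sketch with invented names, not the real NearlineStorageHandler code), QueueStat looks like an immutable (queued, active) counter pair whose constructor rejects negative values; a cancel accounted against a request that is no longer counted as active would then throw from the state-change listener and leave the "to store" gauge wrong:

import com.google.common.base.Preconditions;

public final class QueueStatSketch {
    private final int queued;
    private final int active;

    QueueStatSketch(int queued, int active) {
        // Mirrors the Preconditions.checkArgument(...) frame in the trace:
        // a negative count makes the constructor throw IllegalArgumentException.
        Preconditions.checkArgument(queued >= 0 && active >= 0);
        this.queued = queued;
        this.active = active;
    }

    QueueStatSketch cancelFromActive() {
        return new QueueStatSketch(queued, active - 1);
    }

    @Override
    public String toString() {
        return active + "/" + queued;
    }

    public static void main(String[] args) {
        QueueStatSketch toStore = new QueueStatSketch(0, 1);
        toStore = toStore.cancelFromActive();  // fine: 1 active -> 0 active
        toStore = toStore.cancelFromActive();  // a double-accounted cancel throws here;
                                               // the listener dies and the gauge is never corrected
    }
}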

ageorget avatar Feb 10 '22 09:02 ageorget

Well, I haven't seen any flush problem with the CTA pools downgraded to 6.2.17, and their tape migration activity is still substantial (71 TB and 15k files migrated in the last 24 h).

ageorget avatar Feb 11 '22 12:02 ageorget

Thanks, good to know. It's not quite clear to me how this change affects flushing, but yes, it might be a race or a missing notification. :man_shrugging:

kofemann avatar Feb 11 '22 14:02 kofemann

@ageorget can we do an experiment: start a single flush, then cancel it with st kill, and check with st ls that it is gone and that the process is gone as well.

kofemann avatar Feb 13 '22 18:02 kofemann

It looks good:

[ccdcamcli06] (pool-cms-hpssdata-li275a@ccdcatli275-pool-cms-hpssdata-li275a-Domain) ageorget > st ls
dfa028f5-b645-4964-815c-6967f37a59cf ACTIVE Mon Feb 14 14:52:06 CET 2022 Mon Feb 14 14:52:06 CET 2022 000073A538D5D9404798988D6D5D9DD26575 hpssdata:cms
[ccdcamcli06] (pool-cms-hpssdata-li275a@ccdcatli275-pool-cms-hpssdata-li275a-Domain) ageorget > st kill 000073A538D5D9404798988D6D5D9DD26575
Kill initialized
[ccdcamcli06] (pool-cms-hpssdata-li275a@ccdcatli275-pool-cms-hpssdata-li275a-Domain) ageorget > st ls
[ccdcamcli06] (pool-cms-hpssdata-li275a@ccdcatli275-pool-cms-hpssdata-li275a-Domain) ageorget > flush ls
Class                 Active   Error  Last/min  Requests    Failed
hpssdata:cms@osm           0       1         0         1         0

14 Feb 2022 14:49:06 [flush] [] Flushing SCI=hpssdata:cms@osm;def=false;exp=0;pend=0;maxTotal=0;waiting=1
14 Feb 2022 14:49:20 [pool-32-thread-1053] [] Flushed 0000859002D31B2A4D1A9B0B09AC161CF143 to nearline storage: osm://osm/?store=cchpssdcache:/hpss3/dcache/cms/hpssdata/2022/02/0000859002D31B2A4D1A9B0B09AC161CF143&group=cmsgrid&bfid=
14 Feb 2022 14:52:06 [flush] [] Flushing SCI=hpssdata:cms@osm;def=false;exp=0;pend=0;maxTotal=0;waiting=1
14 Feb 2022 14:52:10 [pool-32-thread-1054] [dfa028f5-b645-4964-815c-6967f37a59cf 000073A538D5D9404798988D6D5D9DD26575] HSM script failed (script reported: 1: )
14 Feb 2022 14:52:10 [pool-32-thread-1054] [] Flush of 000073A538D5D9404798988D6D5D9DD26575 failed with: CacheException(rc=10006;msg=Flush was cancelled.).

And no flush process is left.

ageorget avatar Feb 14 '22 13:02 ageorget

Hi, any news about this issue? The WLCG Tape Challenge will happen in the middle of March, and I'm considering downgrading all dCache tape import pools to 6.2.17 to avoid any trouble. The CTA pools downgraded to 6.2.17 last month have never had any flush issue, unlike the other pools on 6.2.35.

ageorget avatar Mar 02 '22 09:03 ageorget

Hi @ageorget ,

we are working on this. Unfortunately, all attempts to reproduce it in a lab environment have failed so far.

kofemann avatar Mar 02 '22 09:03 kofemann

Hi @ageorget ,

you have probably already mentioned this, but I can't find it. When flushes are stuck, what is the output of the st ls command and of the pool's info command? Does the number of entries returned by st ls match the number shown as "to store"? And in which state are the stores?

kofemann avatar Mar 07 '22 10:03 kofemann

Hi @kofemann, when flushing is stuck the st ls output is empty, while the info command (like flush ls) shows inconsistent values for "to store" (hsm c:puts=1 but the storagehandler reports more than 1 active store request):

--- storagehandler (Nearline storage manager) ---
 Restore Timeout  : 86400 seconds
   Store Timeout  : 14400 seconds
  Remove Timeout  : 14400 seconds
  Job Queues (active/queued)
    to store   3/0
    from store 0/0
    delete     0/0

ageorget avatar Mar 07 '22 11:03 ageorget

Thanks. Looks like I have a reproducer. 🤞

kofemann avatar Mar 07 '22 11:03 kofemann

@ageorget It's probably not a 100% fix for the issues at your site, but at least a partial fix is ready. Are you able to run one pool with a patched rpm?

kofemann avatar Mar 11 '22 14:03 kofemann

@kofemann Yes sure.

ageorget avatar Mar 14 '22 10:03 ageorget

@ageorget great! Can you give https://sas.desy.de/index.php/s/9QjHskp7xSrEsNb a try?

kofemann avatar Mar 14 '22 12:03 kofemann

I just deployed it on 2 ATLAS/CMS tape import pools. I'll keep you informed. Thanks for the patch!

ageorget avatar Mar 14 '22 14:03 ageorget

@ageorget Any update? Do you still see flushes getting stuck?

kofemann avatar Mar 21 '22 14:03 kofemann

Hi @kofemann, I haven't seen any problem with the pools running your patch, so I think it fixes the bug.

ageorget avatar Mar 21 '22 16:03 ageorget

Hi @ageorget ,

Today's (15.03.2022) series of bug-fix releases includes the fix. I am closing this issue. Feel free to reopen it if you observe the problem again (hopefully not).

kofemann avatar Mar 25 '22 22:03 kofemann