TAPE pool full of precious data and no storage request to HSM
Hello,
We have recently observed that write pools interacting with the TAPE system contain a large quantity of files in the precious state. Because the storage requests are not active, the pools accumulate files and consume space without flushing to tape. dCache uses the ENDIT provider to issue store and stage requests to the TAPE system.
For example:
Storage class Total: size, files; Precious: size, files; Sticky: size, files; others: size, files
bnlt1d0:BNLT1D0@osm 313GiB 90 18GiB 2 0B 0 296GiB 88
MCTAPE:MC@osm 20TiB 13086 19TiB 11839 0B 0 1.1TiB 1247
Only for files recently stored on the write pool is there a subsequent storage request.
st ls
c6e46e61-9436-4e4d-bee7-0fe830c7a32e ACTIVE Fri Apr 05 12:31:40 EDT 2024 Fri Apr 05 12:31:40 EDT 2024 0000AFD25C38F32B42A29D57AC8E7A444DE6 bnlt1d0:BNLT1D0
The flush queue
flush ls
Class Active Error Last/min Requests Failed
bnlt1d0:BNLT1D0@osm 0 0 0 1 0
MCTAPE:MC@osm 1 0 3814 11838 0
These are the current timeout limits:
--- flush (Controller for centralising flushing) ---
Flush interval : 60000 ms
Maximum classes flushing : 1000
Minimum flush delay on error : 60000 ms
Next flush : 2024-04-05 12:32:40.418 (17 s in the future)
--- storagehandler (Nearline storage manager) ---
Restore Timeout : 518400 seconds
Store Timeout : 14400 seconds
Remove Timeout : 14400 seconds
Job Queues (active/queued)
to store 2/0
from store 0/0
delete 0/0
Is the current timeout for files waiting to be flushed too short?
After going over the ~11845 files and issuing the flush pnfsid command for each, the storage (ST) request is created and the flush appears to activate.
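For reference, generating those per-file commands can be scripted rather than typed by hand. The following is only a sketch: it assumes the output of rep ls -l=p is piped in, that the pnfsid is the first whitespace-separated field and is a 36-character hex string, and gen_flush_cmds is just an illustrative name. The resulting lines would still have to be fed back into the pool's admin shell.

```shell
# Turn each precious replica listed by "rep ls -l=p" into a
# "flush pnfsid <id>" admin command. Assumptions: the pnfsid is the
# first field on each line and is 36 hex characters long.
gen_flush_cmds() {
  awk 'length($1) == 36 && $1 ~ /^[0-9A-F]+$/ { print "flush pnfsid " $1 }'
}

# Example with one captured line (exact "rep ls" layout is an assumption):
printf '%s\n' '0000AFD25C38F32B42A29D57AC8E7A444DE6 <P-------(0)[0]> 1048576' |
  gen_flush_cmds
# prints: flush pnfsid 0000AFD25C38F32B42A29D57AC8E7A444DE6
```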
- The interaction between the storagehandler and the flush controller is not clear to me; it seems that even if a file is marked as precious, the flush is not created unless there is an ACTIVE storage request.
- Is there a bulk command to issue storage requests for all files of a given storage class? Currently flush pnfsid is the only way I was able to reactivate the storage request.
Could you please advise?
All the best, Carlos
Hi Carlos,
As mentioned in today's T1 support meeting, I see two possible issues for you here:
- Your requests have been disabled/suspended for some reason
- Your pool tries to batch the flush tasks, without reaching any of the batch limits
To check the first, run queue ls -l on your pool and see whether any "deactivated requests" are listed. If so, then I presume that the pool attempted to flush these files but encountered an error, which made the pool suspend the store task. You can reactivate suspended tasks with queue activate.
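An admin-shell sequence for that check could look like the following (the pool cell name is just an example; see the pool's help output for the exact arguments queue activate takes):

```
\c dc253_10        # connect to the pool cell (name is an example)
queue ls -l        # look for "deactivated requests" entries
queue activate     # reactivate suspended store tasks (see "help queue activate")
```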
On the second part, queue ls classes -l will show you all (current) classes the pool tracks and the individual metrics that matter for batching. The parameters for that can be configured through queue define class.
In our setup we don't want the pools to do any batching, which is why we have included the line queue define class -open osm *. This tells the pool to flush files of all classes immediately, without any batching.
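If you want the same behaviour, the line has to be made persistent in the pool's setup, e.g. (sketch; relies on the pool's standard save command to persist the setup):

```
queue define class -open osm *
save
```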
Hope this helps you,
Xavier.
Hello @XMol
Thank you for your comments and explanations, also mentioned in today's T1 dev meeting ;).
It seems that we are letting the pools batch; this example shows the pools in a steady state:
[dccore02] (dc253_10@dc253tenDomain) admin > rep ls -l=p -s
Storage class Total: size, files; Precious: size, files; Sticky: size, files; others: size, files
bnlt1d0:BNLT1D0@osm 2.9TiB 958 1.9TiB 543 0B 0 957GiB 415
MCTAPE:MC@osm 17TiB 8566 3.7TiB 2505 0B 0 13TiB 6061
[dccore02] (dc253_10@dc253tenDomain) admin > queue ls classes -l
Class@Hsm : bnlt1d0:BNLT1D0@osm
isOpen : false
Expiration rest/defined : -130035 / 0 seconds
Pending rest/defined : 543 / 0
Size rest/defined : 2140361670867 / 0
Active Store Procs. : 543
Class@Hsm : MCTAPE:MC@osm
isOpen : false
Expiration rest/defined : -131830 / 0 seconds
Pending rest/defined : 2505 / 0
Size rest/defined : 4122633706385 / 0
Active Store Procs. : 1579
[dccore02] (dc253_10@dc253tenDomain) admin > flush ls
Class Active Error Last/min Requests Failed
bnlt1d0:BNLT1D0@osm 543 0 219 543 0
MCTAPE:MC@osm 1578 0 288 2504 0
Currently, for MCTAPE:MC@osm it seems that it is batching. Do you know where the batching limit is set? Is dCache doing this automatically?
All the best, Carlos
@cfgamboa said:
> For MCTAPE:MC@osm it seems that it is batching, do you know where the batching limit is set? Is dCache doing this automatically?
What do you mean? You showed the limits yourself!
- Both bnlt1d0:BNLT1D0@osm and MCTAPE:MC@osm storage classes have the same flush configuration.
- Both do batching (isOpen : false).
- Expiration limit is set to 0, which means flush tasks are not batched by lifetime.
- Pending limit is set to 0, which means flush tasks are not batched by file number.
- Size limit is set to 0, which means flush tasks are not batched by total data volume in the class.
> Is dCache doing this automatically?
Yes, it is. This configuration is the default for all classes that dCache implicitly creates as soon as data come onto the pool.
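If you do want bounded batching rather than the implicit defaults, the limits shown by queue ls classes -l can be set per class. The sketch below is an assumption on my side: the flag names -expire, -pending and -total are my reading of the Expiration/Pending/Size metrics above, so verify them with help queue define class on your pool before using them.

```
# Hypothetical: flush MCTAPE:MC when a task is 1 h old, or 1000 files,
# or ~1 TiB are pending, whichever comes first.
queue define class -expire=3600 -pending=1000 -total=1099511627776 osm MCTAPE:MC
save
```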
Ciao,
Xavier.
Hey @XMol
Thank you!
Carlos