Only one directory listing active, the rest queued; pnfsmanager.limits.list-threads broke
Dear dCache devs,
I have also posted this on the user forum, but there was no reply. However, this issue is affecting our users.
We have configured the following value so that 10 directory listings can be processed simultaneously instead of the default 2:
pnfsmanager.limits.list-threads = 10
And this used to work. But now when I look at the list activity, I see only one request active, and the others queued.
[root@dcmain /etc/dcache]# dcache-admin-command PnfsManager 'show list activity'
QUEUED REQUESTS
SOURCE USER PATH
webdav2880-badger14@webdav2880-badger14Domain buswift /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloudeb7
webdav2880-badger14@webdav2880-badger14Domain buswift /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloudeb7
webdav2880-badger14@webdav2880-badger14Domain buswift /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloudee4
webdav2880-badger14@webdav2880-badger14Domain buswift /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloudee4
webdav2880-badger14@webdav2880-badger14Domain buswift /pnfs/grid.sara.nl/data/swiftbackup/avans
webdav443-seacow14@webdav443-seacow14Domain tropl1b-proc /pnfs/grid.sara.nl/data/knmi-tropomi/disk/nadc/archive/S5P_ICM_CA_UVN_TRIG/2023/07/07
ACTIVE REQUESTS
SOURCE USER PATH
webdav2880-badger14@webdav2880-badger14Domain buswift /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751
I’m afraid this causes problems for our users: see the attached graph. We have some extremely large directories (3 million objects), so we absolutely need some parallelism.
Cheers, Onno
We're running 8.2.24.
Looking at the graphs, the queueing seems to have started when we upgraded from 8.2.16 to 8.2.24, at the end of June. Java remained at version 11.0.18.
This looks like it could be related to a8a6912fdb, @DmitryLitvintsev. The commit was included in 8.2.23.
Hi Onno,
Does it cause problems for the users, or do you just think it is causing problems?
Here is the description of the patch
pnfsmanager: process listings of the same path sequentially, implement list folding
Motivation:
Multiple batch jobs executing listings of the same directories, especially if the directories contain many entries, quickly exhaust the active threads, each thread executing the same listing, denying listing to all other customers and creating significant load on the namespace server (and DB). Clients time out and retry, creating even more load on dCache.
Modification:
The problem is solved by implementing PnfsListDirectoryMessage folding
and dispatching list requests to multiple queues (the number of queues is
configurable) based on the directory path hashCode. Each queue is served by a single
thread. This guarantees that listings of the same path are queued to
the same queue and that only one listing of a unique path can be active.
Once a listing is complete, the queue is scanned for all other requests for
the same path and these messages are folded (that is, their payload is
populated from the just-completed listing) and routed back to the sender without
ever hitting the database.
Ancillary changes include: instead of a hardcoded time in the future at which
the quota and FS stat scans start, use the already available settings for the
duration between runs. This is to avoid triggering these scans when PnfsManager is
restarted often (when testing, for instance). A few "final" qualifiers were also added
where suggested by the IDE.
Result:
Lists of large directories no longer hog PnfsManage denying service to all
other list calls. Additionally observed significnat reduction of CPU and
IO load on chimera DB host. Patch has been running in production at Fermilab
and has shown to solve the issue of listings timing out.
patch: https://rb.dcache.org/r/13992/
The way it works is that when you have an active listing and another listing for the same directory queued, the queued ones get the result of the active one once it has completed. So there still is parallelism.
It works very well at Fermilab and in fact fixed a lot of our listing hangs (when multiple large directories are listed at the same time).
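To make the folding idea concrete, here is a minimal, illustrative Java sketch of it; this is not dCache's actual PnfsManager code, and the class and field names are made up:

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

public class ListFoldingSketch {

    static class ListRequest {
        final String path;
        List<String> entries;                 // payload, filled in when the request is answered
        ListRequest(String path) { this.path = path; }
    }

    // Called after the active listing ('completed') has finished: every queued
    // request for the same path reuses the result instead of hitting the DB again.
    static void foldQueued(Queue<ListRequest> queue, ListRequest completed) {
        for (ListRequest queued : queue) {
            if (queued.path.equals(completed.path)) {
                queued.entries = completed.entries;   // fold: copy the payload
                // in dCache the folded message would now be routed back to its sender
            }
        }
        queue.removeIf(r -> r.entries != null);       // folded requests leave the queue
    }

    public static void main(String[] args) {
        Queue<ListRequest> queue = new ArrayDeque<>();
        queue.add(new ListRequest("/pnfs/example.org/data/big-dir"));
        queue.add(new ListRequest("/pnfs/example.org/data/other-dir"));

        ListRequest active = new ListRequest("/pnfs/example.org/data/big-dir");
        active.entries = List.of("file-a", "file-b"); // result of the one real DB listing

        foldQueued(queue, active);
        System.out.println("still queued: " + queue.size());   // prints 1: only other-dir remains
    }
}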
I use:
pnfsmanager.limits.list-threads = 33
Try that :).
This is how this typically looks on our system:
QUEUED REQUESTS
SOURCE USER PATH
GFTP3-fndca4b-AAYBUPdNoBg@gridftp3-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP9-fndca4b-AAYBUPddRAA@gridftp9-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP10-fndca4b-AAYBUPdf58g@gridftp10-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP4-fndca4b-AAYBUPdxlyA@gridftp4-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox/0/1/1
Xrootd01-fndca4b@xrootd01-fndca4bDomain <unknown> /pnfs/fnal.gov/usr/uboone/persistent/stash/uboonebeam/beam_data/bnb_data/00/02/44/83
GFTP1-fndca4b-AAYBUPeJhCA@gridftp1-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP4-fndca4b-AAYBUPeOv_g@gridftp4-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP9-fndca4b-AAYBUPeY9UA@gridftp9-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP8-fndca4b-AAYBUPe9HxA@gridftp8-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP4-fndca4b-AAYBUPgBhsg@gridftp4-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP8-fndca4b-AAYBUPgmmvg@gridftp8-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP0-fndca4b-AAYBUPgwP7g@gridftp0-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP3-fndca4b-AAYBUPhFMwA@gridftp3-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP6-fndca4b-AAYBUPhFrBg@gridftp6-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP5-fndca4b-AAYBUPhK4CA@gridftp5-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP6-fndca4b-AAYBUPhPHhA@gridftp6-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP8-fndca4b-AAYBUPhkFUA@gridftp8-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
Xrootd01-fndca4b@xrootd01-fndca4bDomain <unknown> /pnfs/fnal.gov/usr/minerva/persistent/stash/pidDB/data/00016730
WebDAV00-fndcadoor02@webdav-fndcadoor02Domain uboonepro /pnfs/fnal.gov/usr/uboone/scratch/uboonepro/dropbox/merge/data/uboone/root-tuple/38
WebDAV00-fndcadoor03@webdav-fndcadoor03Domain uboonepro /pnfs/fnal.gov/usr/uboone/scratch/uboonepro/dropbox/merge/overlay/uboone/root-tuple/2
ACTIVE REQUESTS
SOURCE USER PATH
WebDAV-fndca4b-2@webdavDomain1 131.225.240.62 /pnfs/fnal.gov/usr/mu2e/resilient/jobsub_stage
WebDAV00-fndcadoor02@webdav-fndcadoor02Domain mnieslon /pnfs/fnal.gov/usr/annie/persistent/users/moflaher/g4dirt_vincentsgenie/BNB_Water_10k_22-05-17
WebDAV00-fndcadoor02@webdav-fndcadoor02Domain uboonepro /pnfs/fnal.gov/usr/uboone/scratch/uboonepro/dropbox/merge/data/uboone/root-tuple/68
In the above, the listing of /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox is queued, but once the first of them completes, all of them complete immediately.
Another comment. The patch did not make it so that only one listing is running. The algorithm is as follows: there is a list of threads, and the listing of a path is dispatched to a thread number defined by:
N = hashcode(path) % pnfsmanager.limits.list-threads
So it is not the case that there is always just one listing thread running. Yes, there could be situations where a large directory has the same hashcode as a small directory, and in that case the small directory would have to wait for the large one to complete. But as I said, we have seen a very tangible improvement at Fermilab.
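Purely for illustration, the dispatch step could look like the Java sketch below; the real PnfsManager code may differ in details, for example in how negative hash codes are handled:

public class ListDispatchSketch {

    // N = hashcode(path) % pnfsmanager.limits.list-threads
    // Math.floorMod keeps the index non-negative even when hashCode() is negative.
    static int queueIndexFor(String path, int listThreads) {
        return Math.floorMod(path.hashCode(), listThreads);
    }

    public static void main(String[] args) {
        int listThreads = 10;   // e.g. pnfsmanager.limits.list-threads = 10
        String[] paths = {
            "/pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751",
            "/pnfs/grid.sara.nl/data/lofar/ops/projects/lc0_003/151812"
        };
        for (String p : paths) {
            // two listings can block each other only if they land on the same queue index
            System.out.println("queue " + queueIndexFor(p, listThreads) + "  " + p);
        }
    }
}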
If this continues to give you trouble, we may make this functionality optional (yet another variable).
Also... one thing to note. I have been doing the same kind of watching of "show list activity" and I must say that it is a little deceptive, particularly the ACTIVE listing. I have noticed that you only "catch" listings in ACTIVE if they last at least a few seconds.
E.g., if you think there is only one listing running, try timing a listing while you see one ACTIVE. In my experience you get the result right away and never see it become active. You need to use any protocol other than NFS, since NFS queries the database directly, bypassing PnfsManager (I casually use uberftp ls for this sort of thing).
P.S.: we generally see more queueing than before, but we do not see one user causing a DoS by running multiple ls commands (via batch jobs) on large directories (those would starve all available threads). The behavior now: if such a user hits the system, his ls requests will queue up until the first completes, meanwhile giving other users' ls requests on other directories a chance to run simultaneously (on paths whose hashcode != the hashcode of that user's directory). Once his request completes, all his queued ls requests complete with the result of the just-completed request, without hitting the database. And so it goes. Anyhow, I would like to hear what your users experience. The ever-present ls queue should not be a problem; at least it is not a problem for us. We do not have millions of entries per directory, at most about 200K. So I am eager to hear if there are side effects.
[I removed this comment because of the next comment]
I missed this part:
Another comment. The patch did not make it so that only one listing is running. The algorithm is as follows: there is a list of threads, and the listing of a path is dispatched to a thread number defined by:
N = hashcode(path) % pnfsmanager.limits.list-threads
So, it is possible that when there is a listing that lasts an hour, all listings assigned to the same thread have to wait, while other threads are doing nothing? Actually this does not make sense to me.
This is much better than all threads being busy processing N one-hour listings and not letting anybody in at all (this is what we had to deal with at Fermilab, and hence the patch). With this patch others have a good chance, and only 1 out of N long list queries will actually run on the DB. "Things are better than they look."
Besides the queueing you observe in your monitoring, do you have customers complaining? That is the important bit.
Other things we use:
pnfsmanager.limits.list-chunk-size = 1000
(I think the default is 100.)
And raise the number of list threads higher.
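For reference, the two settings together in the pnfsmanager configuration would look roughly like this (the values are just the ones mentioned in this thread, not universal recommendations):

pnfsmanager.limits.list-threads = 33
pnfsmanager.limits.list-chunk-size = 1000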
P.S.: A listing query lasting longer than the PnfsManager timeout would trigger a timeout error in the directory list handler (so a 1-hour listing is not really possible).
Users haven't complained yet, but I'm sure they will if this continues. I added this metric to our monitoring because users complained before (when we had 2 threads, the default).
I would be very happy if this were configurable, so that we can return to the old algorithm.
Well... give it a try with the increased list thread count and chunk size. My bet: they won't complain, based on our experience.
If they don't complain - we won't make it optional :)
Sorry, whatever tuning may bring, I still think it's ugly. You have turned directory listing performance into a game of Russian roulette.
Hi Onno,
I respect your opinion, but it works much better for us. We would have a PnfsManager DoS weekly; since we deployed the patch, we have never had any issues. So give it a try first.
BTW, just to add to it: GetFileAttributes messages have been processed exactly like that for many years. I just extended the same approach to listing.
Dmitry
We increased the threads from 10 to 100 and the chunk size from 100 to 10,000. We've also tried chunk size 1,000, and 10,000 does not seem to perform differently from 1,000, so we'll probably revert to 1,000 during the next maintenance.
We see that the maximum waiting time during peak hour is reduced from ~50 minutes to ~15 minutes. The srmls timeout is 2 minutes though, so we're not out of trouble yet.
Now waiting to hear from users.
I think a 10K chunk size is a little excessive. How is the memory footprint of the PnfsManager? The ~50 minutes you quoted seems staggering to me.
Could you post:
"\sn show list activity -times"
and
"\sn info -a"
output. And make sure this variable is not set to false:
pnfsmanager.enable.folding
(default is true). Folding is absolutely crucial here.
Also, about the 50-minute wait: do I understand correctly that you have 1M entries in a directory, and that listings of that directory (and of all the unfortunate ones that have the same hashcode % number_of_threads value) queue up?
And in the past, having just 2 list threads, you had no issues? Of course, if your experience is so drastically worse than before, I can make the functionality optional or consider improvements to the algorithm.
The "info -a" output @ Fermilab public dCache system:
Statistics:
PnfsManagerV3 average±stderr(ms) min(ms) max(ms) STD(ms) Samples Period
PnfsListDirectoryMessage 275.90± 0.32 1 127,943 1,314.32 17,254,002 12 days
...
As you can see, the max is about 2 minutes. And I think anything longer would just fail.
and folding stats
PnfsManagerV3.Folded requests failed
PnfsGetFileAttributes 6392185 0
PnfsListDirectoryMessage 2779888 0
Total 9172073 0
Whereas CMS T1 instance :
Statistics:
PnfsManagerV3 average±stderr(ms) min(ms) max(ms) STD(ms) Samples Period
PnfsListDirectoryMessage 4.71± 0.01 1 10,395 24.92 15,701,235 35 days
PnfsManagerV3.Folded requests failed
PnfsGetFileAttributes 140852 0
PnfsListDirectoryMessage 3 0
Total 140855 0
Typically I see no activity:
ssh -p 22224 admin@cmsdcadiskhead "\sn show list activity -times"
Currently no list activity
The 3-million-object dir listing usually happens only a few times per day. It's a backup from an object store. That explains why there are so many objects in a single dir: object stores usually don't care about directory structure.
Indeed, all dir listings that were unlucky enough to be assigned to the same thread will have to wait for that listing to complete. Even with 100 threads, I can see around ~10 listings being queued.
It's a long time ago that we had the default of 2 threads. And yes, we had problems back then; that's why we increased it to 10. With 10 we had no problems, until the algorithm change in 8.2.23.
Srmls has a timeout of 2 minutes, but WebDAV operations don't. The 3-million-object listing is done through WebDAV. We've had to set up a special WebDAV door with a lot of memory, because otherwise it would OOM while producing the dir listing.
dcache.java.memory.heap=250G
dcache.java.memory.direct=100G
I'm aware we're pushing the boundaries here. 😉
The PnfsManager had 8 GB of heap up to now; we've just increased it to 50 GB. (No OOM here though; just trying to avoid excessive garbage collection.)
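For context, these memory settings live per domain in the dCache layout file, roughly like the sketch below (the pnfsDomain name is a placeholder for whatever domain actually hosts our PnfsManager):

[webdav2880-badger14Domain]
dcache.java.memory.heap=250G
dcache.java.memory.direct=100G

[pnfsDomain]
dcache.java.memory.heap=50G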
The show list activity -times output is currently empty; I'll try again during "rush hour".
[root@dcmain ~]# dcache-admin-command PnfsManager 'info -a'
--- ha-service-leadership-manager (Coordinates which pnfsmanager updates the filesystem stat cache) ---
HA role: LEADER
--- name-space-provider (Name space provider) ---
Acl Enabled: true
DB : HikariDataSource (HikariPool-1)
DB Engine : PostgreSQL 11.20
rootID : 000000000000000000000000000000000000
FsId : 0
Statistics:
ChimeraNameSpaceProvider average±stderr(ms) min(ms) max(ms) STD(ms) Samples Period
addInodeLocation(FsInode|int|Strin 0.10± 0.00 0 60 0.31 2,648,476 28 hours
clearInodeLocation(FsInode|int|Str 0.07± 0.00 0 108 0.29 1,986,657 28 hours
createFile(FsInode|String|int|int| 0.59± 0.00 0 616 0.88 4,394,414 28 hours
find(FsInode) 0.03± 0.00 0 148 0.19 85,238,126 28 hours
getACL(FsInode) 0.03± 0.00 0 136 0.18 221,302,823 28 hours
getAllTags(FsInode) 0.05± 0.00 0 339 0.23 72,274,940 28 hours
getInfo() 0.01± 0.00 0 10 0.15 11,946 28 hours
getInodeChecksums(FsInode) 0.04± 0.00 0 185 0.19 72,292,628 28 hours
getInodeLocations(FsInode) 0.04± 0.00 0 171 0.21 69,568,667 28 hours
getInodeLocations(FsInode|int) 0.29± 0.13 0 1 0.47 14 26 hours
getLabels(FsInode) 0.16± 0.01 0 1 0.37 1,638 28 hours
getStorageInfo(FsInode) 0.04± 0.00 0 15 0.19 1,959,977 28 hours
getXattr(FsInode|String) 0.05± 0.00 0 219 0.23 51,344,696 28 hours
id2inode(String|StatCacheOption) 0.04± 0.00 0 123 0.21 57,420,721 28 hours
inode2id(FsInode) 0.00± 0.00 0 13 0.04 18,723,880 28 hours
inode2path(FsInode) 0.13± 0.00 0 15 0.34 5,047,577 28 hours
inodeOf(FsInode|String|StatCacheOp 0.07± 0.00 0 185 0.28 2,240,736 28 hours
listXattrs(FsInode) 0.05± 0.00 0 186 0.23 81,679,273 28 hours
mkdir(FsInode|String|int|int|int) 0.49± 0.00 0 9 0.60 18,719 28 hours
mkdir(FsInode|String|int|int|int|L 2.53± 0.09 0 30 2.69 1,000 28 hours
newDirectoryStream(FsInode) 14.06± 0.54 0 7,718 97.65 32,246 28 hours
path2inode(String) 0.21± 0.00 0 63 0.42 4,918,729 28 hours
path2inodes(String) 0.30± 0.00 0 188 0.50 30,173,679 28 hours
read(FsInode|int|long|byte[]|int|i 0.05± 0.00 0 1 0.21 5,711 28 hours
remove(FsInode|String|FsInode) 0.69± 0.00 0 409 1.61 1,982,024 28 hours
rename(FsInode|FsInode|String|FsIn 0.21± 0.00 0 7 0.42 38,622 28 hours
resolvePath(String) 0.29± 0.00 0 190 0.51 12,032,066 28 hours
setInodeAttributes(FsInode|int|Sta 0.12± 0.00 0 54 0.35 2,635,645 28 hours
setInodeChecksum(FsInode|int|Strin 0.06± 0.00 0 38 0.25 2,636,635 28 hours
setStorageInfo(FsInode|InodeStorag 0.36± 0.02 0 1 0.48 480 28 hours
setXattr(FsInode|String|byte[]|Set 0.13± 0.00 0 93 0.35 2,553,683 28 hours
stat(FsInode|int) 0.05± 0.00 0 75 0.23 8,276,882 28 hours
updateFsStat() 37,629.11± 718.01 35,251 51,974 3,799.34 28 27 hours
ChimeraNameSpaceProvider requests failed
addInodeLocation(FsInode|int|Strin 2648476 34
clearInodeLocation(FsInode|int|Str 1986657 0
createFile(FsInode|String|int|int| 4394414 1758952
find(FsInode) 85238126 0
getACL(FsInode) 221302823 0
getAllTags(FsInode) 72274940 0
getInfo() 11946 0
getInodeChecksums(FsInode) 72292628 0
getInodeLocations(FsInode) 69568667 0
getInodeLocations(FsInode|int) 14 0
getLabels(FsInode) 1638 0
getStorageInfo(FsInode) 1959977 0
getXattr(FsInode|String) 51344696 0
id2inode(String|StatCacheOption) 57420721 1923593
inode2id(FsInode) 18723880 0
inode2path(FsInode) 5047577 0
inodeOf(FsInode|String|StatCacheOp 2240736 257555
listXattrs(FsInode) 81679273 0
mkdir(FsInode|String|int|int|int) 18719 10792
mkdir(FsInode|String|int|int|int|L 1000 0
newDirectoryStream(FsInode) 32246 0
path2inode(String) 4918729 217914
path2inodes(String) 30173679 2144632
read(FsInode|int|long|byte[]|int|i 5711 0
remove(FsInode|String|FsInode) 1982024 1
rename(FsInode|FsInode|String|FsIn 38622 0
resolvePath(String) 12032066 0
setInodeAttributes(FsInode|int|Sta 2635645 116
setInodeChecksum(FsInode|int|Strin 2636635 0
setStorageInfo(FsInode|InodeStorag 480 0
setXattr(FsInode|String|byte[]|Set 2553683 0
stat(FsInode|int) 8276882 5618667
updateFsStat() 28 0
Total 813443338 11932256
--- pnfs-manager-ionotify ---
atime precision: Disabled
List queue: 0
Threads (24) Queue
[0] 0
[1] 0
[2] 0
[3] 0
[4] 0
[5] 0
[6] 0
[7] 0
[8] 0
[9] 0
[10] 0
[11] 0
[12] 0
[13] 0
[14] 0
[15] 0
[16] 0
[17] 0
[18] 0
[19] 0
[20] 0
[21] 0
[22] 0
[23] 0
Threads: 0
Statistics:
PnfsManagerV3 average±stderr(ms) min(ms) max(ms) STD(ms) Samples Period
PnfsAddCacheLocationMessage 0.38± 0.02 0 4 0.52 915 28 hours
PnfsCancelUpload 3.05± 0.50 1 28 3.73 55 28 hours
PnfsClearCacheLocationMessage 0.21± 0.00 0 109 0.50 1,986,987 28 hours
PnfsCommitUpload 2.87± 0.05 2 23 1.46 945 28 hours
PnfsCreateEntryMessage 1.65± 0.00 0 641 1.41 4,421,978 28 hours
PnfsCreateSymLinkMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsCreateUploadPath 3.86± 0.09 1 32 3.00 1,000 28 hours
PnfsDeleteEntryMessage 1.71± 0.00 0 432 2.07 2,200,809 28 hours
PnfsFlagMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsGetCacheLocationsMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsGetFileAttributes 0.71± 0.00 0 189 0.81 30,810,075 28 hours
PnfsGetParentMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsListDirectoryMessage 591.89± 64.05 1 967,950 11,537.54 32,448 28 hours
PnfsListExtendedAttributesMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsMapPathMessage 0.25± 0.00 0 18 0.43 2,432,662 28 hours
PnfsReadExtendedAttributesMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsRemoveChecksumMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsRemoveExtendedAttributesMessag NaN± NaN 0 0 NaN 0 28 hours
PnfsRemoveLabelsMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsRenameMessage 1.72± 0.00 0 60 0.97 37,677 28 hours
PnfsSetFileAttributes 0.58± 0.00 0 363 0.92 2,647,360 28 hours
PnfsWriteExtendedAttributesMessage NaN± NaN 0 0 NaN 0 28 hours
PoolFileFlushedMessage 0.83± 0.04 0 16 0.86 480 28 hours
PnfsManagerV3.Folded requests failed
PnfsGetFileAttributes 176923 0
PnfsListDirectoryMessage 38 0
Total 176961 0
Max 15 minutes for a directory listing in the above. Very little folding, meaning the same directory is not being hit often enough to trigger folding. OK, let's wait for the rush hour data.
P.S.: about the srmls timeout of 2 minutes. What are those srmls calls? Are they "gfal-ls" calls executed by some SAM tests? And is the 2-minute timeout "theirs", that is, a client timeout? Or is it our internal timeout:
srmmanager.service.pnfsmanager.timeout = 120
I believe that when this one is hit, srmls fails the request, but of course PnfsManager still keeps processing it.
So I suggest raising the above timeout to 15 minutes (based on your "info -a" output).
Similarly for WebDAV:
webdav.service.pnfsmanager.timeout = 120000
(mind the units). It is the same 2-minute timeout and probably needs to be raised to 15 minutes as well.
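For example, assuming the units implied above (seconds for srmmanager, milliseconds for webdav), raising both timeouts to 15 minutes would look roughly like:

srmmanager.service.pnfsmanager.timeout = 900
webdav.service.pnfsmanager.timeout = 900000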
I missed the rush hour, will try again tomorrow.
But I saw a large spike in our listing queue graphs. It appears we had a listing that lasted 4028 seconds; I didn't expect that. In the end, 18 other listings were waiting for this one to finish.
[root@install-fes ~]# dcache-admin-command PnfsManager 'info -a' | grep PnfsListDirectoryMessage
PnfsListDirectoryMessage 708.14± 81.45 1 4,028,875 20,249.28 61,801 2 days
PnfsListDirectoryMessage 74 0
Question: can you identify the large directory that takes 4K seconds to process, see which protocol issued the listing (SRM, WebDAV?), and check that this listing indeed succeeded and did not fail with a timeout? Also, did the queue ever contain the same large directory (which could be an indication of a retry by the door)?
By now I realize that your usage pattern is different: occasionally or periodically your system runs a long listing, and with the new algorithm it backs up a few unfortunate others while it is running. This is opposed to our use case, where many users run listings of the same, sometimes largish, directories at the same time. We typically ask users to keep entries under 10K. There are a few cases of up to ~200K, but never millions.
I think this calls for making the algorithm optional for now.
At the same time I can see if I can make it more sophisticated.
NB: the previous algorithm would fail if you had 10 such queries running: all 10 threads would be locked for 4K seconds, not allowing any other listings to go through. Increasing the number of listing threads would kill the DB. This is basically what we ran into.
I think we have a very informative listing here. I ran the command in a loop every minute for some hours. This is the last snapshot with the very long listing operation, right before it completed and the queue was drained. When the very large dir had finally finished, all 22 waiting list operations completed within one minute (the next snapshot, a minute later, was completely empty).
Thu Aug 3 11:40:19 CEST 2023
QUEUED REQUESTS
SOURCE USER ARRIVED STARTED PATH
SrmManager@srmmanagerDomain lofops 2023-08-03 10:35:57.209 (64 min ago) /pnfs/grid.sara.nl/data/lofar/ops/projects/lc0_003/151812
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 10:57:58.960 (42 min ago) /pnfs/grid.sara.nl/data/swiftbackup/rivm/owncloudc02
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 10:59:59.006 (40 min ago) /pnfs/grid.sara.nl/data/swiftbackup/rivm/owncloudc02
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:00:00.537 (40 min ago) /pnfs/grid.sara.nl/data/swiftbackup/uva/owncloudf70
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:02:00.568 (38 min ago) /pnfs/grid.sara.nl/data/swiftbackup/uva/owncloudf70
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:03:34.951 (36 min ago) /pnfs/grid.sara.nl/data/swiftbackup/amsterdamumc/owncloud18c
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:05:34.977 (34 min ago) /pnfs/grid.sara.nl/data/swiftbackup/amsterdamumc/owncloud18c
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:07:08.401 (33 min ago) /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloud9ad
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:08:04.911 (32 min ago) /pnfs/grid.sara.nl/data/swiftbackup/amsterdamumc/owncloud3f3
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:09:08.467 (31 min ago) /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloud9ad
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:10:05.194 (30 min ago) /pnfs/grid.sara.nl/data/swiftbackup/amsterdamumc/owncloud3f3
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:11:17.583 (29 min ago) /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloudacb
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:13:17.676 (27 min ago) /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloudacb
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:18:59.372 (21 min ago) /pnfs/grid.sara.nl/data/swiftbackup/buas/owncloud21c
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:20:55.467 (19 min ago) /pnfs/grid.sara.nl/data/swiftbackup/zuyd/owncloude13
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:20:59.534 (19 min ago) /pnfs/grid.sara.nl/data/swiftbackup/buas/owncloud21c
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:22:40.392 (17 min ago) /pnfs/grid.sara.nl/data/swiftbackup/che/owncloud894
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:22:55.520 (17 min ago) /pnfs/grid.sara.nl/data/swiftbackup/zuyd/owncloude13
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:24:40.459 (15 min ago) /pnfs/grid.sara.nl/data/swiftbackup/che/owncloud894
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:26:54.862 (13 min ago) /pnfs/grid.sara.nl/data/swiftbackup/hvhl/owncloud593
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:28:54.906 (11 min ago) /pnfs/grid.sara.nl/data/swiftbackup/hvhl/owncloud593
SrmManager@srmmanagerDomain lofops 2023-08-03 11:30:59.602 (9 min ago) /pnfs/grid.sara.nl/data/lofar/ops/projects/lc0_003/183367
ACTIVE REQUESTS
SOURCE USER ARRIVED STARTED PATH
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 10:33:17.862 (67 min ago) 2023-08-03 10:33:17.862 (67 min ago) /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751
Here's the associated graph:
Does /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751 have millions of entries? And do you have a way of finding out whether the listing actually succeeded? (See my previous message.) Before I move on to touch the code, I want to make sure that we are not just seeing PnfsManager spinning its wheels. I think the WebDAV door access log would have an answer (whether the listing completed or timed out).
(The more I think about a listing that lasts > 1 hour, the less I am convinced that it is actually needed; it could be an artifact of some other activity, like an inadvertent "ls" whose only purpose is establishing that the destination directory exists. We have seen something like that at Fermilab.)
I can see that the dir listing gets an OOM error. It's already running with these settings (note the heap & direct memory; we've had to increase those several times in the past):
root 3152466 1 28 05:43 ? 01:23:22 /usr/bin/java -server -Xmx250G -XX:MaxDirectMemorySize=100G -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.ftp.log-aborted-transfers=true -Dorg.dcache.net.tcp.portrange=33000:34500 -Djava.security.krb5.realm= -Djava.security.krb5.kdc= -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/jgss.conf -Dcontent.types.user.table=/etc/dcache/content-types.properties -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/webdav2880-badger14Domain-oom.hprof -XX:+ExitOnOutOfMemoryError -XX:+StartAttachListener -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.9.2.jar -Djava.net.preferIPv6Addresses=system -Djava.awt.headless=true -DwantLog4jSetup=n -Dorg.bouncycastle.dh.allow_unsafe_p_value=true -Djdk.tls.ephemeralDHKeySize=2048 -Djava.net.preferIPv6Addresses=false -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start webdav2880-badger14Domain
Here's the error:
07 Aug 2023 10:32:52 (webdav2880-badger14) [] /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751
java.lang.OutOfMemoryError: null
at java.base/java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:125)
at java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:119)
at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
at java.base/java.io.OutputStream.write(OutputStream.java:122)
at io.milton.http.XmlWriter.append(XmlWriter.java:73)
at io.milton.http.XmlWriter.writeElement(XmlWriter.java:421)
at io.milton.http.XmlWriter.writeElement(XmlWriter.java:127)
at io.milton.http.XmlWriter.writeProperty(XmlWriter.java:114)
at io.milton.http.values.ToStringValueWriter.writeValue(ToStringValueWriter.java:52)
at io.milton.http.values.ValueWriters.writeValue(ValueWriters.java:98)
at io.milton.http.webdav.PropFindXmlGeneratorHelper.sendProperties(PropFindXmlGeneratorHelper.java:124)
at io.milton.http.webdav.PropFindXmlGeneratorHelper.sendKnownProperties(PropFindXmlGeneratorHelper.java:114)
at io.milton.http.webdav.PropFindXmlGeneratorHelper.appendResponse(PropFindXmlGeneratorHelper.java:102)
at io.milton.http.webdav.PropFindXmlGeneratorHelper.appendResponses(PropFindXmlGeneratorHelper.java:91)
at io.milton.http.webdav.PropFindXmlGenerator.generate(PropFindXmlGenerator.java:60)
at io.milton.http.webdav.PropFindXmlGenerator.generate(PropFindXmlGenerator.java:69)
at io.milton.http.webdav.DefaultWebDavResponseHandler.respondPropFind(DefaultWebDavResponseHandler.java:202)
at io.milton.http.AbstractWrappingResponseHandler.respondPropFind(AbstractWrappingResponseHandler.java:200)
at org.dcache.webdav.AcceptAwareResponseHandler.respondPropFind(AcceptAwareResponseHandler.java:116)
at io.milton.http.AbstractWrappingResponseHandler.respondPropFind(AbstractWrappingResponseHandler.java:200)
at org.dcache.webdav.WorkaroundsResponseHandler.respondPropFind(WorkaroundsResponseHandler.java:138)
at io.milton.http.AbstractWrappingResponseHandler.respondPropFind(AbstractWrappingResponseHandler.java:200)
at io.milton.http.AbstractWrappingResponseHandler.respondPropFind(AbstractWrappingResponseHandler.java:200)
at io.milton.http.webdav.PropFindHandler.processExistingResource(PropFindHandler.java:156)
at io.milton.http.ResourceHandlerHelper.processResource(ResourceHandlerHelper.java:196)
at io.milton.http.webdav.PropFindHandler.processResource(PropFindHandler.java:112)
at io.milton.http.ResourceHandlerHelper.process(ResourceHandlerHelper.java:127)
at org.dcache.webdav.DcacheResourceHandlerHelper.process(DcacheResourceHandlerHelper.java:42)
at io.milton.http.webdav.PropFindHandler.process(PropFindHandler.java:106)
at org.dcache.webdav.DcacheStandardFilter.process(DcacheStandardFilter.java:50)
at io.milton.http.FilterChain.process(FilterChain.java:46)
at org.dcache.webdav.transfer.CopyFilter.process(CopyFilter.java:276)
at io.milton.http.FilterChain.process(FilterChain.java:46)
at io.milton.http.HttpManager.process(HttpManager.java:158)
at org.dcache.webdav.MiltonHandler.handle(MiltonHandler.java:77)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.dcache.http.AuthenticationHandler.access$001(AuthenticationHandler.java:55)
at org.dcache.http.AuthenticationHandler.lambda$handle$0(AuthenticationHandler.java:156)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
at org.dcache.http.AuthenticationHandler.handle(AuthenticationHandler.java:153)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.dcache.http.AbstractLoggingHandler.handle(AbstractLoggingHandler.java:110)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handle(Server.java:516)
at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:400)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:645)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:392)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:555)
at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:410)
at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:164)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.base/java.lang.Thread.run(Thread.java:829)
When I did an ls in an NFS-mounted /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751, it was like a DoS of our PnfsManager. The transfer success rate (for everyone, not only this user) dropped to 20%. After I restarted the PnfsManager things got back to normal. I don't know how NFS interacts with the PnfsManager, but I bet it's not through your algorithm 😉
I have asked the user to tell me how many files are supposed to be in this dir. Perhaps on the client side he might be able to see this more easily.
The client is Rclone. I think Rclone wants to get a dir listing because it only uploads files that don't exist yet in the destination dir, or have changed since last upload.
NFS bypasses PnfsManager altogether when running ls.
You can check the number of files by running:
select inlink from t_inodes where ipnfsid = '<pnfsid of that directory>';
So I think it is fair to conclude that you have a problem with a large directory that:
- is being listed and never completes
- impacts other listings due to the algorithm change. Again, the algorithm was changed to avoid situations where multiple listings of the same large directory hog all available threads, and it does work nicely, but we do not have > 200K entries per directory. I think I will just add a new variable to make this behavior optional. Sorry for not anticipating your use case.
P.S.: I was surprised that the listing is not just timing out on the door end. Maybe it hits OOM sooner!?
[root@db1 ~]# psql -U postgres chimera -c "select inlink from t_inodes where ipnfsid = '00005D8D033B902A42249F3A7510D1146E28'"
inlink
----------
13961036
(1 row)
Wow, almost 14 million... Even more than I thought.
Thanks for looking into this. In the short term it would be wonderful if the algorithm were configurable. For the future, I foresee more problems with such a huge and growing directory. We might need to set up a separate instance for them so they don't DoS other users, but that's a big operation that we can't do on short notice.
Returning from holiday, I became aware of this issue and ticket. To reinforce Onno's point, I'd like to share that our case (the LOFAR Long-Term Archive) is certainly affected by the new behaviour. LOFAR is significantly different from the FERMI case as I understand it.
For LOFAR
- Files are generally structured hierarchically in the archive with a few hundred to a few thousand files per directory
- We do not allow our users to browse the archive (instead, file locations are provided through our archive database), therefore multiple listings for the same directory are uncommon
- We do operate services that do many (>> 100) directory listings for indexing, verification, and consistency checking. As a result, up to one in every so many listings is now expected to time out with the new algorithm.
As this looks like spurious behaviour, it was not identified until now, although we have seen a slight rise in unexplained operations at SURF that take much longer than we are used to, or that fail. Looking into the logs for one of our services, it is nevertheless clear that, starting July 1st, the following error has been reported multiple times per day for listings at SURF: "List failed [rc=10006,msg=Timeout during directory listing.]"
From the thread I conclude that the issue is planned to be addressed by making the algorithm for directory listings configurable. For LOFAR this fix will be very important and much appreciated.
The problem will hit you when you have used up all available threads with the "old algorithm"; increasing the number of list threads will then lead to database thrashing, which leads to timeouts. But yes, we will make the listing behavior optional ASAP.
Pull request has been made.
@onnozweers, are there subdirectories under /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751?
Not directly related to the issue, but we seem to be seeing long-running listings by WebDAV clients. We suspect that PROPFIND calls without a Depth header result in a recursive listing.
This is because even a 15M-entry directory containing just files should have completed in less than 70 minutes.
P.S.: Never mind, it is not a recursive listing; rather, the door receives your 15M entries and starts to loop over them, sending a PROPFIND request for each of them. This is where all the time is spent. I just got to the bottom of it.
We saw the patch in 8.2.28. Thanks a lot! I really appreciate the effort you put into this. We'll update soon.
About the PROPFIND, not sure what you mean.
Do you mean that the door, while getting entries from the PnfsManager, also starts sending PROPFIND results? And that these slow down the listing operation in the PnfsManager?
Anyway, let me know if you need my help. I'll be out of office for a few weeks though.