Only one directory listing active, the rest queued; pnfsmanager.limits.list-threads broke
Dear dCache devs,
I have also posted this on the user forum, but there was no reply. However, this issue is affecting our users.
We have configured the following value so that 10 directory listings can be processed simultaneously instead of the default 2:
pnfsmanager.limits.list-threads = 10
And this used to work. But now when I look at the list activity, I see only one request active, and the others queued.
[root@dcmain /etc/dcache]# dcache-admin-command PnfsManager 'show list activity'
QUEUED REQUESTS
SOURCE USER PATH
webdav2880-badger14@webdav2880-badger14Domain buswift /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloudeb7
webdav2880-badger14@webdav2880-badger14Domain buswift /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloudeb7
webdav2880-badger14@webdav2880-badger14Domain buswift /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloudee4
webdav2880-badger14@webdav2880-badger14Domain buswift /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloudee4
webdav2880-badger14@webdav2880-badger14Domain buswift /pnfs/grid.sara.nl/data/swiftbackup/avans
webdav443-seacow14@webdav443-seacow14Domain tropl1b-proc /pnfs/grid.sara.nl/data/knmi-tropomi/disk/nadc/archive/S5P_ICM_CA_UVN_TRIG/2023/07/07
ACTIVE REQUESTS
SOURCE USER PATH
webdav2880-badger14@webdav2880-badger14Domain buswift /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751
I’m afraid this causes problems for our users: see the attached graph. We have some extremely large directories (3 million objects), so we absolutely need some parallelism.
Cheers, Onno
We're running 8.2.24.
Looking at the graphs, the queueing seems to have started when we upgraded from 8.2.16 to 8.2.24, at the end of June. Java remained at version 11.0.18.
This looks like it could be related to a8a6912fdb, @DmitryLitvintsev. The commit was included in 8.2.23.
Hi Onno,
Does it cause problems for the users, or do you just think it is causing problems?
Here is the description of the patch
pnfsmanager: process listings of the same path sequentially, implement list folding
Motivation:
Multiple batch jobs executing listings of the same directories, especially if the directories contain many entries, quickly exhaust the active threads, each thread executing the same listing, denying listing to all other customers and creating significant load on the namespace server (and DB). Clients time out and retry, creating even more load on dCache.
Modification:
The problem is solved by implementing PnfsListDirectoryMessage folding
and dispatching list requests to multiple queues (the number of queues is
configurable) based on the directory path hashCode. Each queue is served by a single
thread. This guarantees that listings of the same path are queued to
the same queue and that only one listing of a unique path can be active.
Once a listing is complete, the queue is scanned for all other requests for
the same path and these messages are folded (that is, their payload is
populated from the just-completed listing) and routed back to the sender without
ever hitting the database.
Ancillary changes include: instead of a hardcoded time in the future at which
the quota and FS stat scans start, use the already available settings for the
duration between runs. This is to avoid triggering these scans when PnfsManager is
restarted often (when testing, for instance). A few "final" qualifiers were also added
where suggested by the IDE.
Result:
Lists of large directories no longer hog PnfsManage denying service to all
other list calls. Additionally observed significnat reduction of CPU and
IO load on chimera DB host. Patch has been running in production at Fermilab
and has shown to solve the issue of listings timing out.
patch: https://rb.dcache.org/r/13992/
The way it works is that when you have an active listing and another listing for the same directory queued, the queued ones get the result of the active one once it has completed. So there still is parallelism.
It works very well at Fermilab and in fact fixed a lot of our listing hangs (when multiple large directories are listed at the same time).
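To make the folding idea concrete, here is a minimal, illustrative Java sketch of it; this is not dCache's actual PnfsManager code, and the class and field names are made up:

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

public class ListFoldingSketch {

    static class ListRequest {
        final String path;
        List<String> entries;                 // payload, filled in when the request is answered
        ListRequest(String path) { this.path = path; }
    }

    // Called after the active listing ('completed') has finished: every queued
    // request for the same path reuses the result instead of hitting the DB again.
    static void foldQueued(Queue<ListRequest> queue, ListRequest completed) {
        for (ListRequest queued : queue) {
            if (queued.path.equals(completed.path)) {
                queued.entries = completed.entries;   // fold: copy the payload
                // in dCache the folded message would now be routed back to its sender
            }
        }
        queue.removeIf(r -> r.entries != null);       // folded requests leave the queue
    }

    public static void main(String[] args) {
        Queue<ListRequest> queue = new ArrayDeque<>();
        queue.add(new ListRequest("/pnfs/example.org/data/big-dir"));
        queue.add(new ListRequest("/pnfs/example.org/data/other-dir"));

        ListRequest active = new ListRequest("/pnfs/example.org/data/big-dir");
        active.entries = List.of("file-a", "file-b"); // result of the one real DB listing

        foldQueued(queue, active);
        System.out.println("still queued: " + queue.size());   // prints 1: only other-dir remains
    }
}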
I use:
pnfsmanager.limits.list-threads = 33
Try that :).
This is how this typically looks on our system:
QUEUED REQUESTS
SOURCE USER PATH
GFTP3-fndca4b-AAYBUPdNoBg@gridftp3-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP9-fndca4b-AAYBUPddRAA@gridftp9-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP10-fndca4b-AAYBUPdf58g@gridftp10-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP4-fndca4b-AAYBUPdxlyA@gridftp4-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox/0/1/1
Xrootd01-fndca4b@xrootd01-fndca4bDomain <unknown> /pnfs/fnal.gov/usr/uboone/persistent/stash/uboonebeam/beam_data/bnb_data/00/02/44/83
GFTP1-fndca4b-AAYBUPeJhCA@gridftp1-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP4-fndca4b-AAYBUPeOv_g@gridftp4-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP9-fndca4b-AAYBUPeY9UA@gridftp9-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP8-fndca4b-AAYBUPe9HxA@gridftp8-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP4-fndca4b-AAYBUPgBhsg@gridftp4-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP8-fndca4b-AAYBUPgmmvg@gridftp8-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP0-fndca4b-AAYBUPgwP7g@gridftp0-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP3-fndca4b-AAYBUPhFMwA@gridftp3-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP6-fndca4b-AAYBUPhFrBg@gridftp6-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP5-fndca4b-AAYBUPhK4CA@gridftp5-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP6-fndca4b-AAYBUPhPHhA@gridftp6-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
GFTP8-fndca4b-AAYBUPhkFUA@gridftp8-fndca4bDomain novapro /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox
Xrootd01-fndca4b@xrootd01-fndca4bDomain <unknown> /pnfs/fnal.gov/usr/minerva/persistent/stash/pidDB/data/00016730
WebDAV00-fndcadoor02@webdav-fndcadoor02Domain uboonepro /pnfs/fnal.gov/usr/uboone/scratch/uboonepro/dropbox/merge/data/uboone/root-tuple/38
WebDAV00-fndcadoor03@webdav-fndcadoor03Domain uboonepro /pnfs/fnal.gov/usr/uboone/scratch/uboonepro/dropbox/merge/overlay/uboone/root-tuple/2
ACTIVE REQUESTS
SOURCE USER PATH
WebDAV-fndca4b-2@webdavDomain1 131.225.240.62 /pnfs/fnal.gov/usr/mu2e/resilient/jobsub_stage
WebDAV00-fndcadoor02@webdav-fndcadoor02Domain mnieslon /pnfs/fnal.gov/usr/annie/persistent/users/moflaher/g4dirt_vincentsgenie/BNB_Water_10k_22-05-17
WebDAV00-fndcadoor02@webdav-fndcadoor02Domain uboonepro /pnfs/fnal.gov/usr/uboone/scratch/uboonepro/dropbox/merge/data/uboone/root-tuple/68
In the above, the listing of /pnfs/fnal.gov/usr/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox is queued, but once the first of them completes, all of them complete immediately.
Another comment. The patch did not make it so that only one listing is running. The algorithm is as follows: there is a list of threads, and the listing of a path is dispatched to a thread number defined by:
N = hashcode(path) % pnfsmanager.limits.list-threads
So it is not the case that there is always just one listing thread running. Yes, there could be situations where a large directory has the same hashcode as a small directory, and in that case the small directory would have to wait for the large one to complete. But as I said, we have seen a very tangible improvement at Fermilab.
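Purely for illustration, the dispatch step could look like the Java sketch below; the real PnfsManager code may differ in details, for example in how negative hash codes are handled:

public class ListDispatchSketch {

    // N = hashcode(path) % pnfsmanager.limits.list-threads
    // Math.floorMod keeps the index non-negative even when hashCode() is negative.
    static int queueIndexFor(String path, int listThreads) {
        return Math.floorMod(path.hashCode(), listThreads);
    }

    public static void main(String[] args) {
        int listThreads = 10;   // e.g. pnfsmanager.limits.list-threads = 10
        String[] paths = {
            "/pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751",
            "/pnfs/grid.sara.nl/data/lofar/ops/projects/lc0_003/151812"
        };
        for (String p : paths) {
            // two listings can block each other only if they land on the same queue index
            System.out.println("queue " + queueIndexFor(p, listThreads) + "  " + p);
        }
    }
}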
If this continues to give you trouble, we may make this functionality optional (yet another variable).
Also... one thing to note. I have been doing the same kind of watching of "show list activity" and I must say that it is a little deceptive, particularly the ACTIVE listing. I have noticed that you only "catch" listings in ACTIVE if they last at least a few seconds.
E.g., if you think there is only one listing running, try timing a listing while you see one ACTIVE. In my experience you get the result right away and never see it become active. You need to use any protocol other than NFS, since NFS queries the database directly, bypassing PnfsManager (I casually use uberftp ls for this sort of thing).
P.S.: we generally see more queueing than before, but we do not see one user causing a DoS by running multiple ls commands (via batch jobs) on large directories (those would starve all available threads). The behavior now: if such a user hits the system, his ls requests will queue up until the first completes, meanwhile giving other users' ls requests on other directories a chance to run simultaneously (on paths whose hashcode != the hashcode of that user's directory). Once his request completes, all his queued ls requests complete with the result of the just-completed request, without hitting the database. And so it goes. Anyhow, I would like to hear what your users experience. The ever-present ls queue should not be a problem; at least it is not a problem for us. We do not have millions of entries per directory, at most about 200K. So I am eager to hear if there are side effects.
[I removed this comment because of the next comment]
I missed this part:
Another comment. The patch did not make it so that only one listing is running. The algorithm is as follows: there is a list of threads, and the listing of a path is dispatched to a thread number defined by:
N = hashcode(path) % pnfsmanager.limits.list-threads
So, it is possible that when there is a listing that lasts an hour, all listings assigned to the same thread have to wait, while other threads are doing nothing? Actually this does not make sense to me.
This is much better than all threads being busy processing N one-hour listings and not letting anybody in at all (this is what we had to deal with at Fermilab, and hence the patch). With this patch others have a good chance, and only 1 out of N long list queries will actually run on the DB. "Things are better than they look."
Besides the queueing you observe in your monitoring, do you have customers complaining? That is the important bit.
Other things we use:
pnfsmanager.limits.list-chunk-size = 1000
(I think the default is 100.)
And raise the number of list threads higher.
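For reference, the two settings together in the pnfsmanager configuration would look roughly like this (the values are just the ones mentioned in this thread, not universal recommendations):

pnfsmanager.limits.list-threads = 33
pnfsmanager.limits.list-chunk-size = 1000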
P.S.: A listing query lasting longer than the PnfsManager timeout would trigger a timeout error in the directory list handler (so a 1-hour listing is not really possible).
Users haven't complained yet, but I'm sure they will if this continues. I added this metric to our monitoring because users complained before (when we had 2 threads, the default).
I would be very happy if this were configurable, so that we can return to the old algorithm.
Well... give it a try with the increased list thread count and chunk size. My bet: they won't complain, based on our experience.
If they don't complain - we won't make it optional :)
Sorry, whatever tuning may bring, I still think it's ugly. You have turned directory listing performance into a game of Russian roulette.
Hi Onno,
I respect your opinion, but it works much better for us. We would have a PnfsManager DoS weekly; since we deployed the patch, we have never had any issues. So give it a try first.
BTW, just to add to it: GetFileAttributes messages have been processed exactly like that for many years. I just extended the same approach to listing.
Dmitry
We increased the threads from 10 to 100 and the chunk size from 100 to 10,000. We've also tried chunk size 1,000, and 10,000 does not seem to perform differently from 1,000, so we'll probably revert to 1,000 during the next maintenance.
We see that the maximum waiting time during peak hour is reduced from ~50 minutes to ~15 minutes. The srmls timeout is 2 minutes though, so we're not out of trouble yet.
Now waiting to hear from users.
I think a 10K chunk size is a little excessive. How is the memory footprint of the PnfsManager? The ~50 minutes you quoted seems staggering to me.
Could you post:
"\sn show list activity -times"
and
"\sn info -a"
output. And make sure this variable is not set to false:
pnfsmanager.enable.folding
(default is true). Folding is absolutely crucial here.
Also, about the 50-minute wait: do I understand correctly that you have 1M entries in a directory, and that listings of that directory (and of all the unfortunate ones that have the same hashcode % number_of_threads value) queue up?
And in the past, having just 2 list threads, you had no issues? Of course, if your experience is so drastically worse than before, I can make the functionality optional or consider improvements to the algorithm.
The "info -a" output @ Fermilab public dCache system:
Statistics:
PnfsManagerV3 average±stderr(ms) min(ms) max(ms) STD(ms) Samples Period
PnfsListDirectoryMessage 275.90± 0.32 1 127,943 1,314.32 17,254,002 12 days
...
As you can see, the max is about 2 minutes. And I think anything longer would just fail.
and folding stats
PnfsManagerV3.Folded requests failed
PnfsGetFileAttributes 6392185 0
PnfsListDirectoryMessage 2779888 0
Total 9172073 0
Whereas CMS T1 instance :
Statistics:
PnfsManagerV3 average±stderr(ms) min(ms) max(ms) STD(ms) Samples Period
PnfsListDirectoryMessage 4.71± 0.01 1 10,395 24.92 15,701,235 35 days
PnfsManagerV3.Folded requests failed
PnfsGetFileAttributes 140852 0
PnfsListDirectoryMessage 3 0
Total 140855 0
Typically I see no activity:
ssh -p 22224 admin@cmsdcadiskhead "\sn show list activity -times"
Currently no list activity
The 3-million-object dir listing usually happens only a few times per day. It's a backup from an object store. That explains why there are so many objects in a single dir: object stores usually don't care about directory structure.
Indeed, all dir listings that were unlucky enough to be assigned to the same thread will have to wait for that listing to complete. Even with 100 threads, I can see around ~10 listings being queued.
It's a long time ago that we had the default of 2 threads. And yes, we had problems back then; that's why we increased it to 10. With 10 we had no problems, until the algorithm change in 8.2.23.
Srmls has a timeout of 2 minutes, but WebDAV operations don't. The 3-million-object listing is done through WebDAV. We've had to set up a special WebDAV door with a lot of memory, because otherwise it would OOM while producing the dir listing.
dcache.java.memory.heap=250G
dcache.java.memory.direct=100G
I'm aware we're pushing the boundaries here. 😉
The PnfsManager had 8 GB of heap up to now; we've just increased it to 50 GB. (No OOM here though; just trying to avoid excessive garbage collection.)
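For context, these memory settings live per domain in the dCache layout file, roughly like the sketch below (the pnfsDomain name is a placeholder for whatever domain actually hosts our PnfsManager):

[webdav2880-badger14Domain]
dcache.java.memory.heap=250G
dcache.java.memory.direct=100G

[pnfsDomain]
dcache.java.memory.heap=50G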
The show list activity -times output is currently empty; I'll try again during "rush hour".
[root@dcmain ~]# dcache-admin-command PnfsManager 'info -a'
--- ha-service-leadership-manager (Coordinates which pnfsmanager updates the filesystem stat cache) ---
HA role: LEADER
--- name-space-provider (Name space provider) ---
Acl Enabled: true
DB : HikariDataSource (HikariPool-1)
DB Engine : PostgreSQL 11.20
rootID : 000000000000000000000000000000000000
FsId : 0
Statistics:
ChimeraNameSpaceProvider average±stderr(ms) min(ms) max(ms) STD(ms) Samples Period
addInodeLocation(FsInode|int|Strin 0.10± 0.00 0 60 0.31 2,648,476 28 hours
clearInodeLocation(FsInode|int|Str 0.07± 0.00 0 108 0.29 1,986,657 28 hours
createFile(FsInode|String|int|int| 0.59± 0.00 0 616 0.88 4,394,414 28 hours
find(FsInode) 0.03± 0.00 0 148 0.19 85,238,126 28 hours
getACL(FsInode) 0.03± 0.00 0 136 0.18 221,302,823 28 hours
getAllTags(FsInode) 0.05± 0.00 0 339 0.23 72,274,940 28 hours
getInfo() 0.01± 0.00 0 10 0.15 11,946 28 hours
getInodeChecksums(FsInode) 0.04± 0.00 0 185 0.19 72,292,628 28 hours
getInodeLocations(FsInode) 0.04± 0.00 0 171 0.21 69,568,667 28 hours
getInodeLocations(FsInode|int) 0.29± 0.13 0 1 0.47 14 26 hours
getLabels(FsInode) 0.16± 0.01 0 1 0.37 1,638 28 hours
getStorageInfo(FsInode) 0.04± 0.00 0 15 0.19 1,959,977 28 hours
getXattr(FsInode|String) 0.05± 0.00 0 219 0.23 51,344,696 28 hours
id2inode(String|StatCacheOption) 0.04± 0.00 0 123 0.21 57,420,721 28 hours
inode2id(FsInode) 0.00± 0.00 0 13 0.04 18,723,880 28 hours
inode2path(FsInode) 0.13± 0.00 0 15 0.34 5,047,577 28 hours
inodeOf(FsInode|String|StatCacheOp 0.07± 0.00 0 185 0.28 2,240,736 28 hours
listXattrs(FsInode) 0.05± 0.00 0 186 0.23 81,679,273 28 hours
mkdir(FsInode|String|int|int|int) 0.49± 0.00 0 9 0.60 18,719 28 hours
mkdir(FsInode|String|int|int|int|L 2.53± 0.09 0 30 2.69 1,000 28 hours
newDirectoryStream(FsInode) 14.06± 0.54 0 7,718 97.65 32,246 28 hours
path2inode(String) 0.21± 0.00 0 63 0.42 4,918,729 28 hours
path2inodes(String) 0.30± 0.00 0 188 0.50 30,173,679 28 hours
read(FsInode|int|long|byte[]|int|i 0.05± 0.00 0 1 0.21 5,711 28 hours
remove(FsInode|String|FsInode) 0.69± 0.00 0 409 1.61 1,982,024 28 hours
rename(FsInode|FsInode|String|FsIn 0.21± 0.00 0 7 0.42 38,622 28 hours
resolvePath(String) 0.29± 0.00 0 190 0.51 12,032,066 28 hours
setInodeAttributes(FsInode|int|Sta 0.12± 0.00 0 54 0.35 2,635,645 28 hours
setInodeChecksum(FsInode|int|Strin 0.06± 0.00 0 38 0.25 2,636,635 28 hours
setStorageInfo(FsInode|InodeStorag 0.36± 0.02 0 1 0.48 480 28 hours
setXattr(FsInode|String|byte[]|Set 0.13± 0.00 0 93 0.35 2,553,683 28 hours
stat(FsInode|int) 0.05± 0.00 0 75 0.23 8,276,882 28 hours
updateFsStat() 37,629.11± 718.01 35,251 51,974 3,799.34 28 27 hours
ChimeraNameSpaceProvider requests failed
addInodeLocation(FsInode|int|Strin 2648476 34
clearInodeLocation(FsInode|int|Str 1986657 0
createFile(FsInode|String|int|int| 4394414 1758952
find(FsInode) 85238126 0
getACL(FsInode) 221302823 0
getAllTags(FsInode) 72274940 0
getInfo() 11946 0
getInodeChecksums(FsInode) 72292628 0
getInodeLocations(FsInode) 69568667 0
getInodeLocations(FsInode|int) 14 0
getLabels(FsInode) 1638 0
getStorageInfo(FsInode) 1959977 0
getXattr(FsInode|String) 51344696 0
id2inode(String|StatCacheOption) 57420721 1923593
inode2id(FsInode) 18723880 0
inode2path(FsInode) 5047577 0
inodeOf(FsInode|String|StatCacheOp 2240736 257555
listXattrs(FsInode) 81679273 0
mkdir(FsInode|String|int|int|int) 18719 10792
mkdir(FsInode|String|int|int|int|L 1000 0
newDirectoryStream(FsInode) 32246 0
path2inode(String) 4918729 217914
path2inodes(String) 30173679 2144632
read(FsInode|int|long|byte[]|int|i 5711 0
remove(FsInode|String|FsInode) 1982024 1
rename(FsInode|FsInode|String|FsIn 38622 0
resolvePath(String) 12032066 0
setInodeAttributes(FsInode|int|Sta 2635645 116
setInodeChecksum(FsInode|int|Strin 2636635 0
setStorageInfo(FsInode|InodeStorag 480 0
setXattr(FsInode|String|byte[]|Set 2553683 0
stat(FsInode|int) 8276882 5618667
updateFsStat() 28 0
Total 813443338 11932256
--- pnfs-manager-ionotify ---
atime precision: Disabled
List queue: 0
Threads (24) Queue
[0] 0
[1] 0
[2] 0
[3] 0
[4] 0
[5] 0
[6] 0
[7] 0
[8] 0
[9] 0
[10] 0
[11] 0
[12] 0
[13] 0
[14] 0
[15] 0
[16] 0
[17] 0
[18] 0
[19] 0
[20] 0
[21] 0
[22] 0
[23] 0
Threads: 0
Statistics:
PnfsManagerV3 average±stderr(ms) min(ms) max(ms) STD(ms) Samples Period
PnfsAddCacheLocationMessage 0.38± 0.02 0 4 0.52 915 28 hours
PnfsCancelUpload 3.05± 0.50 1 28 3.73 55 28 hours
PnfsClearCacheLocationMessage 0.21± 0.00 0 109 0.50 1,986,987 28 hours
PnfsCommitUpload 2.87± 0.05 2 23 1.46 945 28 hours
PnfsCreateEntryMessage 1.65± 0.00 0 641 1.41 4,421,978 28 hours
PnfsCreateSymLinkMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsCreateUploadPath 3.86± 0.09 1 32 3.00 1,000 28 hours
PnfsDeleteEntryMessage 1.71± 0.00 0 432 2.07 2,200,809 28 hours
PnfsFlagMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsGetCacheLocationsMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsGetFileAttributes 0.71± 0.00 0 189 0.81 30,810,075 28 hours
PnfsGetParentMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsListDirectoryMessage 591.89± 64.05 1 967,950 11,537.54 32,448 28 hours
PnfsListExtendedAttributesMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsMapPathMessage 0.25± 0.00 0 18 0.43 2,432,662 28 hours
PnfsReadExtendedAttributesMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsRemoveChecksumMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsRemoveExtendedAttributesMessag NaN± NaN 0 0 NaN 0 28 hours
PnfsRemoveLabelsMessage NaN± NaN 0 0 NaN 0 28 hours
PnfsRenameMessage 1.72± 0.00 0 60 0.97 37,677 28 hours
PnfsSetFileAttributes 0.58± 0.00 0 363 0.92 2,647,360 28 hours
PnfsWriteExtendedAttributesMessage NaN± NaN 0 0 NaN 0 28 hours
PoolFileFlushedMessage 0.83± 0.04 0 16 0.86 480 28 hours
PnfsManagerV3.Folded requests failed
PnfsGetFileAttributes 176923 0
PnfsListDirectoryMessage 38 0
Total 176961 0
Max 15 minutes for a directory listing in the above. Very little folding, meaning the same directory is not being hit often enough to trigger folding. OK, let's wait for the rush hour data.
P.S.: about the srmls timeout of 2 minutes. What are those srmls calls? Are they "gfal-ls" calls executed by some SAM tests? And is the 2-minute timeout "theirs", that is, a client timeout? Or is it our internal timeout:
srmmanager.service.pnfsmanager.timeout = 120
I believe that when this one is hit, srmls fails the request, but of course PnfsManager still keeps processing it.
So I suggest raising the above timeout to 15 minutes (based on your "info -a" output).
Similarly for WebDAV:
webdav.service.pnfsmanager.timeout = 120000
(mind the units). It is the same 2-minute timeout and probably needs to be raised to 15 minutes as well.
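For example, assuming the units implied above (seconds for srmmanager, milliseconds for webdav), raising both timeouts to 15 minutes would look roughly like:

srmmanager.service.pnfsmanager.timeout = 900
webdav.service.pnfsmanager.timeout = 900000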
I missed the rush hour, will try again tomorrow.
But I saw a large spike in our listing queue graphs. It appears we had a listing that lasted 4028 seconds; I didn't expect that. In the end, 18 other listings were waiting for this one to finish.
[root@install-fes ~]# dcache-admin-command PnfsManager 'info -a' | grep PnfsListDirectoryMessage
PnfsListDirectoryMessage 708.14± 81.45 1 4,028,875 20,249.28 61,801 2 days
PnfsListDirectoryMessage 74 0
Question: can you identify the large directory that takes 4K seconds to process, see which protocol issued the listing (SRM, WebDAV?), and check that this listing indeed succeeded and did not fail with a timeout? Also, did the queue ever contain the same large directory (which could be an indication of a retry by the door)?
By now I realize that your usage pattern is different: occasionally or periodically your system runs a long listing, and with the new algorithm it backs up a few unfortunate others while it is running. This is opposed to our use case, where many users run listings of the same, sometimes largish, directories at the same time. We typically ask users to keep entries under 10K. There are a few cases of up to ~200K, but never millions.
I think this calls for making the algorithm optional for now.
At the same time I can see if I can make it more sophisticated.
NB: the previous algorithm would fail if you had 10 such queries running: all 10 threads would be locked for 4K seconds, not allowing any other listings to go through. Increasing the number of listing threads would kill the DB. This is basically what we ran into.
I think we have a very informative listing here. I ran the command in a loop every minute for some hours. This is the last snapshot with the very long listing operation, right before it completed and the queue was drained. When the very large dir had finally finished, all 22 waiting list operations completed within one minute (the next snapshot, a minute later, was completely empty).
Thu Aug 3 11:40:19 CEST 2023
QUEUED REQUESTS
SOURCE USER ARRIVED STARTED PATH
SrmManager@srmmanagerDomain lofops 2023-08-03 10:35:57.209 (64 min ago) /pnfs/grid.sara.nl/data/lofar/ops/projects/lc0_003/151812
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 10:57:58.960 (42 min ago) /pnfs/grid.sara.nl/data/swiftbackup/rivm/owncloudc02
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 10:59:59.006 (40 min ago) /pnfs/grid.sara.nl/data/swiftbackup/rivm/owncloudc02
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:00:00.537 (40 min ago) /pnfs/grid.sara.nl/data/swiftbackup/uva/owncloudf70
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:02:00.568 (38 min ago) /pnfs/grid.sara.nl/data/swiftbackup/uva/owncloudf70
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:03:34.951 (36 min ago) /pnfs/grid.sara.nl/data/swiftbackup/amsterdamumc/owncloud18c
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:05:34.977 (34 min ago) /pnfs/grid.sara.nl/data/swiftbackup/amsterdamumc/owncloud18c
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:07:08.401 (33 min ago) /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloud9ad
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:08:04.911 (32 min ago) /pnfs/grid.sara.nl/data/swiftbackup/amsterdamumc/owncloud3f3
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:09:08.467 (31 min ago) /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloud9ad
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:10:05.194 (30 min ago) /pnfs/grid.sara.nl/data/swiftbackup/amsterdamumc/owncloud3f3
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:11:17.583 (29 min ago) /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloudacb
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:13:17.676 (27 min ago) /pnfs/grid.sara.nl/data/swiftbackup/nhlstenden/owncloudacb
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:18:59.372 (21 min ago) /pnfs/grid.sara.nl/data/swiftbackup/buas/owncloud21c
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:20:55.467 (19 min ago) /pnfs/grid.sara.nl/data/swiftbackup/zuyd/owncloude13
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:20:59.534 (19 min ago) /pnfs/grid.sara.nl/data/swiftbackup/buas/owncloud21c
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:22:40.392 (17 min ago) /pnfs/grid.sara.nl/data/swiftbackup/che/owncloud894
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:22:55.520 (17 min ago) /pnfs/grid.sara.nl/data/swiftbackup/zuyd/owncloude13
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:24:40.459 (15 min ago) /pnfs/grid.sara.nl/data/swiftbackup/che/owncloud894
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:26:54.862 (13 min ago) /pnfs/grid.sara.nl/data/swiftbackup/hvhl/owncloud593
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 11:28:54.906 (11 min ago) /pnfs/grid.sara.nl/data/swiftbackup/hvhl/owncloud593
SrmManager@srmmanagerDomain lofops 2023-08-03 11:30:59.602 (9 min ago) /pnfs/grid.sara.nl/data/lofar/ops/projects/lc0_003/183367
ACTIVE REQUESTS
SOURCE USER ARRIVED STARTED PATH
webdav2880-badger14@webdav2880-badger14Domain buswift 2023-08-03 10:33:17.862 (67 min ago) 2023-08-03 10:33:17.862 (67 min ago) /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751
Here's the associated graph:
Does /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751 have millions of entries? And do you have a way of finding out whether the listing actually succeeded? (See my previous message.) Before I move on to touch the code, I want to make sure that we are not just seeing PnfsManager spinning its wheels. I think the WebDAV door access log would have an answer (whether the listing completed or timed out).
(The more I think about a listing that lasts > 1 hour, the less I am convinced that it is actually needed; it could be an artifact of some other activity, like an inadvertent "ls" whose only purpose is establishing that the destination directory exists. We have seen something like that at Fermilab.)
I can see that the dir listing gets an OOM error. It's already running with these settings (note the heap & direct memory; we've had to increase those several times in the past):
root 3152466 1 28 05:43 ? 01:23:22 /usr/bin/java -server -Xmx250G -XX:MaxDirectMemorySize=100G -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.ftp.log-aborted-transfers=true -Dorg.dcache.net.tcp.portrange=33000:34500 -Djava.security.krb5.realm= -Djava.security.krb5.kdc= -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/jgss.conf -Dcontent.types.user.table=/etc/dcache/content-types.properties -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/webdav2880-badger14Domain-oom.hprof -XX:+ExitOnOutOfMemoryError -XX:+StartAttachListener -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.9.2.jar -Djava.net.preferIPv6Addresses=system -Djava.awt.headless=true -DwantLog4jSetup=n -Dorg.bouncycastle.dh.allow_unsafe_p_value=true -Djdk.tls.ephemeralDHKeySize=2048 -Djava.net.preferIPv6Addresses=false -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start webdav2880-badger14Domain
Here's the error:
07 Aug 2023 10:32:52 (webdav2880-badger14) [] /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751
java.lang.OutOfMemoryError: null
at java.base/java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:125)
at java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:119)
at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
at java.base/java.io.OutputStream.write(OutputStream.java:122)
at io.milton.http.XmlWriter.append(XmlWriter.java:73)
at io.milton.http.XmlWriter.writeElement(XmlWriter.java:421)
at io.milton.http.XmlWriter.writeElement(XmlWriter.java:127)
at io.milton.http.XmlWriter.writeProperty(XmlWriter.java:114)
at io.milton.http.values.ToStringValueWriter.writeValue(ToStringValueWriter.java:52)
at io.milton.http.values.ValueWriters.writeValue(ValueWriters.java:98)
at io.milton.http.webdav.PropFindXmlGeneratorHelper.sendProperties(PropFindXmlGeneratorHelper.java:124)
at io.milton.http.webdav.PropFindXmlGeneratorHelper.sendKnownProperties(PropFindXmlGeneratorHelper.java:114)
at io.milton.http.webdav.PropFindXmlGeneratorHelper.appendResponse(PropFindXmlGeneratorHelper.java:102)
at io.milton.http.webdav.PropFindXmlGeneratorHelper.appendResponses(PropFindXmlGeneratorHelper.java:91)
at io.milton.http.webdav.PropFindXmlGenerator.generate(PropFindXmlGenerator.java:60)
at io.milton.http.webdav.PropFindXmlGenerator.generate(PropFindXmlGenerator.java:69)
at io.milton.http.webdav.DefaultWebDavResponseHandler.respondPropFind(DefaultWebDavResponseHandler.java:202)
at io.milton.http.AbstractWrappingResponseHandler.respondPropFind(AbstractWrappingResponseHandler.java:200)
at org.dcache.webdav.AcceptAwareResponseHandler.respondPropFind(AcceptAwareResponseHandler.java:116)
at io.milton.http.AbstractWrappingResponseHandler.respondPropFind(AbstractWrappingResponseHandler.java:200)
at org.dcache.webdav.WorkaroundsResponseHandler.respondPropFind(WorkaroundsResponseHandler.java:138)
at io.milton.http.AbstractWrappingResponseHandler.respondPropFind(AbstractWrappingResponseHandler.java:200)
at io.milton.http.AbstractWrappingResponseHandler.respondPropFind(AbstractWrappingResponseHandler.java:200)
at io.milton.http.webdav.PropFindHandler.processExistingResource(PropFindHandler.java:156)
at io.milton.http.ResourceHandlerHelper.processResource(ResourceHandlerHelper.java:196)
at io.milton.http.webdav.PropFindHandler.processResource(PropFindHandler.java:112)
at io.milton.http.ResourceHandlerHelper.process(ResourceHandlerHelper.java:127)
at org.dcache.webdav.DcacheResourceHandlerHelper.process(DcacheResourceHandlerHelper.java:42)
at io.milton.http.webdav.PropFindHandler.process(PropFindHandler.java:106)
at org.dcache.webdav.DcacheStandardFilter.process(DcacheStandardFilter.java:50)
at io.milton.http.FilterChain.process(FilterChain.java:46)
at org.dcache.webdav.transfer.CopyFilter.process(CopyFilter.java:276)
at io.milton.http.FilterChain.process(FilterChain.java:46)
at io.milton.http.HttpManager.process(HttpManager.java:158)
at org.dcache.webdav.MiltonHandler.handle(MiltonHandler.java:77)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.dcache.http.AuthenticationHandler.access$001(AuthenticationHandler.java:55)
at org.dcache.http.AuthenticationHandler.lambda$handle$0(AuthenticationHandler.java:156)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
at org.dcache.http.AuthenticationHandler.handle(AuthenticationHandler.java:153)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.dcache.http.AbstractLoggingHandler.handle(AbstractLoggingHandler.java:110)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handle(Server.java:516)
at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:400)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:645)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:392)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:555)
at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:410)
at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:164)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.base/java.lang.Thread.run(Thread.java:829)
When I did an ls in an NFS-mounted /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751, it was like a DoS of our PnfsManager. The transfer success rate (for everyone, not only this user) dropped to 20%. After I restarted the PnfsManager things got back to normal. I don't know how NFS interacts with the PnfsManager, but I bet it's not through your algorithm 😉
I have asked the user to tell me how many files are supposed to be in this dir. Perhaps on the client side he might be able to see this more easily.
The client is Rclone. I think Rclone wants to get a dir listing because it only uploads files that don't exist yet in the destination dir, or have changed since last upload.
NFS bypasses PnfsManager altogether when running ls.
You can check the number of files by running:
select inlink from t_inodes where ipnfsid = '<pnfsid of that directory>';
So I think it is fair to conclude that you have a problem with a large directory that:
- is being listed and never completes
- impacts other listings due to the algorithm change. Again, the algorithm was changed to avoid situations where multiple listings of the same large directory hog all available threads, and it does work nicely, but we do not have > 200K entries per directory. I think I will just add a new variable to make this behavior optional. Sorry for not anticipating your use case.
P.S.: I was surprised that the listing is not just timing out on the door end. Maybe it hits OOM sooner!?
[root@db1 ~]# psql -U postgres chimera -c "select inlink from t_inodes where ipnfsid = '00005D8D033B902A42249F3A7510D1146E28'"
inlink
----------
13961036
(1 row)
Wow, almost 14 million... Even more than I thought.
Thanks for looking into this. In the short term it would be wonderful if the algorithm were configurable. For the future, I foresee more problems with such a huge and growing directory. We might need to set up a separate instance for them so they don't DoS other users, but that's a big operation that we can't do on short notice.
Returning from holiday, I became aware of this issue and ticket. To reinforce Onno's point, I'd like to share that our case (the LOFAR Long-Term Archive) is certainly affected by the new behaviour. LOFAR is significantly different from the FERMI case as I understand it.
For LOFAR
- Files are generally structured hierarchically in the archive with a few hundred to a few thousand files per directory
- We do not allow our users to browse the archive (instead, file locations are provided through our archive database), therefore multiple listings for the same directory are uncommon
- We do operate services that do many (>> 100) directory listings for indexing, verification, and consistency checking. As a result, up to one in every so many listings is now expected to time out with the new algorithm.
As this looks like spurious behaviour, it was not identified until now, although we have seen a slight rise in unexplained operations at SURF that take much longer than we are used to, or that fail. Looking into the logs for one of our services, it is nevertheless clear that, starting July 1st, the following error has been reported multiple times per day for listings at SURF: "List failed [rc=10006,msg=Timeout during directory listing.]"
From the thread I conclude that the issue is planned to be addressed by making the algorithm for directory listings configurable. For LOFAR this fix will be very important and much appreciated.
The problem will hit you when you have used up all available threads with the "old algorithm"; increasing the number of list threads will then lead to database thrashing, which leads to timeouts. But yes, we will make the listing behavior optional ASAP.
Pull request has been made.
@onnozweers, are there subdirectories under /pnfs/grid.sara.nl/data/swiftbackup/pbl/owncloud751?
Not directly related to the issue, but we seem to be seeing long-running listings by WebDAV clients. We suspect that PROPFIND calls without a Depth header result in a recursive listing.
This is because even a 15M-entry directory containing just files should have completed in less than 70 minutes.
P.S.: Never mind, it is not a recursive listing; rather, the door receives your 15M entries and starts to loop over them, sending a PROPFIND request for each of them. This is where all the time is spent. I just got to the bottom of it.
We saw the patch in 8.2.28. Thanks a lot! I really appreciate the effort you put into this. We'll update soon.
About the PROPFIND, not sure what you mean.
Do you mean that the door, while getting entries from the PnfsManager, also starts sending PROPFIND results? And that these slow down the listing operation in the PnfsManager?
Anyway, let me know if you need my help. I'll be out of office for a few weeks though.