dcache NFS: ls on directory hangs, dCache 7.2

dCache 7.2 client : Scientific Linux release 7.9 (Nitrogen)

[litvinse@mu2egpvm01 ~]$  ls /pnfs/mu2e/persistent/users/mu2epro/valjob/2022/02/09/summary/
^C

Hangs.

ls on parent works:

[litvinse@mu2egpvm01 ~]$ time ls  /pnfs/mu2e/persistent/users/mu2epro/valjob/2022/02/09/
ceMix  ceSimReco  cosmicSimReco  pileup  potSim  reco  stops  summary

real	0m0.167s
user	0m0.002s
sys	0m0.005s
[litvinse@mu2egpvm01 ~]$

Anything short of rebooting client machine ? (killing client from admin shell does not help)

Enabling NFS debug does not give any useful info (to me):

Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16
Feb  9 10:58:24 mu2egpvm01 kernel: NFS: nfs_pgio_result: 30950, (status -10025)
Feb  9 10:58:24 mu2egpvm01 kernel: <-- nfs4_alloc_slot used_slots=ffff highest_used=15 slotid=4294967295
Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16
Feb  9 10:58:24 mu2egpvm01 kernel: NFS: nfs_pgio_result: 39361, (status -10025)
Feb  9 10:58:24 mu2egpvm01 kernel: <-- nfs4_alloc_slot used_slots=ffff highest_used=15 slotid=4294967295
Feb  9 10:58:24 mu2egpvm01 kernel: nfs41_sequence_process: Error 0 free the slot 
Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16
Feb  9 10:58:24 mu2egpvm01 kernel: <-- nfs4_alloc_slot used_slots=ffff highest_used=15 slotid=4294967295
Feb  9 10:58:24 mu2egpvm01 kernel: nfs41_sequence_process: Error 0 free the slot 
Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16
Feb  9 10:58:24 mu2egpvm01 kernel: <-- nfs4_alloc_slot used_slots=ffff highest_used=15 slotid=4294967295
Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16
Feb  9 10:58:24 mu2egpvm01 kernel: NFS: nfs_pgio_result: 42675, (status -10025)
Feb  9 10:58:24 mu2egpvm01 kernel: <-- nfs4_alloc_slot used_slots=ffff highest_used=15 slotid=4294967295
Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16
Feb  9 10:58:24 mu2egpvm01 kernel: NFS: nfs_pgio_result: 41715, (status -10025)
Feb  9 10:58:24 mu2egpvm01 kernel: <-- nfs4_alloc_slot used_slots=ffff highest_used=15 slotid=4294967295
Feb  9 10:58:24 mu2egpvm01 kernel: nfs41_sequence_process: Error 0 free the slot 
Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16
Feb  9 10:58:24 mu2egpvm01 kernel: <-- nfs4_alloc_slot used_slots=ffff highest_used=15 slotid=4294967295
Feb  9 10:58:24 mu2egpvm01 kernel: nfs41_sequence_process: Error 0 free the slot 
Feb  9 10:58:24 mu2egpvm01 kernel: --> nfs4_alloc_slot used_slots=ffff highest_used=15 max_slots=16

Feb 09 '22 17:02 DmitryLitvintsev

Capturing the NFS traffic between the client and server (using a tool like tcpdump) is a good bet. This might (or might not) provide some additional insight into what's going wrong.

You could also enable debug logging on the org.dcache.nfs.v4.OperationREADDIR logger in the NFS door. The admin command should be:

log set stdout org.dcache.nfs.v4.OperationREADDIR DEBUG

Feb 09 '22 19:02 paulmillar

nfs.dump.gz

Here we go, dcpdump attached

 tcpdump -i eth0 -w /tmp/nfs.dump host mu2egpvm01

Feb 09 '22 19:02 DmitryLitvintsev

OK, quick summary: the client seems to be scanning through the directory response and requesting file attributes (GETATTR) for the contents of the directory. AFAIK, this is normal behaviour for an NFS client.

Each request is a compound request but that, too, is normal (AFAIK).

The responses seems OK (to my eyes), but the client keeps asking about just two directory items, cycling through them. The client asks about FileHandle-A, then FileHandle-B, then FileHandle-A, then FileHandle-B, and so on.

Ignoring the possibility of this directory having a large number of hard-links of the same two files (very unlikely), it looks like either the client is caught in a loop or dCache's directly listing went mad.

If dCache's directory listing was at fault (caught in a loop) then why would that loop terminate, unless the client has some internal limit after which it just ignores further directory items.

I think we need @kofemann's expertise to go any further here.

Feb 09 '22 20:02 paulmillar