
Infinite loop in dht when lookup fails with ENODATA

Open itisravi opened this issue 2 years ago • 0 comments

Description of problem: When two clients simultaneously create and unlink the same file in a loop (stress testing), the client doing the unlink hung and did not respond to CTRL-C. On examination, when dht_lookup_cbk() failed with ENODATA (because the other client had created the file but not yet set the gfid), it triggered a recursive loop of lookups on the client doing the unlink:

dht_lookup_cbk  ───────────► dht_lookup_directory ──►dht_lookup_dir_cbk───►dht_lookup_everywhere ───► dht_lookup_everywhere_done
                                    ▲                                                                         │
                                    │                                                                         │
                                    │                                                                         │
                                    │                                                                         │
                                    │                                                                         │
                                    └─────────────────────────────────────────────────────────────────────────▼

The exact command to reproduce the issue:

  1. mount -t glusterfs IP:volname /mnt/1
  2. mount -t glusterfs IP:volname /mnt/2
  3. On /mnt/1: while true; do touch f1; done
  4. On /mnt/2: while true; do rm -f f1; done
  5. Hit CTRL-C on both mounts. The loop on /mnt/1 returns, while the one on /mnt/2 hangs because it is stuck in an infinite lookup loop.
Terminal 2:
[root@host ~]# cd /mnt/2/
[root@host 2]# while true; do rm -f f1; done
^C  <---------Hung

Terminal 1:
[root@host ~]# cd /mnt/1/
[root@host 1]# while true; do touch f1; done
touch: setting times of ‘f1’: Stale file handle
touch: setting times of ‘f1’: Stale file handle
touch: setting times of ‘f1’: Stale file handle
^C <--------Not hung, it exits.
[root@host 1]#
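
The manual steps above can also be run as a bounded script, so that a hang shows up as a reported failure rather than a stuck terminal. The following is only a sketch: the function names, the argument order, and the marker file are assumptions made for illustration and are not part of the original report. It assumes the volume is already mounted twice, as in steps 1 and 2.

```shell
#!/usr/bin/env bash
# Sketch of a bounded reproducer. Function names, arguments, and the marker
# file are hypothetical; they do not come from the issue.

# Create f1 in the given directory repeatedly for a fixed number of seconds.
create_loop() {
    local dir="$1"
    local end=$(( $(date +%s) + $2 ))
    while [ "$(date +%s)" -lt "$end" ]; do
        touch "$dir/f1" 2>/dev/null || true
    done
}

# Unlink f1 in the given directory repeatedly for a fixed number of seconds.
unlink_loop() {
    local dir="$1"
    local end=$(( $(date +%s) + $2 ))
    while [ "$(date +%s)" -lt "$end" ]; do
        rm -f "$dir/f1"
    done
}

# Run both loops concurrently, then report whether the unlink loop finished.
# Arguments: create-side dir, unlink-side dir, run time (s), grace period (s).
run_repro() {
    local mnt1="$1"
    local mnt2="$2"
    local secs="$3"
    local grace="$4"
    local marker="${TMPDIR:-/tmp}/unlink_done.$$"
    local rc=0
    rm -f "$marker"

    create_loop "$mnt1" "$secs" &
    local creator=$!
    # The marker is written only if the unlink loop returns on its own.
    ( unlink_loop "$mnt2" "$secs"; : > "$marker" ) &
    local unlinker=$!

    sleep "$(( secs + grace ))"
    if [ -e "$marker" ]; then
        echo "unlink loop exited normally"
    else
        echo "unlink loop did not finish within ${grace}s of its deadline: possible hang"
        rc=1
    fi

    # Best-effort cleanup; a process stuck inside the mount may not go away.
    kill "$creator" "$unlinker" 2>/dev/null || true
    rm -f "$marker"
    return "$rc"
}
```

With the mounts from the steps above, an invocation might look like `run_repro /mnt/1 /mnt/2 30 10`. A non-zero return only suggests that the unlink side is stuck; confirming the lookup loop itself still requires inspecting the client process.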

- The output of the gluster volume info command:

  Volume Name: distvol
  Type: Distribute
  Volume ID: c2f2b9a6-ab33-4344-ae17-fe7c8c8288a0
  Status: Started
  Snapshot Count: 0
  Number of Bricks: 5
  Transport-type: tcp
  Bricks:
  Brick1: IP1:/brickl
  Brick2: IP2:/brick2
  Brick3: IP3:/brick3
  Brick4: IP4:/brick4
  Brick5: IP5:/brick5
  Options Reconfigured:
  cluster.lookup-optimize: on
  diagnostics.client-log-level: INFO
  features.read-only: off
  diagnostics.count-fop-hits: on
  diagnostics.latency-measurement: on
  storage.reserve: 42949672960
  config.client-threads: 4
  network.inode-lru-limit: 90000
  features.ctime: off
  auth.allow: *
  diagnostics.client-sys-log-level: WARNING
  diagnostics.brick-sys-log-level: WARNING
  storage.fips-mode-rchecksum: on
  transport.address-family: inet
  nfs.disable: on

itisravi, Aug 06 '22 04:08