
Stale locks appear after a successful interruption of a blocked posix lock

Open xhernandez opened this issue 3 years ago • 4 comments

Description of problem:

When a process blocked waiting for a POSIX lock is interrupted by a signal, the request is cancelled with EINTR, but internally AFR keeps sending LK requests to the other bricks. This can leave locks behind that are not owned by anyone and that, in most cases, will never be released.

The exact command to reproduce the issue:

Using the program provided here, this issue can be seen running these tests:

    test_wrlock(0, 1); /* this lock is granted. */
    test_wrlock(1, 1); /* this lock is blocked. */
    test_interrupt(1, 1); /* this should cancel previous lock. */
    test_unlock(0, 1); /* first lock released. */
    test_wrlock(0, 1); /* this should succeed. */
    test_unlock(0, 1);

The full output of the command that failed:

    # gluster volume create test replica 3 server:/bricks/test_{1..3}
    # gluster volume start test
    # mount -t glusterfs server:/test /mnt/test
    # touch /mnt/test/file
    # ./test /mnt/test/file
      0: Locking
      0: Locked
      1: Locking
      1: Received signal 18
      1: fcntl() failed: (4) Interrupted system call
      0: Unlocking
      0: Unlocked
      0: Locking
    <hang>

Expected results:

It shouldn't hang.

Additional info:

The issue happens because AFR takes posix locks in a sequential way, and only checks errors after the LK fop has been sent to all bricks. In the case of interrupts, the LK request is unwound by FUSE as soon as the interrupt request succeeds, so AFR shouldn't continue processing them in this case.

However, the way the locks xlator is implemented makes it difficult to "undo" the posix locks that have already been acquired when an interrupt arrives in the middle of acquisition.
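
The sequencing problem described above can be illustrated with a deliberately simplified toy model. This is not AFR or locks xlator code and uses no GlusterFS APIs: `toy_brick`, `toy_sequential_lock` and `NO_OWNER` are hypothetical names, bricks are reduced to a single "holder" field, and blocking is not modelled. It only shows the difference between continuing the loop after an interrupt and stopping at that point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_OWNER (-1)

/* Hypothetical stand-in for one brick: records which owner holds the lock. */
struct toy_brick {
    int holder;
};

/*
 * Send a lock request for `owner` to each brick in order.
 *
 * `interrupt_at` is the index of the brick that is about to be contacted
 * when the caller gets interrupted (use a value >= n for "never").
 * With `stop_on_interrupt` false the loop keeps going, as in the behaviour
 * described in this issue; with true it stops at the interrupt.
 *
 * Returns how many locks were granted after the interrupt, i.e. locks that
 * no caller is waiting for any more.
 */
size_t toy_sequential_lock(struct toy_brick *bricks, size_t n, int owner,
                           size_t interrupt_at, bool stop_on_interrupt)
{
    assert(bricks != NULL);
    size_t granted_after_interrupt = 0;

    for (size_t i = 0; i < n; i++) {
        bool interrupted = (i >= interrupt_at);

        if (interrupted && stop_on_interrupt)
            break;                      /* do not contact further bricks */

        if (bricks[i].holder == NO_OWNER) {
            bricks[i].holder = owner;   /* lock granted on this brick */
            if (interrupted)
                granted_after_interrupt++;
        }
    }
    return granted_after_interrupt;
}
```

With three free bricks and an interrupt arriving before the second one is contacted, the non-stopping variant leaves two locks granted after the interrupt, while the stopping variant leaves none. Note that even in the stopping variant the lock already taken on the first brick remains, which is the "undo" problem mentioned above and is not addressed by this sketch.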

xhernandez avatar Jan 27 '22 15:01 xhernandez

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] avatar Sep 21 '22 00:09 stale[bot]

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] avatar May 21 '23 16:05 stale[bot]

Hi @xhernandez, I am interested in this issue, but I got the following result after running the program. Does that C program depend on a particular OS or environment?

  0: Locking
  0: Locked
  1: Locking
  1: Received signal 18
  1: fcntl() failed: (11) Resource temporarily unavailable
  0: Unlocking
  0: Unlocked
  2: Locking
  2: Locked
  1: Locking

chen1195585098 avatar Jan 08 '24 09:01 chen1195585098

> Hi @xhernandez, I am interested in this issue, but I got the following result after running the program. Does that C program depend on a particular OS or environment?

It's a very generic program that should work on (almost?) all Linux-based operating systems.

xhernandez avatar Jan 16 '24 17:01 xhernandez