glusterfs
Stale locks appear after a successful interruption of a blocked posix lock
Description of problem:
When a process blocked waiting for a posix lock is interrupted by a signal, the request is cancelled with EINTR, but internally AFR keeps sending LK requests to the other bricks. Any locks granted this way are not owned by anyone and, in most cases, will never be released.
The exact command to reproduce the issue:
Using the program provided here, the issue can be reproduced by running these tests:
test_wrlock(0, 1); /* this lock is granted. */
test_wrlock(1, 1); /* this lock is blocked. */
test_interrupt(1, 1); /* this should cancel previous lock. */
test_unlock(0, 1); /* first lock released. */
test_wrlock(0, 1); /* this should succeed. */
test_unlock(0, 1);
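For readers without the test program at hand, the sequence above can be approximated with plain fcntl() record locks. This is only a minimal sketch, not the program referenced in this issue: the function name check_interrupted_lock, the use of SIGUSR1, and the 50 ms retry interval are all assumptions made for illustration. The parent plays owner 0, a forked child plays owner 1, and the final re-lock uses the non-blocking F_SETLK so that a stale lock would show up as an error instead of a hang.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Empty handler: its only job is to make a blocking fcntl() return EINTR. */
static void on_signal(int sig) { (void)sig; }

/* Apply a whole-file lock operation (type is F_WRLCK or F_UNLCK). */
static int set_lock(int fd, int cmd, short type)
{
    struct flock fl;

    assert(fd >= 0);
    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET; /* l_start = 0 and l_len = 0: the whole file */
    return fcntl(fd, cmd, &fl);
}

/* Hypothetical helper: returns 0 if the blocked lock was interrupted with
 * EINTR and the file could be locked again afterwards, -1 otherwise. */
int check_interrupted_lock(const char *path)
{
    struct sigaction sa;
    struct timespec delay = { 0, 50 * 1000 * 1000 }; /* 50 ms */
    int fd, status = 0, rc = -1;
    pid_t pid, done;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal; /* sa_flags stays 0, so no SA_RESTART */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (set_lock(fd, F_SETLKW, F_WRLCK) != 0) /* owner 0: granted */
        goto out;

    pid = fork();
    if (pid < 0)
        goto out;
    if (pid == 0) {
        /* Owner 1: record locks are per process, so this request blocks. */
        int cfd = open(path, O_RDWR);
        int r = (cfd >= 0) ? set_lock(cfd, F_SETLKW, F_WRLCK) : 0;
        _exit((r == -1 && errno == EINTR) ? 0 : 1);
    }

    /* Keep signalling until the child exits, so that a signal delivered
     * before the child is actually blocked cannot make the demo hang. */
    do {
        nanosleep(&delay, NULL);
        kill(pid, SIGUSR1);
        done = waitpid(pid, &status, WNOHANG);
    } while (done == 0);
    if (done != pid || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        goto out;

    if (set_lock(fd, F_SETLK, F_UNLCK) != 0) /* owner 0: release */
        goto out;
    rc = set_lock(fd, F_SETLK, F_WRLCK);     /* owner 0: should succeed */
out:
    close(fd);
    return rc;
}
```

On a local filesystem, calling check_interrupted_lock() with a writable path should return 0. On an affected mount, the last step is where a stale lock would presumably surface.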
The full output of the command that failed:
# gluster volume create test replica 3 server:/bricks/test_{1..3}
# gluster volume start test
# mount -t glusterfs server:/test /mnt/test
# touch /mnt/test/file
# ./test /mnt/test/file
0: Locking
0: Locked
1: Locking
1: Received signal 18
1: fcntl() failed: (4) Interrupted system call
0: Unlocking
0: Unlocked
0: Locking
<hang>
Expected results:
It shouldn't hang.
Additional info:
The issue happens because AFR takes posix locks sequentially, one brick at a time, and only checks for errors after the LK fop has been sent to all bricks. In the case of interrupts, the LK request is unwound by FUSE as soon as the interrupt request succeeds, so AFR shouldn't continue sending the remaining LK requests in this case.
However, the way the locks xlator is implemented makes it difficult to "undo" the posix locks already acquired if an interrupt arrives in the middle of the acquisition.
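The failure mode can be illustrated with a toy model. The sketch below is not GlusterFS code: struct brick, serial_lock and the single-owner lock state are hypothetical simplifications, and the rollback it performs ignores exactly the part described above as difficult in the real locks xlator. It only shows how a serial loop that does not check for an interrupt leaves locks behind after the caller has already been told EINTR.

```c
#include <assert.h>
#include <stdbool.h>

#define BRICKS 3
#define NO_OWNER (-1)

/* Toy brick: remembers only who holds its single lock. */
struct brick {
    int owner;
};

/* Grant the lock if the brick is free or already held by this owner. */
static bool brick_lock(struct brick *b, int owner)
{
    if (b->owner != NO_OWNER && b->owner != owner)
        return false;
    b->owner = owner;
    return true;
}

/* Release the lock only if this owner holds it. */
static void brick_unlock(struct brick *b, int owner)
{
    if (b->owner == owner)
        b->owner = NO_OWNER;
}

/*
 * Take the lock brick by brick. The interrupt is modelled as arriving just
 * before the request to brick `interrupt_at` is sent (negative: never).
 * Returns the number of stale locks, that is, locks still held for `owner`
 * after the caller has already received EINTR.
 */
int serial_lock(struct brick bricks[BRICKS], int owner, int interrupt_at,
                bool check_interrupt)
{
    bool interrupted = false;
    int held = 0;

    assert(interrupt_at < BRICKS);
    for (int i = 0; i < BRICKS; i++) {
        if (i == interrupt_at) {
            interrupted = true; /* the request has been unwound */
            if (check_interrupt) {
                /* Stop here and roll back what was acquired so far. */
                for (int j = 0; j < i; j++)
                    brick_unlock(&bricks[j], owner);
                return 0;
            }
        }
        if (brick_lock(&bricks[i], owner))
            held++;
    }
    return interrupted ? held : 0;
}
```

With three free bricks and an interrupt before the second request, the unchecked loop ends with all three bricks locked for an owner that is no longer waiting, while the checked variant leaves every brick free.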
Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.
Hi @xhernandez, I am interested in this issue, but I got the following result after running the program. Is the C program tied to a specific OS or environment?
0: Locking
0: Locked
1: Locking
1: Received signal 18
1: fcntl() failed: (11) Resource temporarily unavailable
0: Unlocking
0: Unlocked
2: Locking
2: Locked
1: Locking
It's a very generic program that should work on (almost?) all Linux-based operating systems.