
Deadlock when multiple clients are deleting the same directory structure

xhernandez opened this issue 2 years ago

Description of problem:

Under some circumstances, there's a deadlock in the locks xlator when there are multiple clients deleting the same directories concurrently.

Additional info:

To explain how the issue happens, here is first a simplified high-level overview of the operations performed during an rmdir:

  1. DHT takes a shared lock on the parent inode (inodelk) of the directory to remove, on the hashed subvolume (even though the directory exists on all subvolumes, the lock is only taken on the subvolume that the entry's name hashes to).
  2. DHT takes an exclusive lock on the entry to remove (entrylk), also on the hashed subvolume.
  3. DHT sends an rmdir to all subvolumes except the hashed one.
     3.1. EC takes an exclusive lock on the parent inode (inodelk) of the directory to remove (this lock is in a different namespace than DHT's, so they don't conflict).
     3.2. EC sends the rmdir.
     3.3. EC unlocks the parent.
  4. DHT sends an rmdir to the hashed subvolume.
     4.1. EC takes an exclusive lock on the parent inode (inodelk), as in 3.1.
     4.2. EC sends the rmdir.
     4.3. EC unlocks the parent.
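
For reference, the whole flow can be condensed into a few lines. This is only an illustrative sketch with made-up names and simplified types, not the real DHT/EC interfaces:

```c
/* Illustrative sketch of the rmdir flow above; all names and types are
 * simplified stand-ins, not the real DHT/EC interfaces. */
typedef struct { void *parent; const char *name; } loc_t;

void dht_inodelk_shared(int subvol, void *inode);
void dht_entrylk_excl(int subvol, void *parent, const char *name);
void ec_locked_rmdir(int subvol, loc_t *loc); /* EC inodelk + rmdir + unlock */

void rmdir_flow(loc_t *loc, int hashed, int nsubvols)
{
    dht_inodelk_shared(hashed, loc->parent);          /* step 1 */
    dht_entrylk_excl(hashed, loc->parent, loc->name); /* step 2 */

    for (int s = 0; s < nsubvols; s++)                /* step 3 */
        if (s != hashed)
            ec_locked_rmdir(s, loc);

    ec_locked_rmdir(hashed, loc);                     /* step 4 */
    /* ...followed by releasing the DHT locks. */
}
```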

It's also important to know which conditions need to be satisfied in the locks xlator to block an rmdir or an inodelk. After patch https://review.gluster.org/20025, an rmdir blocks if all of these conditions are met:

  1. There are granted inodelks (on the inode being removed, not its parent) from clients other than the one sending the rmdir.
  2. At least one of the granted inodelks doesn't come from an internal operation (internal operations use negative pids).

An inodelk blocks if all these conditions are met:

  1. The inodelk doesn't come from an internal operation.
  2. There's a pending remove operation on the same inode.
  3. None of the granted inodelks come from the same client sending the inodelk.

Even if those conditions are not met, an inodelk can still block if:

  1. There's a conflicting lock, even if it's not granted yet.
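
As a minimal sketch, the two predicates above could be encoded like this. The structures and field names are simplified stand-ins for what the locks xlator really keeps per inode, and the extra conflicting-lock check is omitted:

```c
#include <stdbool.h>

/* Simplified stand-ins for per-inode state in the locks xlator;
 * field names are illustrative, not the real data structures. */
typedef struct {
    void *client;   /* connection that owns the lock */
    int   pid;      /* negative pid => internal operation */
    bool  granted;
} lock_t;

typedef struct {
    lock_t *inodelks;   /* locks on this inode (not its parent) */
    int     n_inodelks;
    bool    pending_remove;
} inode_state_t;

/* An rmdir blocks when another client holds a granted inodelk on the
 * inode being removed and at least one granted inodelk isn't internal. */
static bool rmdir_must_block(inode_state_t *st, void *rmdir_client)
{
    bool other_client = false, non_internal = false;
    for (int i = 0; i < st->n_inodelks; i++) {
        if (!st->inodelks[i].granted)
            continue;
        if (st->inodelks[i].client != rmdir_client)
            other_client = true;
        if (st->inodelks[i].pid >= 0)
            non_internal = true;
    }
    return other_client && non_internal;
}

/* An inodelk blocks when it's external, a remove is pending, and the
 * requesting client holds no granted inodelk on the inode. */
static bool inodelk_must_block(inode_state_t *st, void *client, int pid)
{
    if (pid < 0 || !st->pending_remove)
        return false;
    for (int i = 0; i < st->n_inodelks; i++)
        if (st->inodelks[i].granted && st->inodelks[i].client == client)
            return false;
    return true;
}
```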

Based on this, suppose we have the following setup:

  • A distributed disperse 2 x (4 + 2) volume
  • The volume contains /B/D1 and /B/D2
  • B hashes to the second subvolume
  • D1 hashes to the first subvolume
  • D2 hashes to the second subvolume

With 3 clients C1, C2 and C3 doing these operations:

  • C1: rmdir(/B/D1)
  • C2: rmdir(/B/D2)
  • C3: rmdir(/B/D1); rmdir(/B/D2); rmdir(/B)

Now the following sequence of operations causes a deadlock:

    SubVol0                        SubVol1
----------------------------------------------------------------
/* C1 and C2 start their rmdirs. */
 1. C1.DHT.INODELK(/B)             C2.DHT.INODELK(/B)

/* C3 starts rmdir(/B/D1) */
 2. C3.DHT.INODELK(/B)                                           * This doesn't block because it's a shared lock
 3. C3.DHT.ENTRYLK(/B, D1)
 4. C3.EC.INODELK(/B)                                            * This doesn't block because it's in another namespace
 5. C3.EC.RMDIR(/B/D1)
 6. /* C3 releases all locks. */

/* C3 starts rmdir(/B/D2). */
 7.                                C3.DHT.INODELK(/B)            * This doesn't block because it's a shared lock
 8.                                C3.DHT.ENTRYLK(/B, D2)
 9.                                C3.EC.INODELK(/B)             * This doesn't block because it's in another namespace
10.                                C3.EC.RMDIR(/B/D2)            * Now /B is empty
11.                                /* C3 releases all locks. */

/* C1 and C2 make some progress. */
12. C1.DHT.ENTRYLK(/B, D1)         C2.DHT.ENTRYLK(/B, D2)

/* C3 starts rmdir(/B). */
13.                                C3.DHT.INODELK(/)
14.                                C3.DHT.ENTRYLK(/, B)
15. C3.EC.INODELK(/)
16. C3.EC.RMDIR(/B)                                              * This blocks because C1 holds an active inodelk on /B

/* C2 attempts rmdir on non-hashed subvol. */
17. C2.EC.INODELK(/B)                                            * This blocks because there's a pending rmdir and no granted inodelk for C2

/* C1 progresses with rmdir on non-hashed subvol. */
18.                                C1.EC.INODELK(/B)             * This doesn't block because it's in another namespace
19.                                C1.EC.RMDIR(/B/D1)            * No issue, the rmdir proceeds
20.                                C1.EC.UNLOCK(/B)

/* C1 attempts rmdir on hashed subvol. */
21. C1.EC.INODELK(/B)                                            * This blocks because it conflicts with an inodelk from C2

At this point all 3 clients are deadlocked:

  1. C1.DHT.INODELK(/B) --> Granted
  2. C1.DHT.ENTRYLK(/B, D1) --> Granted
  3. C3.EC.INODELK(/) --> Granted
  4. C3.EC.RMDIR(/B) --> Blocked by (1)
  5. C2.EC.INODELK(/B) --> Blocked by (4)
  6. C1.EC.INODELK(/B) --> Blocked by (5)

To release (1) and unblock (4), (6) would need to progress, but it can't because of (5), which in turn depends on (4).

xhernandez · Aug 01 '22 11:08

The analysis has been done using EC, but in theory the same problem exists with AFR.

The issue could be avoided by sending some DHT inodelks as internal operations, or with other small changes, but the problematic part is the way the locks xlator ensures that deletes are not executed concurrently with other operations. The way it's implemented causes locks from different namespaces to interact during delete operations, which is especially bad when there are nested locks, as in this case. Even though the patch introduced several checks to try to avoid deadlocks, it clearly didn't fix all cases, and there may be others still unknown. This means the method is intrinsically risky, and I'll try to change it.

My proposal to remove the risky interdependencies between lock namespaces consists of involving the client side in the logic (the previous change was transparent to clients) so that the control in the locks xlator can be simpler and safer.

Proposal

The idea is to allow client-side xlators to pass an additional sibling gfid in inodelk requests. Each sibling lock will be acquired with the same owner and modes as the main lock, so if the main lock is acquired in non-blocking mode, the sibling will also be acquired in non-blocking mode. If any of the locks fails or is removed before all siblings are granted, all of them will be released or removed from the blocked list. They operate as a single entity.

The main lock will always be the first one to be acquired. The sibling locks will be acquired in the order determined by their gfids, to avoid deadlocks with other operations (if there's more than one sibling).

Unlocks will always be sent only for the main lock; the sibling locks will be implicitly unlocked. When the locks xlator detects contention on a sibling, it will report the issue back to the client using the main lock, not the specific lock that contended.

This is aimed particularly at AFR and EC.

These siblings can be passed inside xdata, and should contain the gfid of the extra inode to lock.
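
A rough sketch of how the client side might attach the sibling gfid, assuming a hypothetical xdata key name (the real key and helper would be defined by the implementation):

```c
#include <errno.h>
#include <glusterfs/dict.h>
#include <glusterfs/mem-pool.h>
#include <glusterfs/mem-types.h>
#include <glusterfs/compat-uuid.h>

/* Hypothetical helper: attach one sibling gfid to the xdata of an
 * inodelk request. The key name is made up for this sketch. */
static int
attach_sibling_gfid(dict_t *xdata, uuid_t sibling)
{
    void *buf = GF_MALLOC(sizeof(uuid_t), gf_common_mt_char);
    if (!buf)
        return -ENOMEM;
    gf_uuid_copy(buf, sibling);
    /* dict_set_bin takes ownership of buf on success. */
    int ret = dict_set_bin(xdata, "glusterfs.inodelk.sibling-gfid", buf,
                           sizeof(uuid_t));
    if (ret < 0)
        GF_FREE(buf);
    return ret;
}
```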

For this particular case, rmdir, unlink and rename should add the gfid of the target entry. Since the entry to be removed will also be locked, any other concurrent operation trying to access that inode will correctly synchronize with the removal: either all bricks process the other operation and then remove the entry, or the entry is removed before the operation is attempted (in which case it will probably fail). Either way, there's no mix of successes and failures across bricks for either operation.

With this, the special handling for deletes in the locks xlator is no longer needed, though we should probably keep it for backward compatibility.

Analysis of some cases

  • GFID1 and GFID2 are acquired in non-blocking mode

    If GFID1 cannot be granted, EAGAIN is returned. Otherwise it's granted and GFID2 is checked. If GFID2 cannot be granted, GFID1 is released and EAGAIN is returned. Otherwise both are granted and the operation finishes successfully. (Both this case and the next are sketched in code after this list.)

  • GFID1 and GFID2 are acquired in blocking mode

    If GFID1 cannot be granted immediately, it's added to the blocked queue and nothing is attempted on GFID2. Once GFID1 is granted, GFID2 is attempted; if it cannot be granted, it's added to the blocked queue in turn. Once both are granted, the operation finishes and the reply is sent to the client.

  • GFID1 is unlocked after a successful inodelk

    The locks xlator knows that GFID1 has GFID2 as a sibling and unlocks both.

  • Locks xlator detects contention on GFID1

    As usual, a notification is sent to the client for GFID1.

  • Locks xlator detects contention on GFID2

    The xlator knows that GFID2 is a sibling of GFID1 and sends the notification to the client for GFID1.
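
To make the first two cases concrete, here's a minimal sketch of the grant logic treating the main lock and one sibling as a single entity (all names are illustrative, not the real locks xlator code):

```c
#include <errno.h>
#include <stdbool.h>

typedef struct lock_req lock_req_t;   /* stand-in for a lock request */
bool try_grant(lock_req_t *lk);       /* true if granted immediately */
void release(lock_req_t *lk);
void enqueue_blocked(lock_req_t *lk); /* retried when conflicts clear */

/* Non-blocking mode: grant both or neither. */
int inodelk_nonblocking(lock_req_t *main_lk, lock_req_t *sibling)
{
    if (!try_grant(main_lk))
        return -EAGAIN;
    if (sibling && !try_grant(sibling)) {
        release(main_lk);             /* roll back the partial grant */
        return -EAGAIN;
    }
    return 0;                         /* both granted; reply success */
}

/* Blocking mode: queue the main lock first; the sibling is only
 * attempted (and queued in turn if needed) once the main one is
 * granted. The reply is sent only when the whole set is granted. */
void inodelk_blocking(lock_req_t *main_lk, lock_req_t *sibling)
{
    if (!try_grant(main_lk)) {
        enqueue_blocked(main_lk);     /* nothing attempted on sibling */
        return;
    }
    if (sibling && !try_grant(sibling))
        enqueue_blocked(sibling);
    /* else: both granted, reply to the client */
}
```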

Backward compatibility

  • Old client, new server

    In this case the client won't add the sibling information, so the locks xlator will behave as it currently does.

  • New client, old server

    In this case the client will add the sibling information but the locks xlator will ignore it. No extra locks will be taken, which matches the current behavior.

  • New client, new server

    The siblings will be sent by the client and processed by the server. When a delete operation arrives, it will see that there's a granted lock from the same client on the inode to be deleted, which will allow the delete to proceed without blocking.

Given that this works individually on a brick level, I don't foresee any issue with mixed old and new clients, or mixed old and new servers.

@pranithk @mohit84 I would like to know your point of view for this approach.

xhernandez · Aug 01 '22 12:08

Thanks Xavi for clearing up the doubts. I think you can go with this approach.

mohit84 · Aug 02 '22 09:08

Note about the order in which locks are acquired:

The main GFID is always acquired first to make it backward compatible with other bricks/clients that may not be using the new version yet. This theoretically means that one client could try to acquire GFID1 as the main lock and GFID2 as the sibling one, while another client uses GFID2 as the main one and GFID1 as the sibling.

This can easily cause a deadlock, but this case shouldn't happen because all clients should agree on what's a main lock and what's a sibling lock. The sibling lock should be an extra lock for increased consistency, but the operation should still work without it. A main lock is always required to prevent corruption.

In any case, we may decide to order all locks by gfid, including the main one, to avoid this deadlock possibility, but that would break backward compatibility and require a new op-version to activate the feature. I would prefer to keep it backward compatible for now, and maybe address this case in the future if the feature is used in more places to reduce the number of lock requests per fop.
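
For completeness, the total order mentioned here is trivial to define, since gfids are just 16-byte values; a sketch:

```c
#include <stdlib.h>
#include <string.h>

/* A gfid is a 16-byte uuid; byte-wise comparison gives a total order
 * that every client computes identically. */
static int gfid_cmp(const void *a, const void *b)
{
    return memcmp(a, b, 16);
}

/* Sort the gfids to lock and acquire them in that order. With the
 * current proposal this would only apply to the siblings (when there
 * is more than one), after the main lock. */
void sort_lock_targets(unsigned char (*gfids)[16], size_t n)
{
    qsort(gfids, n, sizeof(*gfids), gfid_cmp);
}
```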

xhernandez · Aug 02 '22 10:08

I've identified a problem when eager-locking is enabled (only in EC; if I'm not wrong, AFR doesn't use eager-locking for non-regular files).

The issue is that if the lock on the parent inode has already been acquired before the rmdir arrives, it won't include the additional sibling lock, so the sibling will need to be sent as a regular inodelk (an additional request, which could cause some performance regression for rmdir in some cases). On the other hand, if the inodelk on the parent needs to be reused after the rmdir, the unlock won't be sent and the sibling lock on the removed entry may linger for some time. That's not necessarily bad, but it's not good either...

I'll try to find a better solution.

xhernandez · Aug 02 '22 16:08

After some thinking I've realized that the best way to fix this issue is actually to remove the dependencies between deletes and other operations on the inode being deleted.

A delete is basically an operation on the parent directory, so taking a lock on the parent should be enough. The problem is that Gluster physically deletes files when delete operations are processed, which means any other operation running on the inode at the same time may fail on some bricks and succeed on others, potentially causing issues.

But if we just remove the deleted entry from the parent directory when a delete is processed and keep the inode around (i.e. the gfid and maybe some additional information) until the last reference to the inode is released, the other operation will be able to succeed on all bricks without causing issues.

When the last reference to the inode is released, which means that there can't be any operation using the inode, the posix xlator will physically delete it if its number of hardlinks is 0.

This is the way kernel filesystems work. It will also help improve the implementation of O_PATH (#2717) and make Gluster more consistent and predictable in some test cases involving continuous creation and deletion (or rename) of the same entry, which currently cause some issues.
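
The kernel behavior referred to here can be seen with plain POSIX calls: after unlink() the name disappears, but the inode stays alive until the last open descriptor is closed. The proposal gives deleted Gluster inodes the same lifetime rule:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("victim", O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0)
        return 1;
    (void)write(fd, "still here", 10);

    unlink("victim");            /* the entry is gone from the directory */

    char buf[16] = {0};
    (void)pread(fd, buf, 10, 0); /* but the inode is still fully usable */
    printf("read after unlink: %s\n", buf);

    close(fd);                   /* last reference dropped: data is freed */
    return 0;
}
```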

This seems to me the cleanest way to solve the issue, and it will have many other benefits beyond this particular problem.

xhernandez · Aug 03 '22 06:08

I agree that, with this model, many other reference-related issues would also be handled well. We could confidently work with the gfid path for inode references and the path for entry references, without any concern about other operations removing the entry itself. The posix xlator would just need to get better at stale gfid file cleanup.

amarts · Aug 03 '22 06:08

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] · Mar 19 '23 22:03

This issue will be handled in #3812.

xhernandez · Mar 20 '23 11:03

I have a quick question regarding the reproduction steps.

Based on this suppose we have the following setup:

  • A distributed disperse 2 x (4 + 2) volume
  • The volume contains /B/D1 and /B/D2
  • B hashes to the second subvolume
  • D1 hashes to the first subvolume
  • D2 hashes to the second subvolume

In the above setup, if I understand correctly, B, D1, and D2 are all directories, and each of them hashes to only one of the subvolumes in a 2 x (4 + 2) setup. But wouldn't all the directories exist on all of the subvolumes?

ingerido · Dec 30 '23 04:12