XIA-for-Linux icon indicating copy to clipboard operation
XIA-for-Linux copied to clipboard

xia/lpm: remove routing/forwarding deadlock

Open cjdoucette opened this issue 8 years ago • 4 comments

Due to the way that locking was written into this toy LPM implementation, the read-only code in lpm_deliver() obtains a write lock on the tree FIB. The effect of this is the unnatural combination of first obtaining an RCU read lock and then a FIB write lock in lpm_deliver(). This conflicts with functions like local_newroute(), which first obtain a FIB write lock and then wait for RCU readers to finish before flushing anchors. The problem is that deadlock can occur with the following interleaving:

thread1                                 thread2
=======                                 =======
local_newroute(): get write lock
                                        lpm_deliver(): get RCU read lock
local_newroute(): wait for RCU readers
                                        lpm_deliver(): wait for write lock

Notice that this deadlock cannot occur if lpm_deliver() gets the write lock first.

To fix this, we can duplicate the FIB entry whose anchor needs to be flushed, replace the old entry with the duplicate, and then release the lock. This allows writers of the tree, including the code in lpm_deliver(), to proceed. We then wait an RCU synchronization to flush the old entry's anchor, since routing mechanism code (not involved with the LPM tree) may still be reading that entry. Once all these readers are done, the old entry can be reclaimed.

This removes the deadlock, but future iterations of the LPM principal should use RCU instead of rwlocks to avoid this unnatural locking.

cjdoucette avatar May 16 '16 04:05 cjdoucette

Hi Cody,

I found a rare bug that can occur at local_newroute() or main_newroute().

Consider the following execution:

  1. tree_fib_newroute_lock() returns successfully.
  2. newroute_flush_anchor_locked() fails.
  3. fxid_free_norcu() is going to release an item that is live in the tree due to step 1.

AltraMayor avatar Sep 06 '16 17:09 AltraMayor

Thanks, Michel. I just rebased so that the patch has newroute_flush_anchor_locked() remove the entry if it's going to fail.

cjdoucette avatar Sep 06 '16 20:09 cjdoucette

The pull request is ready for review again. I separated the other bug found into a separate patch, since it is a distinct change that affected multiple files.

cjdoucette avatar Sep 13 '16 05:09 cjdoucette

Patch xia/lpm: remove routing/forwarding deadlock is ready for merge.

AltraMayor avatar Nov 25 '16 16:11 AltraMayor