rdma-core

rdmacm: Fix race condition between destroy_qp and get_cm_event

Open FujiZ opened this issue 2 years ago • 7 comments

When rdma_destroy_qp() and rdma_get_cm_event() are invoked concurrently from different threads, it is possible for ibv_modify_qp() to be invoked on an already-destroyed QP inside rdma_get_cm_event(), which violates the thread-safety guarantee of rdmacm.

One possible workaround for the user is to protect these two functions with a single mutex, but this introduces an event_channel-level lock shared among all the cm_ids associated with the channel, leading to high contention on that lock. So it is better to resolve this problem inside rdmacm.

We take advantage of cma_id_private.mut and use it to synchronize access to the QP inside these two functions.
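
For illustration, a minimal sketch of the locking idea only, not the actual patch: take the per-id mutex in rdma_destroy_qp() and around the QP transition done from the event path, tolerating a NULL QP there. struct cma_id_private and its mut field exist in librdmacm's cma.c, but the layout shown here and the helper name ucma_modify_qp_locked() are simplified stand-ins.

```c
#include <pthread.h>
#include <rdma/rdma_cma.h>

struct cma_id_private {
	struct rdma_cm_id id;      /* public handle is the first member */
	pthread_mutex_t   mut;     /* already used for event accounting */
	/* ... */
};

void rdma_destroy_qp(struct rdma_cm_id *id)
{
	struct cma_id_private *id_priv = (struct cma_id_private *)id;

	pthread_mutex_lock(&id_priv->mut);
	ibv_destroy_qp(id->qp);
	id->qp = NULL;
	pthread_mutex_unlock(&id_priv->mut);
}

/* Called from the rdma_get_cm_event() path when a connection event
 * requires a QP state transition. */
static int ucma_modify_qp_locked(struct cma_id_private *id_priv,
				 struct ibv_qp_attr *attr, int mask)
{
	int ret = 0;

	pthread_mutex_lock(&id_priv->mut);
	if (id_priv->id.qp)        /* QP may already have been destroyed */
		ret = ibv_modify_qp(id_priv->id.qp, attr, mask);
	pthread_mutex_unlock(&id_priv->mut);
	return ret;
}
```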

Signed-off-by: Huaping Zhou [email protected]

FujiZ · Jun 03 '22 11:06

AFAIK you can't multithread rdma_get_cm_event() with other CM operations; there is zero locking in this function, so almost everything it does races against another CM operation of some kind. Apps certainly can't perform rdma_destroy_id() concurrently with get_event, so I don't see why destroy_qp should be any different?

jgunthorpe · Jun 07 '22 17:06

> AFAIK you can't multithread rdma_get_cm_event() with other CM operations; there is zero locking in this function, so almost everything it does races against another CM operation of some kind.

I did some searching in the OpenFabrics mailing list and found this message, which claims that librdmacm is thread safe and that the only exception is rdma_migrate_id(): https://lists.openfabrics.org/pipermail/general/2008-August/053908.html

> Apps certainly can't perform rdma_destroy_id() concurrently with get_event, so I don't see why destroy_qp should be any different?

As noted in man 3 rdma_get_cm_event, "Destruction of an rdma_cm_id will block until related events have been acknowledged." If concurrency between rdma_destroy_id() and rdma_get_cm_event() were not allowed, there would be no point in providing this guarantee, since it would cause nothing but a deadlock if all CM operations had to be serialized.
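
For what it's worth, a minimal two-thread sketch of the usage pattern that guarantee seems designed for: thread A drains the channel and acknowledges events, thread B destroys an id, and rdma_destroy_id() blocks until any event already reported for that id has been acked by thread A.

```c
#include <rdma/rdma_cma.h>

static void *event_loop(void *arg)                  /* thread A */
{
	struct rdma_event_channel *channel = arg;
	struct rdma_cm_event *event;

	while (rdma_get_cm_event(channel, &event) == 0) {
		/* ... handle event->event for event->id ... */
		rdma_ack_cm_event(event);  /* unblocks a pending destroy */
	}
	return NULL;
}

static void close_connection(struct rdma_cm_id *id) /* thread B */
{
	/* Blocks until every event already reported for this id has
	 * been acknowledged by the event loop above. */
	rdma_destroy_id(id);
}
```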

FujiZ · Jun 08 '22 03:06

> Apps certainly can't perform rdma_destroy_id() concurrently with get_event, so I don't see why destroy_qp should be any different?

> As noted in man 3 rdma_get_cm_event, "Destruction of an rdma_cm_id will block until related events have been acknowledged." If concurrency between rdma_destroy_id() and rdma_get_cm_event() were not allowed, there would be no point in providing this guarantee, since it would cause nothing but a deadlock if all CM operations had to be serialized.

If I read the kernel code correctly, rdma_destroy_id() distinguishes between reported and unreported events: https://elixir.bootlin.com/linux/latest/source/drivers/infiniband/core/ucma.c#L617 https://elixir.bootlin.com/linux/latest/source/drivers/infiniband/core/ucma.c#L566 https://elixir.bootlin.com/linux/latest/source/drivers/infiniband/core/ucma.c#L522

A reported event is nothing more than one returned from ucma_get_event(), which is the kernel part of rdma_get_cm_event(): https://elixir.bootlin.com/linux/latest/source/drivers/infiniband/core/ucma.c#L409

There is no protection against calling rdma_destroy_id() -> ucma_destroy_kern_id() in the middle of rdma_get_cm_event().

rleon · Jun 08 '22 05:06

The resources associated with one cm_id consist of two parts: 1) the user space part, represented by struct cma_id_private; 2) the kernel space part, represented by struct ucma_context. So destroying a cm_id involves tearing down both parts properly.

What ucma_destroy_id() does is destroy the kernel space part of the cm_id, which involves cleaning up all unreported events associated with it. This is done by ucma_cleanup_ctx_events() (https://elixir.bootlin.com/linux/latest/source/drivers/infiniband/core/ucma.c#L522). Note that access to the event_list and events_reported is protected by the same mutex as in ucma_get_event().

After ucma_destroy_id() returns, it is guaranteed that the kernel part of the cm_id is destroyed and that no more events associated with it will be reported to user space through ucma_get_event(). However, the user space part cannot be destroyed immediately, since there may still be unacknowledged events in user space. So rdma_destroy_id() then uses a condition variable plus a mutex to wait until events_completed == events_reported, which means that all events associated with this cm_id that were reported to user space have been acknowledged by rdma_ack_cm_event().
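
The accounting pattern being described reduces to a pair of counters guarded by one mutex plus a condition variable. A self-contained sketch with generic names (reported/completed; in the real code the reported count lives in the kernel's struct ucma_context, while the completed count and the condition variable live in struct cma_id_private):

```c
#include <pthread.h>

struct id_events {
	pthread_mutex_t mut;
	pthread_cond_t  cond;
	int reported;    /* events handed out to user space */
	int completed;   /* events acknowledged by the application */
};

/* Get-event path: an event for this id is returned to the application. */
static void event_reported(struct id_events *e)
{
	pthread_mutex_lock(&e->mut);
	e->reported++;
	pthread_mutex_unlock(&e->mut);
}

/* rdma_ack_cm_event()-equivalent. */
static void event_acked(struct id_events *e)
{
	pthread_mutex_lock(&e->mut);
	e->completed++;
	pthread_cond_signal(&e->cond);
	pthread_mutex_unlock(&e->mut);
}

/* Destroy path: once the kernel part is gone, "reported" can no longer
 * grow, so wait here until every handed-out event has been acked and
 * only then free the user-space part. */
static void wait_for_acks(struct id_events *e)
{
	pthread_mutex_lock(&e->mut);
	while (e->completed < e->reported)
		pthread_cond_wait(&e->cond, &e->mut);
	pthread_mutex_unlock(&e->mut);
}
```

In the real code the reported count is returned to user space by the destroy ioctl, which is why the wait in rdma_destroy_id() has a fixed upper bound.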

If rdma_get_cm_event() runs concurrently with rdma_destroy_id(), it is possible for ucma_get_event() to return an event associated with a cm_id whose kernel space part has already been freed. However, since the processing of the event inside rdma_get_cm_event() only touches the user space part of the cm_id (struct cma_id_private), it does not matter whether the kernel space part has been destroyed. On the other hand, the user space part of a cm_id is guaranteed to be valid after the associated event is returned to user space through ucma_get_event(): since the events_reported counter in struct ucma_context is protected by a mutex, it is the exact number of events that have been passed to user space. Thus, rdma_destroy_id() will block even if the associated unacknowledged event has just been returned from ucma_get_event() and has not yet been returned to the caller of rdma_get_cm_event().

To summarize, I think rdma_get_cm_event() and rdma_destroy_id() are thread safe.

FujiZ · Jun 08 '22 11:06

Okay, that seems correct, but it was cargo-cult copied from the verbs version of this, and it doesn't seem to have been fully implemented across all the state transitions - i.e. it should have been used to synchronize QP destruction too if full thread safety was the goal.

But I'm still not completely convinced the intention was to allow events to run concurrently with much of anything else; QP manipulations in particular are never locked anywhere, so it looks like these are expected to somehow run single threaded. For example, you can't race rdma_accept() with rdma_destroy_qp() or many other calls. Which all suggests to me that destroying CM ids cannot be done from outside the thread processing the event loop.

What are you trying to do that requires concurrency here?

jgunthorpe · Jun 08 '22 12:06

Our application uses a background thread to handle the connection setup procedure. It repeatedly calls rdma_get_cm_event(), initializes the necessary resources (QP, receive buffers, etc.) and handles the state transitions of connections. Once a connection is established (i.e. the ESTABLISHED event is received), it emits a message through a message queue to notify the IO thread the connection is bound to that the connection is ready to use. Since almost all resource initialization and state transitions are handled in the background thread, they do not interfere with the ongoing IO operations in the IO thread, which keeps IO throughput and latency stable even when there are many connections to establish.

When the user wants to close a connection, it calls rdma_destroy_qp() and rdma_destroy_id() directly in the IO thread. Moreover, the user may want to close a connection before it is actually established (e.g. the connection setup procedure takes too long and a user-specified timer fires). The resources of one connection can easily be protected by a mutex outside rdmacm, so that even though they may be accessed by multiple threads, the operations applied to one cm_id are serialized (e.g. rdma_accept() will not race with rdma_destroy_qp()). However, since rdma_get_cm_event() is outside the critical section formed by any particular connection's mutex on that event channel, a data race can occur if the QP is accessed inside rdma_get_cm_event() without any synchronization.
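
To make the race concrete, a simplified sketch of the two threads involved; notify_io_thread() is a placeholder for our message-queue notification, and the comments mark where rdma_get_cm_event()'s internal ibv_modify_qp() can collide with rdma_destroy_qp():

```c
#include <rdma/rdma_cma.h>

/* Placeholder: posts "connection ready" to the owning IO thread. */
extern void notify_io_thread(struct rdma_cm_id *id);

static void *cm_thread(void *arg)                   /* background thread */
{
	struct rdma_event_channel *channel = arg;
	struct rdma_cm_event *event;

	while (rdma_get_cm_event(channel, &event) == 0) {
		/* For connection events this call may have transitioned
		 * event->id->qp via ibv_modify_qp() internally. */
		if (event->event == RDMA_CM_EVENT_ESTABLISHED)
			notify_io_thread(event->id);
		rdma_ack_cm_event(event);
	}
	return NULL;
}

static void io_thread_close(struct rdma_cm_id *id)  /* IO thread */
{
	/* Race: if cm_thread() is inside rdma_get_cm_event() for this id
	 * right now, its internal ibv_modify_qp() can hit the QP that is
	 * being destroyed here. */
	rdma_destroy_qp(id);
	rdma_destroy_id(id);
}
```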

We have tried implementing two variants to work around this issue:

  1. When the user wants to close a connection, a message is emitted from the IO thread to the background thread so that rdma_destroy_qp() and rdma_destroy_id() are scheduled onto the background thread that calls rdma_get_cm_event().
  2. Create one mutex per event_channel and use it to serialize rdma_get_cm_event() with every other rdmacm operation applied to the cm_ids associated with that channel (see the sketch below).
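
Roughly what variant 2 looked like, assuming the channel fd is switched to non-blocking mode so the lock is never held while sleeping in the kernel; the wrapper names are ours, not part of rdmacm:

```c
#include <fcntl.h>
#include <pthread.h>
#include <rdma/rdma_cma.h>

struct channel_ctx {
	struct rdma_event_channel *channel;
	pthread_mutex_t            lock;  /* serializes all CM ops on this channel */
};

static void channel_ctx_init(struct channel_ctx *ctx,
			     struct rdma_event_channel *channel)
{
	ctx->channel = channel;
	pthread_mutex_init(&ctx->lock, NULL);
	/* Non-blocking fd, so rdma_get_cm_event() never sleeps in the
	 * kernel while the lock is held; the caller polls the fd first. */
	fcntl(channel->fd, F_SETFL,
	      fcntl(channel->fd, F_GETFL) | O_NONBLOCK);
}

static int locked_get_cm_event(struct channel_ctx *ctx,
			       struct rdma_cm_event **event)
{
	int ret;

	pthread_mutex_lock(&ctx->lock);
	ret = rdma_get_cm_event(ctx->channel, event);
	pthread_mutex_unlock(&ctx->lock);
	return ret;
}

static void locked_destroy_qp(struct channel_ctx *ctx, struct rdma_cm_id *id)
{
	pthread_mutex_lock(&ctx->lock);
	rdma_destroy_qp(id);
	pthread_mutex_unlock(&ctx->lock);
}
```

Even with a non-blocking fd, every CM call on every cm_id bound to the channel funnels through this one lock, which is where the contention we measured comes from.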

We observed a significant performance penalty in terms of connection setup throughput and latency in both variants: with one IO thread, the average latency increases from 1.6ms to 13.7ms and 8.0ms respectively; with 8 IO threads, it increases from 8.3ms to 18.9ms and 19.8ms. The connection setup throughput (RPS) drops from 738 to 403 and 601, respectively. The performance of connection setup is crucial to the availability of our service.

FujiZ · Jun 10 '22 06:06

I'm confused why your scenario 1 would result in a performance change for connection setup - you don't call rdma_destroy_qp() during connection setup, so why does it run slower? Is it because the connection-processing thread is busy doing deferred destroy_id instead of accepting new connections? If so, isn't that a latency-consistency problem for the IO thread anyhow?

But you did not answer my question - rdma_destroy_id() cannot be raced with many other functions besides rdma_get_cm_event(), so it seems problematic to have an architecture where another thread is calling rdma_destroy_id(). Did you solve these other cases somehow?

Also, why is holding a mutex inside get_cm_event acceptable to your metrics?

jgunthorpe · Jun 24 '22 23:06

Closed due to inactivity.

rleon · Sep 21 '22 09:09