gloo
I am able to reproduce #147
Hi @pietern,
I get this error consistently. Error 5 is [IBV_WC_WR_FLUSH_ERR] = "Work Request Flushed Error".
I did some debugging, and the cause is that the pair destructor does not account for receive WQEs that are still posted (the ones used for memory registration, I believe) when it is called. When the destructor calls ibv_destroy_qp(), the RoCE library is supposed to clean up those WQEs and return them to the application as error CQEs. From IB Spec Section 11.6.2: "Work Request Flushed Error - A Work Request was in process or outstanding when the QP transitioned into the Error State."
So when ibv_destroy_qp() is called, the device thread gets another event while the destructor is still executing. This has two effects:
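To illustrate the status check involved, here is a minimal sketch of how a completion handler could tell a flushed work request during teardown apart from a genuine error. It deliberately does not use libibverbs: WcStatus only mirrors the first values of enum ibv_wc_status, and classifyCompletion(), Action and the closing flag are hypothetical names, not gloo's actual code.

```cpp
#include <stdexcept>
#include <string>

// Stand-in for the first values of libibverbs' enum ibv_wc_status.
// Value 5 corresponds to IBV_WC_WR_FLUSH_ERR.
enum class WcStatus : int {
  Success = 0,
  LocLenErr = 1,
  LocQpOpErr = 2,
  LocEecOpErr = 3,
  LocProtErr = 4,
  WrFlushErr = 5,
};

// What the handler should do with a completion (hypothetical).
enum class Action { Process, Ignore };

// Hypothetical helper: `closing` is assumed to be set by the destructor
// before it destroys the queue pair.
Action classifyCompletion(WcStatus status, bool closing) {
  if (status == WcStatus::Success) {
    return Action::Process;
  }
  // Flushed work requests are expected once the QP is being torn down.
  if (status == WcStatus::WrFlushErr && closing) {
    return Action::Ignore;
  }
  // Anything else is still treated as a real error.
  throw std::runtime_error(
      "unexpected completion status " +
      std::to_string(static_cast<int>(status)));
}
```

With this shape, a flush error seen while the pair is still in normal operation keeps throwing, so real failures are not hidden.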
- Pair::handleCompletionEvent() gets called, and that eventually leads to Pair:461 throwing an exception because it is not expecting any cqes with an error status.
- At this point there is an unacked event (ibv_ack_cq_events() was called earlier), so if the code is changed to survive the exception, there will be a deadlock inside ibv_destroy_cq().

I have a workaround for both bugs. It is not the most elegant: there must be some RoCE implementation that does not behave according to spec, so the "flush" event may never be delivered and I cannot count on it. I am therefore just waiting for some time for the event to possibly be delivered. If you are interested, I have a patch, but you can probably figure out a nicer solution.

The best would be to disable event delivery before the destructor is called, and trigger one more event without errors, so the event mechanism gets disarmed. Calling setSync(true,true) in the destructor, which I am doing in my fix, is too late: the event is still delivered, and the call is only enough to prevent handleCompletionEvent() from being called in the device thread context.

Regards,
-Gug
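The deadlock comes down to event bookkeeping: destroying a CQ waits until every completion event that was received has been acknowledged. A rough sketch of that accounting follows. CqEventLedger and its methods are hypothetical and only model the counts; the real calls (ibv_get_cq_event(), ibv_ack_cq_events(), ibv_destroy_cq()) are named in comments only.

```cpp
#include <cstddef>

// Hypothetical ledger of completion events for one CQ.
class CqEventLedger {
 public:
  // Call when ibv_get_cq_event() has returned an event for this CQ.
  void onEventReceived() { ++received_; }

  // Returns the number of events to pass to ibv_ack_cq_events()
  // and marks them as acknowledged.
  std::size_t ackAll() {
    const std::size_t n = received_ - acked_;
    acked_ = received_;
    return n;
  }

  // Events received but not yet acknowledged.
  std::size_t pending() const { return received_ - acked_; }

  // ibv_destroy_cq() would block while this is false.
  bool safeToDestroy() const { return pending() == 0; }

 private:
  std::size_t received_ = 0;
  std::size_t acked_ = 0;
};
```

In the scenario above, the flush event arrives after the earlier acknowledgement, so pending() is 1 and one more acknowledgement is needed before the CQ can be destroyed.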
https://github.com/facebookincubator/gloo/issues/147
@guglielmo-morandin I saw the same problem and your patch fixed the issue. You should submit a pull request so this can be discussed/committed.