AsyncNotifyWhenDone is called before the CQ returns the tag
What version of gRPC and what language are you using?
C++ grpc 1.51.0
What operating system (Linux, Windows,...) and version?
Linux 5.10
What did you do?
While a streaming RPC response is in flight, the client disconnects the RPC.
What did you expect to see?
The completion queue returns the tag of the last write in flight, followed by the tag passed in the AsyncNotifyWhenDone registration.
What did you see instead?
Sometimes the tag passed in the AsyncNotifyWhenDone registration is returned by the CQ before the tag of the last write on the stream. As part of handling the AsyncNotifyWhenDone tag, we clean up the stream object. We then see a crash in gRPC code, as it may be trying to access the stream object while processing the last write tag.
Backtrace (truncated): #0 __pthread_kill_implementation (threadid=
I do not have a recipe to reproduce this 100% of the time, but as explained above, when the sequence of returned tags is correct (the normal disconnection scenario), we don't see any crash.
I started seeing this issue frequently after upgrading from 1.42.0 to 1.51.0. Was there any change in completion queue logic/handling between these releases?
Please let me know what additional information would help in finding the root cause.
I looked at similar issues reported by other folks: #22021 and 10136
> The completion queue returns the tag of the last write in flight, followed by the tag passed in the AsyncNotifyWhenDone registration.
Why do you have this expectation? AsyncNotifyWhenDone returns when the RPC has finished. For example, the client might have cancelled the call, or there might have been a network disconnect. It is very possible for the tag from AsyncNotifyWhenDone to return earlier than it would in normal operation.
> I do not have a recipe to reproduce this 100% of the time, but as explained above, when the sequence of returned tags is correct (the normal disconnection scenario), we don't see any crash.
From your explanation, it seems the client is intentionally disconnecting the RPC, so this behavior is justified. It is likely that between 1.42 and 1.51 there were internal changes that result in slightly different timing, but your code shouldn't depend on that ordering.
Again, I'd like to recommend using the C++ callback API which might be more intuitive to use compared to the cq based API.
More than 30 days have passed since label "disposition/requires reporter action" was added. Closing this issue. Please feel free to re-open/create a new issue if this is still relevant.
@yashykt This issue started occurring more frequently than anticipated. This time I am posting a full crash analysis to give the gRPC team an idea of what could be going wrong, on either their side or our side. Moving to the callback API could bring a lot of regression and performance challenges, and raises the question of how quickly we can send out data compared to async mode. Could this be a performance degradation in the stack?
@yijiem and @yashykt Can you re-open this issue? Take a look at the attached core dump analysis of each frame. We recently upgraded the stack from 1.42.x to 1.59.2 to address a security vulnerability. There has been no other change in the code, so the crash appears to occur only after the upgrade. I request you to take a look, as moving to the callback API as an immediate solution is not possible and we don't want our customers to observe this failure, as it is service-affecting.
Thanks @yijiem Before the upgrade, we used to get ok=false on cq->Next when the client suddenly disconnected, and we were able to handle it gracefully. But now the cq->Next call does not return and instead causes a double free in this call trace. I think somewhere we are failing to set buffer = NULL after it is freed the first time.
@yashykt Can you please take a look at the backtrace analysis and give some guidance on how we can avoid this crash?
@yashykt can we do a conference call to get on the same page and make some progress on this issue?
Hi @ashu-ciena, I'm happy to get on a conference call to help with this, but with the way the cq-based async API is structured, such a sequence of events is valid.
The tag from AsyncNotifyWhenDone just means that the RPC is done/cancelled. It doesn't mean that it is safe to clean up resources yet. I think there needs to be some kind of state-tracking mechanism to check whether it is actually safe to clean up the RPC.
More than 30 days have passed since label "disposition/requires reporter action" was added. Closing this issue. Please feel free to re-open/create a new issue if this is still relevant.