
AsyncNotifyWhenDone is called before the CQ returns the tag

Open ashu-ciena opened this issue 1 year ago • 9 comments

What version of gRPC and what language are you using?

C++ grpc 1.51.0

What operating system (Linux, Windows,...) and version?

Linux 5.10

What did you do?

While streaming of RPC response is in flight, the client disconnects the RPC.

What did you expect to see?

The completion queue returns the tag of the last in-flight write, followed by the tag passed in the AsyncNotifyWhenDone registration.

What did you see instead?

Sometimes the tag passed in the AsyncNotifyWhenDone registration is returned by the CQ before the tag of the last write on the stream. As part of handling the AsyncNotifyWhenDone tag, we clean up the stream object. We then see a crash in gRPC code, as it may still be trying to access the stream while completing the last write tag.

Backtrace:
#0  __pthread_kill_implementation (threadid=, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x0000007fb1871f18 in __pthread_kill_internal (signo=6, threadid=) at pthread_kill.c:78
#2  0x0000007fb182dd50 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x0000007fb181af64 in __GI_abort () at abort.c:79
#4  0x0000007fb1865b08 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fb1943fb8 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#5  0x0000007fb187c150 in malloc_printerr (str=str@entry=0x7fb193f110 "free(): double free detected in tcache 2") at malloc.c:5659
#6  0x0000007fb187e2f0 in _int_free (av=0x7f98000030, p=0x7f98070aa0, have_lock=0) at malloc.c:4468
#7  0x0000007fb1880a78 in __GI___libc_free (mem=) at malloc.c:3386
#8  0x0000007fafc99be8 in grpc_byte_buffer_destroy (bb=0x7f98070ab0) at /usr/src/debug/grpc/1.59.2-r0/git/src/core/lib/surface/byte_buffer.cc:88
#9  0x0000007fb2a7bb04 in grpc::ByteBuffer::Clear (this=0x7f9801df00) at /usr/include/grpcpp/support/byte_buffer.h:130
#10 grpc::internal::CallOpSendMessage::FinishOp (status=0x7fab8a4fd7, this=0x7f9801def0) at /usr/include/grpcpp/impl/call_op_set.h:332
#11 grpc::internal::CallOpSet<grpc::internal::CallOpSendInitialMetadata, grpc::internal::CallOpSendMessage, grpc::internal::CallOpServerSendStatus, grpc::internal::CallNoOp<4>, grpc::internal::CallNoOp<5>, grpc::internal::CallNoOp<6> >::FinalizeResult (this=0x7f9801dec0, tag=0x7fab8a4fd8, status=0x7fab8a4fd7) at /usr/include/grpcpp/impl/call_op_set.h:919
#12 0x0000007faff7e598 in grpc::CompletionQueue::AsyncNextInternal (this=this@entry=0x7f98004f50, tag=tag@entry=0x7fab8a4fd8, ok=ok@entry=0x7fab8a4fd7, deadline=...) at /usr/src/debug/grpc/1.59.2-r0/git/src/cpp/common/completion_queue_cc.cc:157
#13 0x0000007fb2a8a6c0 in grpc::CompletionQueue::Next (ok=0x7fab8a4fd7, tag=0x7fab8a4fd8, this=0x7f98004f50) at /usr/include/grpcpp/completion_queue.h:182
#14 ciena::tmet::grpc::ServerContextImpl::run (this=0x7f98005d30) at ../git/src/transport/grpc/server.cpp:493
#15 0x0000007fb2a89658 in ciena::tmet::grpc::ServerContext::run (this=this@entry=0x22347040) at /usr/include/c++/11.4.0/bits/unique_ptr.h:173
#16 0x0000000000407c34 in grpcServer (svcContext=, pUserData=0x22347040) at ../git/src/main.cpp:79
#17 0x0000007fb221ed3c in svcCzmqActorFn (pPipe=0x2238e2d0, pArgs=0x2238c0a0) at ../git/cncore/src/serviceContext.c:9602
#18 0x0000007fb0c87878 in s_thread_shim (args=0x2238e2b0) at /usr/src/debug/czmq/4.2.1-r0/czmq-4.2.1/src/zactor.c:68
#19 0x0000007fb18702e8 in start_thread (arg=0x7fcc3fc6b7) at pthread_create.c:442
#20 0x0000007fb18d7c5c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79
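
For illustration, the cleanup pattern described above looks roughly like the sketch below. The names (CallData, TagKind, TagInfo, HandleWriteDone, DrainLoop) are hypothetical placeholders, not our actual server code; the point is only the ordering hazard: handling the AsyncNotifyWhenDone tag frees the per-call state while a write tag referencing that state is still queued, which matches the double free seen in the backtrace.

```cpp
#include <grpcpp/grpcpp.h>

// Hypothetical per-call state; not the real class names from server.cpp.
struct CallData {
  grpc::ServerContext ctx;
  // ... async response writer, pending ByteBuffer, etc. ...
  void HandleWriteDone(bool ok) { /* advance the stream */ }
};

enum class TagKind { kDone, kWrite };
struct TagInfo { TagKind kind; CallData* call; };

void DrainLoop(grpc::ServerCompletionQueue* cq) {
  void* raw_tag;
  bool ok;
  while (cq->Next(&raw_tag, &ok)) {
    auto* tag = static_cast<TagInfo*>(raw_tag);
    CallData* call = tag->call;
    switch (tag->kind) {
      case TagKind::kDone:
        // AsyncNotifyWhenDone fired (e.g. client disconnect).
        delete call;  // premature if a Write() tag is still in the queue
        break;
      case TagKind::kWrite:
        // If the kDone tag was processed first, 'call' is already freed
        // and the write's CallOpSet/ByteBuffer were destroyed with it.
        call->HandleWriteDone(ok);
        break;
    }
  }
}
```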

I do not have a recipe for reproducing this 100% of the time, but as explained above, when the sequence of returned tags is correct (which is the normal disconnection scenario), we don't see any crash.

I started seeing this issue frequently after upgrading from 1.42.0 to 1.51.0. Was there any change in completion queue logic or handling between these releases?

Please let me know what additional information you need to help find the root cause.

I looked at similar issues reported by other folks: #22021 and #10136.

ashu-ciena avatar Mar 07 '24 14:03 ashu-ciena

The completion queue returns the tag of the last in-flight write, followed by the tag passed in the AsyncNotifyWhenDone registration.

Why do you have this expectation? AsyncNotifyWhenDone returns when the RPC has finished; for example, the client might have cancelled the call or there might have been a network disconnect. It is entirely possible for the AsyncNotifyWhenDone tag to come back earlier than it would under normal operation.

I do not have a recipe for reproducing this 100% of the time, but as explained above, when the sequence of returned tags is correct (which is the normal disconnection scenario), we don't see any crash.

From your explanation, it sounds like the client is intentionally disconnecting the RPC, so this behavior is justified. It is likely that between 1.42 and 1.51 there were internal changes that result in slightly different timing, but your code shouldn't depend on such behavior.

Again, I'd recommend using the C++ callback API, which might be more intuitive to use than the CQ-based API.
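
For reference, a server-streaming handler in the callback API looks roughly like the sketch below. MyService, StreamRequest, StreamReply, set_payload, and StreamData are hypothetical placeholders for the generated proto types and method; the relevant property is that OnDone() is the single point where cleanup is known to be safe.

```cpp
#include <string>
#include <grpcpp/grpcpp.h>
// StreamRequest/StreamReply stand in for your generated proto messages.

class StreamReactor : public grpc::ServerWriteReactor<StreamReply> {
 public:
  explicit StreamReactor(const StreamRequest* /*req*/) { NextWrite(); }

  void OnWriteDone(bool ok) override {
    if (!ok) { Finish(grpc::Status::CANCELLED); return; }
    NextWrite();
  }
  void OnCancel() override { /* stop producing; OnDone() still follows */ }
  void OnDone() override { delete this; }  // safe: no operations can still be pending

 private:
  void NextWrite() {
    if (sent_ < 3) {  // placeholder production loop
      reply_.set_payload("chunk " + std::to_string(sent_++));  // hypothetical field
      StartWrite(&reply_);
    } else {
      Finish(grpc::Status::OK);
    }
  }
  StreamReply reply_;
  int sent_ = 0;
};

// In the CallbackService implementation (method name hypothetical):
// grpc::ServerWriteReactor<StreamReply>* StreamData(
//     grpc::CallbackServerContext* ctx, const StreamRequest* req) override {
//   return new StreamReactor(req);
// }
```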

yashykt avatar Mar 12 '24 20:03 yashykt

More than 30 days have passed since label "disposition/requires reporter action" was added. Closing this issue. Please feel free to re-open/create a new issue if this is still relevant.

grpc-bot avatar Apr 13 '24 00:04 grpc-bot

@yashykt This issue started occurring more frequently than anticipated. This time I am posting a full crash analysis to give the gRPC team an idea of what could be going wrong on either their side or our side. Moving to the callback API could bring a lot of regression and performance challenges, and raises the question of how quickly we can send out data in async mode. Could this be a performance degradation in the stack?

ashu-ciena avatar Apr 23 '24 07:04 ashu-ciena

@yijiem and @yashykt Can you re-open this issue? Please take a look at the attached core-dump analysis of each frame. We recently upgraded the stack from 1.42.x to 1.59.2 to address a security vulnerability. There was no other change in the code, so the problem appears to occur only after the upgrade. I request that you take a look, since moving to the callback API is not possible as an immediate solution and we don't want our customers to observe this failure, as it is service affecting.

ashu-ciena avatar Apr 29 '24 18:04 ashu-ciena

Thanks @yijiem. Before the upgrade, we used to get ok=false from cq->Next() when the client disconnected suddenly, and we were able to handle it gracefully. But now the cq->Next() call does not return and instead hits a double free inside this very call trace. I think somewhere the buffer is not being set to NULL after it is freed the first time.

ashu-ciena avatar Apr 29 '24 18:04 ashu-ciena

@yashykt Can you please take a look at the backtrace analysis and give some guidance on how we can avoid this crash?

ashu-ciena avatar May 09 '24 07:05 ashu-ciena

@yashykt can we set up a conference call to get on the same page and make some progress on this issue?

ashu-ciena avatar May 16 '24 05:05 ashu-ciena

Hi @ashu-ciena, I'm happy to get on a conference call to help with this, but with the way the cq based async API is structured, such a sequence of events is valid.

yashykt avatar May 16 '24 22:05 yashykt

The tag from AsyncNotifyWhenDone just means that the RPC is done/cancelled. It doesn't mean that it is safe to clean up resources. I think there needs to be some kind of state tracking mechanism to check whether it is actually safe to clean up the RPC.
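
A minimal sketch of that kind of state tracking is below, assuming a hypothetical CallData class (the member names are illustrative, not the reporter's code): every operation handed to the CQ increments a pending counter, the AsyncNotifyWhenDone tag only sets a flag, and the object is freed only once both conditions hold.

```cpp
// Hypothetical per-call bookkeeping, not the reporter's actual class.
struct CallData {
  int pending_ops = 0;   // tags handed to the CQ that have not come back yet
  bool done = false;     // the AsyncNotifyWhenDone tag has been received

  void StartOp() {
    ++pending_ops;       // call before writer.Write(...), Finish(...), etc.
  }

  void OnOpTag() {
    --pending_ops;       // a write/finish tag came back from the CQ
    MaybeCleanup();
  }

  void OnDoneTag() {
    done = true;         // RPC finished or was cancelled; do NOT free yet
    MaybeCleanup();
  }

  void MaybeCleanup() {
    if (done && pending_ops == 0) {
      delete this;       // no tag referencing this object remains in the CQ
    }
  }
};
```

The counting has to cover every operation that produces a tag, including the final Finish(), so that no completion can ever be delivered for an already-freed object.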

yashykt avatar May 16 '24 22:05 yashykt

More than 30 days have passed since label "disposition/requires reporter action" was added. Closing this issue. Please feel free to re-open/create a new issue if this is still relevant.

grpc-bot avatar Jun 21 '24 00:06 grpc-bot