hazelcast-cpp-client

Test executable crashes when running ReliableTopicTest

Open yemreinci opened this issue 3 years ago • 7 comments

The test executable occasionally crashes on GA while executing the tests under the suite ReliableTopicTest.

Here is a run that crashed: https://github.com/hazelcast/hazelcast-cpp-client/runs/2559526252?check_suite_focus=true

yemreinci avatar May 11 '21 21:05 yemreinci

Can you also attach the stack trace and log here in case the GA build log is lost?

ihsandemir avatar May 12 '21 08:05 ihsandemir

Can you also attach the stack trace and log here in case the GA build log is lost?

2_Ubuntu-x86_64 (Debug, Static, SSL).txt

yemreinci avatar May 12 '21 08:05 yemreinci

This crash happens when the thread spawned for the continuation handler in ClientInvocation::invoke outlives the client object and tries to access executor_ to schedule the next continuation handler. Since the thread is created freely, it isn't managed by a thread pool or by the client, so it can still be running after the client object is destroyed.

yemreinci avatar Aug 27 '21 07:08 yemreinci

When I look at the code, reliable_topic::on_shutdown cancels the runner, and we have code that waits for the executor threads to finish at this line during shutdown. Hence, I was expecting the active threads to finish gracefully before the client destructor destructs the objects. The general logic is to stop all outstanding threads on client shutdown, and only then is the client destructed. Did you check if any outstanding such thread lives following the shutdown?

ihsandemir avatar Aug 27 '21 08:08 ihsandemir

Hence, I was expecting the active threads to finish gracefully before the client destructor destructs the objects.

Yes, but the thread I mentioned is not managed by the client. Notice that the continuation handler is not bound to a specific executor, so a new, independent thread is created for it. After executing its continuation handler (which can happen long after the client object is gone), it wants to submit the job for the next handler, which is this. To do that, it needs to access executor_, which has already been destroyed. It doesn't matter whether you cancelled the message runner or joined all the other threads: as long as this is a free thread, it can outlive everything and try to access the destroyed executor for the next handler. All of this happens within the Boost future library.
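To make the lifetime problem concrete, here is a minimal sketch using only the standard library. `Executor`, `Client`, `spawn_continuation` and `continuation_runs` are hypothetical names for illustration, not the real hazelcast-cpp-client or Boost types. It shows one common way to keep a free thread from touching a destroyed executor (holding a `std::weak_ptr` and checking it before scheduling), which is not necessarily how the issue was eventually fixed.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <future>
#include <memory>
#include <thread>

// Hypothetical stand-in for the client's executor (not the real type).
class Executor {
public:
    void submit(const std::function<void()>& job) { job(); }
};

// Hypothetical stand-in for the client object that owns the executor.
struct Client {
    std::shared_ptr<Executor> executor_ = std::make_shared<Executor>();
};

// Starts a free thread, as an unbound continuation would. The thread waits
// for `go` and then tries to schedule the next handler. It holds only a
// weak_ptr, so it can detect that the executor is gone instead of
// dereferencing a destroyed object.
std::thread spawn_continuation(std::weak_ptr<Executor> weak_exec,
                               std::shared_future<void> go,
                               std::atomic<bool>& scheduled) {
    return std::thread([weak_exec, go, &scheduled] {
        go.wait();
        if (auto exec = weak_exec.lock()) {
            exec->submit([&scheduled] { scheduled = true; });
        }
        // If lock() fails, the client is already gone and the next handler
        // is silently skipped.
    });
}

// Returns true if the next handler was scheduled.
bool continuation_runs(bool destroy_client_first) {
    std::promise<void> start;
    std::atomic<bool> scheduled{false};
    auto client = std::make_unique<Client>();
    std::thread t = spawn_continuation(client->executor_,
                                       start.get_future().share(), scheduled);
    if (destroy_client_first) {
        client.reset();  // the executor is destroyed while the thread lives on
    }
    start.set_value();   // let the free thread proceed
    t.join();
    return scheduled.load();
}
```

Calling `continuation_runs(false)` schedules the next handler as usual, while `continuation_runs(true)` skips it instead of accessing the destroyed executor. With a raw pointer or reference in place of the `weak_ptr`, the second case would be undefined behaviour, which matches the kind of crash described above.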

Did you check if any outstanding such thread lives following the shutdown?

Yes, the thread is listed as Thread 1 (Thread 0x7fd926ffd700 (LWP 135427)): in the above log. It's left over from the previous test ReliableTopicTest.testConfig.

yemreinci avatar Aug 27 '21 10:08 yemreinci

OK, I see, I looked at the wrong completion. Yes, I remember that I had to do this because a user thread was getting stuck waiting for an invocation response (future.get) in one of the tests (it may be the invocation_should_not_block_indefinitely_during_client_shutdown test) when the client is shut down and there is no other thread to notify the user thread. Normally, I would expect the other completion to be effective, but there was a problem and it was not working. I just tried the test with the lines you mention commented out and it seems to pass on macOS, but the failure may be random or Linux-only, so this needs further testing. I would like to remove those lines if possible; we just need to make sure that the user cannot get stuck on future.get.
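The requirement that a user must not get stuck on future.get can be sketched as follows, again with the standard library only. `PendingInvocations` and `get_unblocks_on_shutdown` are hypothetical names, and the approach (failing every outstanding promise during shutdown) is an assumption about one possible design, not a description of the client's actual code.

```cpp
#include <cassert>
#include <exception>
#include <future>
#include <mutex>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

// Hypothetical registry of invocations that have not received a response yet.
class PendingInvocations {
public:
    // Registers a new invocation and returns the future the user waits on.
    std::future<std::string> register_invocation() {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.emplace_back();
        return pending_.back().get_future();
    }

    // Fails every outstanding invocation so that no caller stays blocked.
    // This sketch assumes pending_ only holds promises with no value set.
    void shutdown() {
        std::lock_guard<std::mutex> lock(mutex_);
        for (auto& promise : pending_) {
            promise.set_exception(std::make_exception_ptr(
                std::runtime_error("client is shut down")));
        }
        pending_.clear();
    }

private:
    std::mutex mutex_;
    std::vector<std::promise<std::string>> pending_;
};

// Returns true if a blocked get() was released by shutdown() with an error.
bool get_unblocks_on_shutdown() {
    PendingInvocations invocations;
    std::future<std::string> response = invocations.register_invocation();
    std::thread shutdown_thread([&invocations] { invocations.shutdown(); });
    bool failed = false;
    try {
        response.get();  // would block forever if shutdown() did nothing
    } catch (const std::runtime_error&) {
        failed = true;
    }
    shutdown_thread.join();
    return failed;
}
```

Here the thread performing the shutdown is the one that releases the waiting caller, so no extra free thread is needed to deliver the notification.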

ihsandemir avatar Aug 27 '21 11:08 ihsandemir

Related to #852

hakanaktas0 avatar Aug 27 '22 23:08 hakanaktas0

Solved with PR #1071

akeles85 avatar Feb 06 '23 07:02 akeles85