C++ data race with Communicator::stringToProxy and Communicator::destroy
While porting Swift to use async/await I got this crash
* thread #2, queue = 'com.apple.root.default-qos.cooperative', stop reason = EXC_BAD_ACCESS (code=1, address=0x5f)
* frame #0: 0x0000000105081c6c TestDriver`std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>::__is_long[abi:ue170006](this="") const at string:1734:33
frame #1: 0x0000000105081c1c TestDriver`std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>::__get_pointer[abi:ue170006](this="") const at string:1869:17
frame #2: 0x0000000105081bdc TestDriver`std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>::data[abi:ue170006](this="") const at string:1559:73
frame #3: 0x00000001050ca1c4 TestDriver`std::__1::basic_ostream<char, std::__1::char_traits<char>>& std::__1::operator<<[abi:ue170006]<char, std::__1::char_traits<char>, std::__1::allocator<char>>(__os=0x000000016af15020, __str="") at ostream:1093:56
frame #4: 0x00000001051272c4 TestDriver`IceInternal::EndpointI::initWithOptions(this=0x000060000166c418, args=size=2) at EndpointI.cpp:49:17
frame #5: 0x0000000105138c54 TestDriver`IceInternal::IPEndpointI::initWithOptions(this=0x000060000166c418, args=size=2, oaEndpoint=false) at IPEndpointI.cpp:378:16
frame #6: 0x000000010546ed10 TestDriver`IceInternal::TcpEndpointFactory::create(this=0x00006000024743d8, args=size=2, oaEndpoint=false) const at TcpEndpointI.cpp:364:12
frame #7: 0x000000010512328c TestDriver`IceInternal::EndpointFactoryManager::create(this=0x0000600000474398, str="tcp -p 12010", oaEndpoint=false) const at EndpointFactoryManager.cpp:108:35
frame #8: 0x00000001053fdeb8 TestDriver`IceInternal::ReferenceFactory::create(this=0x0000600001f74018, str="test:tcp -p 12010", propertyPrefix="") at ReferenceFactory.cpp:469:74
frame #9: 0x00000001050989e8 TestDriver`Ice::Communicator::_stringToProxy(this=0x0000600003f4c658, s="test:tcp -p 12010") const at Communicator.cpp:84:43
frame #10: 0x00000001056e37d0 TestDriver`std::__1::optional<Ice::ObjectPrx> Ice::Communicator::stringToProxy<Ice::ObjectPrx, true>(this=0x0000600003f4c658, str="test:tcp -p 12010") const at Communicator.h:95:30
frame #11: 0x00000001056e35c8 TestDriver`-[ICECommunicator stringToProxy:error:](self=0x0000600002a703a0, _cmd="stringToProxy:error:", str="test:tcp -p 12010", error=0x000000016af16a70) at Communicator.mm:51:39
frame #12: 0x0000000104fab6bc TestDriver`closure #1 in CommunicatorI.stringToProxyImpl<ProxyImpl>(self=0x0000600000e50d80, str="test:tcp -p 12010") at CommunicatorI.swift:251:46
frame #13: 0x0000000104fad9c4 TestDriver`partial apply for closure #1 in CommunicatorI.stringToProxyImpl<A>(_:) at <compiler-generated>:0
frame #14: 0x00000001a2d1c54c libswiftObjectiveC.dylib`ObjectiveC.autoreleasepool<τ_0_0>(invoking: () throws -> τ_0_0) throws -> τ_0_0 + 64
frame #15: 0x0000000104fab14c TestDriver`CommunicatorI.stringToProxyImpl<ProxyImpl>(str="test:tcp -p 12010", self=0x0000600000e50d80) at CommunicatorI.swift:250:20
frame #16: 0x0000000104fa6ca8 TestDriver`CommunicatorI.stringToProxy(str="test:tcp -p 12010", self=0x0000600000e50d80) at CommunicatorI.swift:48:13
frame #17: 0x0000000104fabb64 TestDriver`protocol witness for Communicator.stringToProxy(_:) in conformance CommunicatorI at <compiler-generated>:0
frame #18: 0x00000001057c0358 TestDriver`closure #1 in allTests(output=0x0000600002860230, comm=0x0000600000e50d80, ref="test:tcp -p 12010") at AllTests.swift:60:36
frame #19: 0x00000001057c0d44 TestDriver`partial apply for closure #1 in allTests(_:) at <compiler-generated>:0
frame #20: 0x0000000104fa2214 TestDriver`thunk for @escaping @callee_guaranteed @Sendable @async () -> (@out A) at <compiler-generated>:0
frame #21: 0x0000000104fa2370 TestDriver`partial apply for thunk for @escaping @callee_guaranteed @Sendable @async () -> (@out A) at <compiler-generated>:0
The code in question was running stringToProxy and destroy at the same time.
Task {
try communicator.stringToProxy(ref)
}
communicator.destory()
It is not clear to me that this is a race condition issue. Doesn't Task create a detached task that could run after main exits?
That was just a snippet of the test that was failing. It was not exiting main for a while after.
It's reproducible in just C++ .
#include <Ice/Ice.h>
int main(int argc, char *argv[])
{
int count = 0;
for (;;)
{
Ice::CommunicatorHolder ich(argc, argv);
auto communicator = ich.communicator();
std::thread t([communicator]
{
for (;;)
{
try {
communicator->stringToProxy("test:tcp -h 127.0.0.1 -p 10000");
} catch (Ice::CommunicatorDestroyedException &e) {
break;
}
} });
communicator->destroy();
t.join();
}
return 0;
}
❯ clang++ -std=c++17 test.cpp -lIce -I/Users/joe/Developer/zeroc-ice/ice/cpp/include -I/Users/joe/Developer/zeroc-ice/ice/cpp/include/generated -L/Users/joe/Developer/zeroc-ice/ice/cpp/lib -Wl,-rpath,/Users/joe/Developer/zeroc-ice/ice/cpp/lib -o test
❯ ./test
fish: Job 1, './test' terminated by signal SIGSEGV (Address boundary error)
After running this a bunch there seems to be a variety of failures you can get.
- invalid endpoint
- SIGSEGV (Address boundary error)
- EXC_BAD_ACCESS
One issue is that endpoint factories set the instance member to nullptr during destroy, but access it without chekcing.
For example:
https://github.com/zeroc-ice/ice/blob/1ae9e22a64a2f8c80bdf5ad280f71064a4000c64/cpp/src/Ice/TcpEndpointI.cpp#L373-L377
and:
https://github.com/zeroc-ice/ice/blob/1ae9e22a64a2f8c80bdf5ad280f71064a4000c64/cpp/src/Ice/TcpEndpointI.cpp#L353-L357
A second issue is the endpoint factory manager clears the factories vector, while it is in use:
For example:
https://github.com/zeroc-ice/ice/blob/1ae9e22a64a2f8c80bdf5ad280f71064a4000c64/cpp/src/Ice/EndpointFactoryManager.cpp#L190-L197
and:
https://github.com/zeroc-ice/ice/blob/1ae9e22a64a2f8c80bdf5ad280f71064a4000c64/cpp/src/Ice/EndpointFactoryManager.cpp#L92-L103
We can fix this one by holding the mutex in destroy.
For "invalid endpoint" it is probably because, the factory was destroyed and removed, before it has a chance to parse the endpoint. We can fix ReferenceFactory to throw CommunicatorDestroyedException in this case.
I also got this crash with the same test:
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x180)
* frame #0: 0x00000001023abb00 libIce.38a0.dylib`std::__1::shared_ptr<IceInternal::DefaultsAndOverrides>::get[abi:ue170006](this=0x0000000000000180) const at shared_ptr.h:871:16
frame #1: 0x000000010239bee0 libIce.38a0.dylib`std::__1::shared_ptr<IceInternal::DefaultsAndOverrides>::operator bool[abi:ue170006](this=0x0000000000000180) const at shared_ptr.h:903:16
frame #2: 0x000000010239be80 libIce.38a0.dylib`IceInternal::Instance::defaultsAndOverrides(this=0x0000000000000000) const at Instance.cpp:289:5
frame #3: 0x000000010259234c libIce.38a0.dylib`IceInternal::ProtocolInstance::defaultSourceAddress(this=0x0000600000b14518) const at ProtocolInstance.cpp:103:23
frame #4: 0x0000000102351264 libIce.38a0.dylib`IceInternal::IPEndpointI::initWithOptions(this=0x0000600001914018, args=size=0, oaEndpoint=false) at IPEndpointI.cpp:408:56
(lldb) frame select 3
frame #3: 0x00000001025922a8 libIce.38a0.dylib`IceInternal::ProtocolInstance::defaultSourceAddress(this=0x0000600001f80218) const at ProtocolInstance.cpp:103:23
100 const Address&
101 IceInternal::ProtocolInstance::defaultSourceAddress() const
102 {
-> 103 return _instance->defaultsAndOverrides()->defaultSourceAddress;
104 }
105
106 const EncodingVersion&
(lldb) print _instance
(const IceInternal::InstancePtr) nullptr {
__ptr_ = nullptr
See also https://github.com/zeroc-ice/ice/tree/v3.7.2.1-java-npe-fix