Fast-DDS
Fast-DDS copied to clipboard
Probabilistic exception (nullptr access to RTPSReader listener) when deleting a DataReader
Is there an already existing issue for this?
- [X] I have searched the existing issues
Expected behavior
The destruction of DDS entities is performed correctly
Current behavior
Access by nullptr pointer
Steps to reproduce
Developed a separate test consisting of one DataReader and one DataWriter (each runs in a separate application). On the part of the DataWriter, the partition name is changed every 2-5 ms to ensure that the discovery mechanism is triggered with a high frequency. The running DataReader is created and deleted every 10/100 ms.
Fast DDS version/commit
2.9.1
Platform/Architecture
Ubuntu Focal 20.04 amd64
Transport layer
UDPv4
Additional context
Nullptr access occurs when one of the methods onReaderMatched(...)
is called in the file EDP.cpp
//MATCHED AND ADDED CORRECTLY:
if (r.getListener() != nullptr)
{
MatchingInfo info;
info.status = MATCHED_MATCHING;
info.remoteEndpointGuid = writer_guid;
r.getListener()->onReaderMatched(&r, info);
const SubscriptionMatchedStatus& sub_info =
update_subscription_matched_status(readerGUID, writer_guid, 1);
r.getListener()->onReaderMatched(&r, sub_info);
}
Code analysis showed that the reason for the error is race condition when accessing the pointer to the RTPS listener in EDP against the background of the ~DataReaderImpl()
destructor, which performs an unsecured reset of the RTPS listener in the disable()
method:
void DataReaderImpl::disable()
{
set_listener(nullptr);
if (reader_ != nullptr)
{
reader_->setListener(nullptr);
}
}
The proposed fix is to exclude the disable()
call in the destructor so that it looks like this:
DataReaderImpl::~DataReaderImpl()
{
// assert there are no pending conditions
assert(read_conditions_.empty());
// Disable the datareader to prevent receiving data in the middle of deleting it
// disable();
stop();
delete user_datareader_;
}
In this case, all the necessary locks are provided during the execution of the RTPSParticipantImpl::deleteUserEndpoint()
method. At the same time, the test also stopped giving an execution error.
XML configuration file
No response
Relevant log output
No response
Network traffic capture
No response
Hi @i-and,
Thanks for the report.
Improvements to listener safety were done on PR #2898, merged into master after the 2.7.1 version release. Would you mind running the test again with a newer version? Version 2.7.1 (and the 2.7.x branch as a whole) is no longer a supported version (you can check the current supported releases here).
Hi @jsantiago-eProsima,
I ran a test today on version 2.9.1 - the specified error also occurs there as well as on version 2.7.1. Removing the disable()
call in DataReaderImpl::~DataReaderImpl()
also fixes the error.
Hi @i-and
The error has not been able to being replicated. The disable()
method should be called to stop listening, so removing it should not be a possible fix.
Would you mind sharing a reproducible example, so we can analyze deeply?
It turned out that the probability of this error in EDP strongly depends on the operating system used (Windows/Linux) and the number of cores in the CPU. In order to achieve a guaranteed fault, I made an additional synchronization in the code. As a result, using the examples/cpp/Filtering/ example, it was possible to stably reproduce the error.
To do this, the application ./DDSFiltering publisher is launched first, then another application ./DDSFiltering subscriber fast is launched in another terminal or under the debugger. The figure shows the result of an error on the part of the subscriber. As can be seen from the debugger panel "VARIABLES" r.mp_listener
is 0, which leads to Segmentation fault.
I also attach the modified 3 files.
EDP.cpp.txt
FilteringExampleSubscriber.cxx.txt
FilteringExamplePublisher.cxx.txt
The disable() method should be called to stop listening, so removing it should not be a possible fix.
After recreating this error, I suggest considering fixing it in the way indicated above (exclude the disable()
from ~DataReaderImpl()
). It seems to me that this method is most preferable for the following reasons:
- based on the code, the occurrence of a callback in the ~DataReaderImpl() process should be worked out correctly thanks to existing mutexes;
- new locks are not introduced, but existing ones are used;
- in the code EDP.cpp there are about ten more potentially dangerous places with the getListener() call - all of them are eliminated by this patch.
Hi @i-and,
Thanks for the attached files that helped me reproduce the issue.
As @JesusPoderoso said, the solution should not be avoiding the disable()
call.
Please, check if the issue persists after applying this patch.
Hi @i-and, Thanks for the attached files that helped me reproduce the issue. As @JesusPoderoso said, the solution should not be avoiding the
disable()
call.Please, check if the issue persists after applying this patch.
Hi @Mario-DL We also meet this bug on the v2.12 branch,this bug can be fixed? this is coredump:
use gdb show the listener is nullptr:
Hi @TechVortexZ
We have not included a fix for this yet, although we are aware of it. Could you double check if introducing a std::lock_guard<RecursiveTimedMutex> guard(r.getMutex());
here makes any difference ?
BTW, 2.12.x
branch was EOL
some months ago, we encourage you to move on to 2.14.x
. It will be also interesting if you could check whether the issue is reproducible there.
Hi @Mario-DL
- This patch is not enough, as it does not protect the following code locations in the same method
EDP::pairing_writer_proxy_with_any_local_reader()
: https://github.com/eProsima/Fast-DDS/blob/092848725b8425e4f05a8ccf7b3b8d513fabf733/src/cpp/rtps/builtin/discovery/endpoint/EDP.cpp#L1517-L1520 https://github.com/eProsima/Fast-DDS/blob/092848725b8425e4f05a8ccf7b3b8d513fabf733/src/cpp/rtps/builtin/discovery/endpoint/EDP.cpp#L1530-L1540 - It seems that it would be logical to protect similar three places in the method
EDP::pairing_reader_proxy_with_any_local_writer()
since it is structurally similar to the methodEDP::pairing_writer_proxy_with_any_local_reader()
under consideration. - I propose to implement the
on_reader_discovery()
method call in the methodsStatefulWriter::matched_reader_add()
,StatefulWriter::matched_reader_remove()
,StatelessWriter::matched_reader_add()
andStatelessWriter::matched_reader_remove()
by analogy, as it is done in the corresponding four methods of the classesStatefulReader
andStatelessReader
: namely, use the intermediate local variableReaderListener* listener
to call, which is initialized with themp_listener
value on the mutexstd::unique_lock<RecursiveTimedMutex> guard(mp_mutex)
.