Fast-DDS icon indicating copy to clipboard operation
Fast-DDS copied to clipboard

Probabilistic exception (nullptr access to RTPSReader listener) when deleting a DataReader

Open i-and opened this issue 2 years ago • 8 comments

Is there an already existing issue for this?

  • [X] I have searched the existing issues

Expected behavior

The destruction of DDS entities is performed correctly

Current behavior

Access by nullptr pointer

Steps to reproduce

Developed a separate test consisting of one DataReader and one DataWriter (each runs in a separate application). On the part of the DataWriter, the partition name is changed every 2-5 ms to ensure that the discovery mechanism is triggered with a high frequency. The running DataReader is created and deleted every 10/100 ms.

Fast DDS version/commit

2.9.1

Platform/Architecture

Ubuntu Focal 20.04 amd64

Transport layer

UDPv4

Additional context

Nullptr access occurs when one of the methods onReaderMatched(...) is called in the file EDP.cpp

//MATCHED AND ADDED CORRECTLY:
if (r.getListener() != nullptr)
{
  MatchingInfo info;
  info.status = MATCHED_MATCHING;
  info.remoteEndpointGuid = writer_guid;
  r.getListener()->onReaderMatched(&r, info);

  const SubscriptionMatchedStatus& sub_info =
       update_subscription_matched_status(readerGUID, writer_guid, 1);
  r.getListener()->onReaderMatched(&r, sub_info);
}

Code analysis showed that the reason for the error is race condition when accessing the pointer to the RTPS listener in EDP against the background of the ~DataReaderImpl() destructor, which performs an unsecured reset of the RTPS listener in the disable() method:

void DataReaderImpl::disable()
{
    set_listener(nullptr);
    if (reader_ != nullptr)
    {
        reader_->setListener(nullptr);
    }
}

The proposed fix is to exclude the disable() call in the destructor so that it looks like this:

DataReaderImpl::~DataReaderImpl()
{
    // assert there are no pending conditions
    assert(read_conditions_.empty());

    // Disable the datareader to prevent receiving data in the middle of deleting it
    // disable();

    stop();

    delete user_datareader_;
}

In this case, all the necessary locks are provided during the execution of the RTPSParticipantImpl::deleteUserEndpoint() method. At the same time, the test also stopped giving an execution error.

XML configuration file

No response

Relevant log output

No response

Network traffic capture

No response

i-and avatar Feb 15 '23 19:02 i-and

Hi @i-and,

Thanks for the report.

Improvements to listener safety were done on PR #2898, merged into master after the 2.7.1 version release. Would you mind running the test again with a newer version? Version 2.7.1 (and the 2.7.x branch as a whole) is no longer a supported version (you can check the current supported releases here).

jsan-rt avatar Feb 20 '23 07:02 jsan-rt

Hi @jsantiago-eProsima,

I ran a test today on version 2.9.1 - the specified error also occurs there as well as on version 2.7.1. Removing the disable() call in DataReaderImpl::~DataReaderImpl() also fixes the error.

i-and avatar Feb 20 '23 20:02 i-and

Hi @i-and

The error has not been able to being replicated. The disable() method should be called to stop listening, so removing it should not be a possible fix.

Would you mind sharing a reproducible example, so we can analyze deeply?

JesusPoderoso avatar Mar 09 '23 10:03 JesusPoderoso

It turned out that the probability of this error in EDP strongly depends on the operating system used (Windows/Linux) and the number of cores in the CPU. In order to achieve a guaranteed fault, I made an additional synchronization in the code. As a result, using the examples/cpp/Filtering/ example, it was possible to stably reproduce the error. To do this, the application ./DDSFiltering publisher is launched first, then another application ./DDSFiltering subscriber fast is launched in another terminal or under the debugger. The figure shows the result of an error on the part of the subscriber. As can be seen from the debugger panel "VARIABLES" r.mp_listener is 0, which leads to Segmentation fault. fault_edp I also attach the modified 3 files. EDP.cpp.txt FilteringExampleSubscriber.cxx.txt FilteringExamplePublisher.cxx.txt

The disable() method should be called to stop listening, so removing it should not be a possible fix.

After recreating this error, I suggest considering fixing it in the way indicated above (exclude the disable() from ~DataReaderImpl()). It seems to me that this method is most preferable for the following reasons:

  • based on the code, the occurrence of a callback in the ~DataReaderImpl() process should be worked out correctly thanks to existing mutexes;
  • new locks are not introduced, but existing ones are used;
  • in the code EDP.cpp there are about ten more potentially dangerous places with the getListener() call - all of them are eliminated by this patch.

i-and avatar Mar 25 '23 17:03 i-and

Hi @i-and, Thanks for the attached files that helped me reproduce the issue. As @JesusPoderoso said, the solution should not be avoiding the disable() call.

Please, check if the issue persists after applying this patch.

Mario-DL avatar Apr 25 '23 09:04 Mario-DL

Hi @i-and, Thanks for the attached files that helped me reproduce the issue. As @JesusPoderoso said, the solution should not be avoiding the disable() call.

Please, check if the issue persists after applying this patch.

Hi @Mario-DL We also meet this bug on the v2.12 branch,this bug can be fixed? this is coredump:

image

use gdb show the listener is nullptr: image

TechVortexZ avatar Jul 02 '24 01:07 TechVortexZ

Hi @TechVortexZ We have not included a fix for this yet, although we are aware of it. Could you double check if introducing a std::lock_guard<RecursiveTimedMutex> guard(r.getMutex()); here makes any difference ? BTW, 2.12.x branch was EOL some months ago, we encourage you to move on to 2.14.x. It will be also interesting if you could check whether the issue is reproducible there.

Mario-DL avatar Jul 02 '24 05:07 Mario-DL

Hi @Mario-DL

  1. This patch is not enough, as it does not protect the following code locations in the same method EDP::pairing_writer_proxy_with_any_local_reader(): https://github.com/eProsima/Fast-DDS/blob/092848725b8425e4f05a8ccf7b3b8d513fabf733/src/cpp/rtps/builtin/discovery/endpoint/EDP.cpp#L1517-L1520 https://github.com/eProsima/Fast-DDS/blob/092848725b8425e4f05a8ccf7b3b8d513fabf733/src/cpp/rtps/builtin/discovery/endpoint/EDP.cpp#L1530-L1540
  2. It seems that it would be logical to protect similar three places in the method EDP::pairing_reader_proxy_with_any_local_writer() since it is structurally similar to the method EDP::pairing_writer_proxy_with_any_local_reader() under consideration.
  3. I propose to implement the on_reader_discovery() method call in the methods StatefulWriter::matched_reader_add(), StatefulWriter::matched_reader_remove(), StatelessWriter::matched_reader_add() and StatelessWriter::matched_reader_remove() by analogy, as it is done in the corresponding four methods of the classes StatefulReader and StatelessReader: namely, use the intermediate local variable ReaderListener* listener to call, which is initialized with the mp_listener value on the mutex std::unique_lock<RecursiveTimedMutex> guard(mp_mutex).

i-and avatar Jul 16 '24 21:07 i-and