
Endpoints discovery doesn't recover after Discovery server re-start [13985]

lexamor opened this issue 2 years ago · 4 comments

Is there an already existing issue for this?

  • [X] I have searched the existing issues

Expected behavior

A restart of the Discovery Server should not affect further endpoint (EDP) discovery.

Current behavior

In some cases, after the Discovery Server restarts, the endpoint discovery mechanism does not recover: newly joined entities cannot be discovered, and entity "lost" events do not work either.

Steps to reproduce

  1. Run the Discovery Server: fast-discovery-server -i 0

  2. Run at least two different processes, where each process:

  • runs on a dedicated host or inside an isolated Docker container;
  • instantiates at least two domain participants in different domains (e.g. participant_1 - domain_id=0, participant_2 - domain_id=100);
  • gives each participant a set of subscribers/readers and publishers/writers for some topics;
  • uses listener callbacks in one participant (in one domain) and WaitSets in the other;
  • instantiates (and owns) each participant in a separate thread.
  3. Reproduce the failure:
  • All instances are running (DS and a few participant processes).
  • Stop the Discovery Server (Ctrl+C) - all entities receive DS "lost" events.
  • Start the Discovery Server again.
  • Entities in the domains no longer receive endpoint discovery events. (Not always - sometimes everything keeps working, or only a single domain is affected.)
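For reference, each CLIENT participant in this setup is created roughly as follows (a minimal sketch against the Fast DDS 2.5 C++ API; the server address 127.0.0.1 and the GUID prefix shown are the defaults for fast-discovery-server -i 0 and are assumptions, not the exact values of the real setup):

```cpp
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastrtps/utils/IPLocator.h>

using namespace eprosima::fastdds::dds;
using namespace eprosima::fastrtps::rtps;

int main()
{
    DomainParticipantQos qos = PARTICIPANT_QOS_DEFAULT;

    // Discover through a Discovery Server instead of simple (multicast) discovery
    qos.wire_protocol().builtin.discovery_config.discoveryProtocol =
            DiscoveryProtocol_t::CLIENT;

    // Point the client at the server started with `fast-discovery-server -i 0`
    RemoteServerAttributes server;
    server.ReadguidPrefix("44.53.00.5f.45.50.52.4f.53.49.4d.41");  // prefix for server id 0
    Locator_t locator;
    IPLocator::setIPv4(locator, "127.0.0.1");  // assumed server address
    locator.port = 11811;                      // default Discovery Server port
    server.metatrafficUnicastLocatorList.push_back(locator);
    qos.wire_protocol().builtin.discovery_config.m_DiscoveryServers.push_back(server);

    DomainParticipant* participant =
            DomainParticipantFactory::get_instance()->create_participant(0, qos);
    return participant != nullptr ? 0 : 1;
}
```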

Fast DDS version/commit

2.5.0

Platform/Architecture

Ubuntu Focal 20.04 amd64

Transport layer

UDPv4

Additional context

Some findings from investigating the Fast DDS internals:

Since the DS InstanceHandle/GUID is pre-defined for the DS built-in endpoints, while handling the event about the DS shutdown (PDPClient::removeRemoteEndpoints), the built-in entities with the same GUID are re-created (matched_writer_remove/matched_writer_add, matched_reader_remove/matched_reader_add) and their histories are reset to the initial state.

However, in some cases a Heartbeat message (with some high sequence numbers) from the DS built-in writer arrives after the "remove" event has already been handled (PDPClient::removeRemoteEndpoints has finished).

This leads to a situation where the PDP client acknowledges (ACKNACK) sequence numbers that are not available in the server writer's history. The server considers this ACKNACK invalid and writes the warning from 'print_inconsistent_acknack': "Inconsistent acknack received. Local Writer". As a consequence, the client will not receive any discovery data from the server.

Also, because of the sequence-number inconsistency, the client will not answer any server Heartbeat, since the count of incoming server heartbeats will be greater than the last one handled (WriterProxy::process_heartbeat), and there are probably some other inconsistencies as well.
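To illustrate the mismatch (a hypothetical, self-contained sketch; this is not the actual Fast DDS code, just the shape of the check described above):

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical model: a restarted server writer starts with an empty
// history, while the client's proxy still remembers sequence numbers
// from the previous writer incarnation.
struct WriterHistory
{
    uint64_t last_sn = 0;  // history reset to empty after restart
};

bool acknack_is_consistent(const WriterHistory& writer, uint64_t acked_sn)
{
    // Acknowledging samples the writer never produced can only come from
    // stale reader state -> flagged as an inconsistent acknack.
    return acked_sn <= writer.last_sn + 1;
}

int main()
{
    WriterHistory restarted_server_writer;  // freshly restarted server
    uint64_t stale_client_sn = 42;          // SN remembered from before

    std::cout << std::boolalpha
              << acknack_is_consistent(restarted_server_writer, stale_client_sn)
              << std::endl;  // prints false -> discovery data stops flowing
}
```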

Unfortunately, attempts to make a hotfix or to implement a simple test that reliably reproduces the issue have not been successful so far.

XML configuration file

No response

Relevant log output

No response

Network traffic capture

No response

lexamor avatar Feb 28 '22 20:02 lexamor

Hi @lexamor ,

Thank you for your thorough report, it's deeply appreciated. I'm afraid what you are describing is the expected behaviour when the server's DATA[Up] is not received by a client. I'll elaborate:

  1. When a DomainParticipant (in this case the SERVER) leaves the DDS network in an orderly manner, it sends a specific RTPS message, i.e. a DATA[Up] (unregister participant). Upon reception of such a DATA[Up], the recipients (in this case the CLIENTS) clean all the data related to that DomainParticipant, an operation which includes un-matching discovery and user readers and writers, resetting reader and writer proxies, etc.
  2. In case a DomainParticipant, for any reason, leaves the DDS network without sending a DATA[Up], or said DATA[Up] is lost in the network, there is a DomainParticipant liveliness mechanism in place which is based on a lease that is asserted periodically by sending a DomainParticipant announcement (DATA[p]). If a DomainParticipant fails to renew this lease, that DomainParticipant is considered gone from the DDS network's perspective, which triggers on the recipient (CLIENT) the same operation as when a DATA[Up] is received.

With those concepts in place, imagine the following scenario (which is behind the behaviour you are experiencing):

  1. The SERVER application is closed correctly, making the SERVER send a DATA[Up] to all its CLIENTS.
  2. One of the CLIENTS does not receive the DATA[Up] (it got lost in the network for whatever reason). Thus, this particular CLIENT will consider the SERVER alive until the lease expires. This means, among other things, that the CLIENT's EDP reader for publications[^a] will think that the SERVER's EDP writer for publications has a certain latest sequence number.
  3. The SERVER is brought back up prior to the lease expiration. This eventually triggers the SERVER sending a DATA[p] to the CLIENT, which the CLIENT interprets as a lease renewal (since from the CLIENT perspective, this is the same SERVER instance, not a new one).
  4. However, now there is a mismatch between the actual state of the SERVER's EDP publications writer and what the CLIENT's EDP publications reader thinks that writer's state is. This will trigger the inconsistent acknack message, since the reader will be asking for a sample with a sequence number higher than what the writer actually has.

[^a]: I'm taking this endpoint as an example, but the same could happen with other discovery endpoints.

Fortunately, working around the "issue" simply consists in waiting for the lease to expire before bringing the SERVER back up again. By default, the DomainParticipant lease duration is 20 s, and it can be configured both using the C++ API and XML files. In your case, you're instantiating the SERVER using the CLI, so as of right now there is no way for you to configure the SERVER's lease duration. However, we are in the process of enabling XML configuration for the CLI in #2498.
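As a minimal sketch of the C++ route, this is how a DomainParticipant's lease duration can be shortened (the 5 s / 1 s values are examples only, and the helper name is illustrative; for the SERVER itself this currently requires creating it from code rather than from the CLI):

```cpp
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>

using namespace eprosima::fastdds::dds;

DomainParticipant* create_participant_with_short_lease()
{
    DomainParticipantQos qos = PARTICIPANT_QOS_DEFAULT;
    // Example values only: announce every second, and consider the
    // participant gone 5 s after its last announcement.
    qos.wire_protocol().builtin.discovery_config.leaseDuration =
            eprosima::fastrtps::Duration_t(5, 0);
    qos.wire_protocol().builtin.discovery_config.leaseDuration_announcementperiod =
            eprosima::fastrtps::Duration_t(1, 0);
    return DomainParticipantFactory::get_instance()->create_participant(0, qos);
}
```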

Please do let me know whether waiting for lease expiration before re-instantiating the server behaves as you would expect.

EduPonz avatar Mar 01 '22 06:03 EduPonz

Hi @EduPonz

Thank you for the great explanation!

However, I sometimes observe cases where the PDP routine between client and server switches into a non-functional state.

My setup: three isolated Docker containers (1 server, 2 clients with multiple participants and domains id=0, id=100).

Scenario: Fast DDS info logs are enabled and category filters are applied (SetCategoryFilter(std::regex("(EDP)|(PDP)"));).

Also, I've modified the log output to show which domain the handling participant belongs to: logInfo(RTPS_PDP_DISCOVERY, "[" << parent_pdp_->getRTPSParticipant()->get_domain_id() <<
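For reference, the logging setup is roughly the following (a sketch; enable_discovery_logging is just an illustrative name, and info-level logs must be compiled into the library, e.g. with -DLOG_NO_INFO=OFF):

```cpp
#include <fastdds/dds/log/Log.hpp>
#include <regex>

using eprosima::fastdds::dds::Log;

void enable_discovery_logging()
{
    // Show Info-level messages and keep only the discovery categories
    Log::SetVerbosity(Log::Kind::Info);
    Log::SetCategoryFilter(std::regex("(EDP)|(PDP)"));
}
```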

  1. All instances are running - everything works.
  2. Stop the Discovery Server - communication keeps working.
  • Both clients receive Server "lost" events.
  3. Start the Server again. Here the issue can appear:
  • For some reason, one participant (process) receives SPDP only for a single domain, and all further discovery in this "missed" domain is broken for that participant.
  • The other one receives SPDP messages for both domains and everything works well.

Start server command:

./fast-discovery-server -i 1 -l 172.17.0.3 -p 11811
### Server is running ###
  Participant Type:   SERVER
  Server ID:          1
  Server GUID prefix: 44.53.01.5f.45.50.52.4f.53.49.4d.41
  Server Addresses:   UDPv4:[172.17.0.3]:11811

// Start participant(s) - log for both is mainly identical

2022-03-01 20:39:42.855 [RTPS_PDP Info] [100] SPDP Message received from: 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.1.0.c2 -> Function onNewCacheChangeAdded
2022-03-01 20:39:42.855 [RTPS_PDP_DISCOVERY Info] [100] New participant 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.0.1.c1 at MTTLoc: {UNICAST:[UDPv4:[172.17.0.3]:11811]} DefLoc:{UNICAST:[UDPv4:[172.17.0.3]:7413,UDPv4:[172.18.0.2]:7413]} -> Function onNewCacheChangeAdded
2022-03-01 20:39:42.855 [RTPS_PDP Info] [0] SPDP Message received from: 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.1.0.c2 -> Function onNewCacheChangeAdded
2022-03-01 20:39:42.855 [RTPS_PDP_DISCOVERY Info] [0] New participant 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.0.1.c1 at MTTLoc: {UNICAST:[UDPv4:[172.17.0.3]:11811]} DefLoc:{UNICAST:[UDPv4:[172.17.0.3]:7413,UDPv4:[172.18.0.2]:7413]} -> Function onNewCacheChangeAdded
2022-03-01 20:39:43.194 [RTPS_EDP Info] New DPD received, adding remote endpoints to our SimpleEDP endpoints -> Function assignRemoteEndpoints
2022-03-01 20:39:43.194 [RTPS_EDP Info] Adding SEDP Pub Writer to my Pub Reader -> Function assignRemoteEndpoints
2022-03-01 20:39:43.194 [RTPS_EDP Info] Adding SEDP Pub Reader to my Pub Writer -> Function assignRemoteEndpoints
2022-03-01 20:39:43.194 [RTPS_EDP Info] Adding SEDP Sub Writer to my Sub Reader -> Function assignRemoteEndpoints
2022-03-01 20:39:43.194 [RTPS_EDP Info] Adding SEDP Sub Reader to my Sub Writer -> Function assignRemoteEndpoints
2022-03-01 20:39:43.196 [RTPS_EDP Info] New DPD received, adding remote endpoints to our SimpleEDP endpoints -> Function assignRemoteEndpoints
2022-03-01 20:39:43.196 [RTPS_EDP Info] Adding SEDP Pub Writer to my Pub Reader -> Function assignRemoteEndpoints
2022-03-01 20:39:43.196 [RTPS_EDP Info] Adding SEDP Pub Reader to my Pub Writer -> Function assignRemoteEndpoints
2022-03-01 20:39:43.196 [RTPS_EDP Info] Adding SEDP Sub Writer to my Sub Reader -> Function assignRemoteEndpoints
2022-03-01 20:39:43.196 [RTPS_EDP Info] Adding SEDP Sub Reader to my Sub Writer -> Function assignRemoteEndpoints

// Stop server

2022-03-01 21:15:41.150 [RTPS_PDP Info] [100] SPDP Message received from: 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.1.0.c2 -> Function onNewCacheChangeAdded
2022-03-01 21:15:41.150 [RTPS_PDP Info] [0] SPDP Message received from: 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.1.0.c2 -> Function onNewCacheChangeAdded
2022-03-01 21:15:41.150 [RTPS_PDP Info] 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.0.1.c1 -> Function remove_remote_participant
2022-03-01 21:15:41.150 [RTPS_PDP Info] 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.0.1.c1 -> Function remove_remote_participant
2022-03-01 21:15:41.150 [RTPS_EDP Info] For RTPSParticipant: 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.0.1.c1 -> Function removeRemoteEndpoints
2022-03-01 21:15:41.150 [RTPS_EDP Info] For RTPSParticipant: 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.0.1.c1 -> Function removeRemoteEndpoints
2022-03-01 21:15:41.150 [RTPS_EDP Info] Adding SEDP Sub Writer to my Sub Reader -> Function removeRemoteEndpoints
2022-03-01 21:15:41.150 [RTPS_EDP Info] Adding SEDP Sub Writer to my Sub Reader -> Function removeRemoteEndpoints
2022-03-01 21:15:41.150 [RTPS_EDP Info] Adding SEDP Sub Reader to my Sub Writer -> Function removeRemoteEndpoints
2022-03-01 21:15:41.150 [RTPS_EDP Info] Adding SEDP Sub Reader to my Sub Writer -> Function removeRemoteEndpoints
2022-03-01 21:15:41.150 [RTPS_PDP Info] For unmatching for server: 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.0.1.c1 -> Function removeRemoteEndpoints
2022-03-01 21:15:41.150 [RTPS_PDP Info] For unmatching for server: 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.0.1.c1 -> Function removeRemoteEndpoints
2022-03-01 21:15:41.150 [RTPS_PDP Info] [100] SPDP Message received from: 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.1.0.c2 -> Function onNewCacheChangeAdded
2022-03-01 21:15:41.150 [RTPS_PDP Info] 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.0.1.c1 -> Function remove_remote_participant
2022-03-01 21:15:41.150 [RTPS_PDP Info] [0] SPDP Message received from: 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.1.0.c2 -> Function onNewCacheChangeAdded
2022-03-01 21:15:41.150 [RTPS_PDP Info] 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.0.1.c1 -> Function remove_remote_participant
2022-03-01 20:44:21.677 [CLIENT_PDP_THREAD Info] Client 01.0f.b1.e9.80.03.6a.97.03.00.00.00|0.0.1.c1 PDP announcement -> Function event
2022-03-01 20:44:21.677 [CLIENT_PDP_THREAD Info] Client 01.0f.b1.e9.80.03.6a.97.02.00.00.00|0.0.1.c1 PDP announcement -> Function event
...
2022-03-01 20:44:29.778 [CLIENT_PDP_THREAD Info] Client 01.0f.b1.e9.80.03.6a.97.02.00.00.00|0.0.1.c1 PDP announcement -> Function event
2022-03-01 20:44:29.778 [CLIENT_PDP_THREAD Info] Client 01.0f.b1.e9.80.03.6a.97.03.00.00.00|0.0.1.c1 PDP announcement -> Function event
2022-03-01 20:44:30.228 [CLIENT_PDP_THREAD Info] Client 01.0f.b1.e9.80.03.6a.97.02.00.00.00|0.0.1.c1 PDP announcement -> Function event
2022-03-01 20:44:30.228 [CLIENT_PDP_THREAD Info] Client 01.0f.b1.e9.80.03.6a.97.03.00.00.00|0.0.1.c1 PDP announcement -> Function event

// Start server again // "broken" participant log output:

2022-03-01 20:44:30.238 [RTPS_PDP Info] [0] SPDP Message received from: 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.1.0.c2 -> Function onNewCacheChangeAdded
2022-03-01 20:44:30.238 [RTPS_PDP_DISCOVERY Info] [0] New participant 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.0.1.c1 at MTTLoc: {UNICAST:[UDPv4:[172.17.0.3]:11811]} DefLoc:{UNICAST:[UDPv4:[172.17.0.3]:7413,UDPv4:[172.18.0.2]:7413]} -> Function onNewCacheChangeAdded
// **Message received only for the single domain (id=0) participant - printed**
2022-03-01 20:44:30.678 [RTPS_EDP Info] New DPD received, adding remote endpoints to our SimpleEDP endpoints -> Function assignRemoteEndpoints
2022-03-01 20:44:30.678 [RTPS_EDP Info] Adding SEDP Pub Writer to my Pub Reader -> Function assignRemoteEndpoints
2022-03-01 20:44:30.678 [RTPS_EDP Info] Adding SEDP Pub Reader to my Pub Writer -> Function assignRemoteEndpoints
2022-03-01 20:44:30.679 [RTPS_EDP Info] Adding SEDP Sub Writer to my Sub Reader -> Function assignRemoteEndpoints
2022-03-01 20:44:30.679 [RTPS_EDP Info] Adding SEDP Sub Reader to my Sub Writer -> Function assignRemoteEndpoints
2022-03-01 20:44:30.679 [CLIENT_PDP_THREAD Info] Client 01.0f.b1.e9.80.03.6a.97.03.00.00.00|0.0.1.c1 PDP announcement -> Function event
2022-03-01 20:44:31.129 [CLIENT_PDP_THREAD Info] Client 01.0f.b1.e9.80.03.6a.97.03.00.00.00|0.0.1.c1 PDP announcement -> Function event
2022-03-01 20:44:31.579 [CLIENT_PDP_THREAD Info] Client 01.0f.b1.e9.80.03.6a.97.03.00.00.00|0.0.1.c1 PDP announcement -> Function event
... 
2022-03-01 20:44:37.880 [CLIENT_PDP_THREAD Info] Client 01.0f.b1.e9.80.03.6a.97.03.00.00.00|0.0.1.c1 PDP announcement -> Function event
2022-03-01 20:44:38.330 [CLIENT_PDP_THREAD Info] Client 01.0f.b1.e9.80.03.6a.97.03.00.00.00|0.0.1.c1 PDP announcement -> Function event

// The announcements for domain 100 never stop being sent.

// "Good" participant log output:

2022-03-01 20:44:30.238 [RTPS_PDP Info] [100] SPDP Message received from: 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.1.0.c2 -> Function onNewCacheChangeAdded
2022-03-01 20:44:30.238 [RTPS_PDP Info] [0] SPDP Message received from: 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.1.0.c2 -> Function onNewCacheChangeAdded
2022-03-01 20:44:30.238 [RTPS_PDP_DISCOVERY Info] [0] New participant 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.0.1.c1 at MTTLoc: {UNICAST:[UDPv4:[172.17.0.3]:11811]} DefLoc:{UNICAST:[UDPv4:[172.17.0.3]:7413,UDPv4:[172.18.0.2]:7413]} -> Function onNewCacheChangeAdded
2022-03-01 20:44:30.238 [RTPS_PDP_DISCOVERY Info] [100] New participant 44.53.01.5f.45.50.52.4f.53.49.4d.41|0.0.1.c1 at MTTLoc: {UNICAST:[UDPv4:[172.17.0.3]:11811]} DefLoc:{UNICAST:[UDPv4:[172.17.0.3]:7413,UDPv4:[172.18.0.2]:7413]} -> Function onNewCacheChangeAdded
2022-03-01 20:44:30.679 [RTPS_EDP Info] New DPD received, adding remote endpoints to our SimpleEDP endpoints -> Function assignRemoteEndpoints
2022-03-01 20:44:30.679 [RTPS_EDP Info] New DPD received, adding remote endpoints to our SimpleEDP endpoints -> Function assignRemoteEndpoints
2022-03-01 20:44:30.679 [RTPS_EDP Info] Adding SEDP Pub Writer to my Pub Reader -> Function assignRemoteEndpoints
2022-03-01 20:44:30.679 [RTPS_EDP Info] Adding SEDP Pub Writer to my Pub Reader -> Function assignRemoteEndpoints
2022-03-01 20:44:30.679 [RTPS_EDP Info] Adding SEDP Pub Reader to my Pub Writer -> Function assignRemoteEndpoints
2022-03-01 20:44:30.679 [RTPS_EDP Info] Adding SEDP Pub Reader to my Pub Writer -> Function assignRemoteEndpoints
2022-03-01 20:44:30.679 [RTPS_EDP Info] Adding SEDP Sub Writer to my Sub Reader -> Function assignRemoteEndpoints
2022-03-01 20:44:30.679 [RTPS_EDP Info] Adding SEDP Sub Writer to my Sub Reader -> Function assignRemoteEndpoints
2022-03-01 20:44:30.679 [RTPS_EDP Info] Adding SEDP Sub Reader to my Sub Writer -> Function assignRemoteEndpoints
2022-03-01 20:44:30.679 [RTPS_EDP Info] Adding SEDP Sub Reader to my Sub Writer -> Function assignRemoteEndpoints

// Participant works as expected

I would very much appreciate any ideas that could explain this behavior or help investigate it further.

lexamor avatar Mar 01 '22 21:03 lexamor

Hi @EduPonz , it seems (hopefully) that the root cause of the mentioned issue has been found:

While generating the message that contains the DATA(Up) sub-message followed by a HEARTBEAT, the Discovery Server stores the generated DATA in the discovery_db and tries to send it from a separate thread. In some cases this leads to the DATA(Up) being sent twice and, as a consequence, the CLIENT reader history transitions into an inconsistent state in which it can no longer communicate with the re-started SERVER because of the sequence-number inconsistency.

The following SERVER patch resolves the issue for me, but unfortunately it is not clear how this could be handled (e.g. by adding some additional checks) on the CLIENT side:

diff --git a/src/cpp/rtps/builtin/discovery/participant/PDPServer.cpp b/src/cpp/rtps/builtin/discovery/participant/PDPServer.cpp
index d4227d30c..9486f6498 100644
--- a/src/cpp/rtps/builtin/discovery/participant/PDPServer.cpp
+++ b/src/cpp/rtps/builtin/discovery/participant/PDPServer.cpp
@@ -704,20 +704,6 @@ void PDPServer::announceParticipantState(
             // Assign identity
             change->sequenceNumber = sn;
             change->write_params = std::move(wp);
-
-            // Update the database with our own data
-            if (discovery_db().update(change, ddb::DiscoveryParticipantChangeData()))
-            {
-                // Distribute
-                awake_routine_thread();
-            }
-            else
-            {
-                // Dispose if already there
-                // It may happen if the participant is not removed fast enough
-                mp_PDPWriter->release_change(change);
-                return;
-            }
         }
         else
         {

lexamor avatar Mar 15 '22 11:03 lexamor

Hi @lexamor ,

We have been introducing several changes over the last couple of weeks which affected, among other things, the Discovery Server mechanism. These changes solved issues similar to this one, so we would greatly appreciate it if you could test your scenario again with the most recent version of the library and let us know whether the issue still persists.

Cheers!

juanlofer-eprosima avatar Jul 18 '22 12:07 juanlofer-eprosima

I am going to proceed and close this issue, as it seems the question was answered, a solution was provided, and no feedback has been received.

JLBuenoLopez avatar Apr 13 '23 06:04 JLBuenoLopez