Subscriber does not include locators for new interfaces in the list of subscriptions [14659]

Open jsnykan opened this issue 2 years ago • 2 comments

Is there an already existing issue for this?

  • [X] I have searched the existing issues

Expected behavior

When using the dynamic network interfaces feature (https://fast-dds.docs.eprosima.com/en/latest/fastdds/use_cases/dynamic_network_interfaces/dynamic_network_interfaces.html), the Fast-DDS participant should successfully subscribe to and receive data from a publisher via a newly added network interface.

Current behavior

The Fast-DDS participants see each other, but the subscriber does not receive topic messages from the publisher, because the subscriber does not include a locator for the newly added interface in its list of subscriptions.

Steps to reproduce

Use the attached project (example_project.tar.gz).

example_project.tar.gz

Project description:

Publisher (pub executable) publishes one topic ("HelloWorldTopic") whose data is a single uint32_t, and writes data to this topic every 100 ms using a DataWriter. It also calls DomainParticipant::set_qos() every 100 ms.

Subscriber (sub executable) subscribes to one topic ("HelloWorldTopic", the same topic that pub publishes) and periodically tries to read topic data with a DataReader. If there is no data, it calls DomainParticipant::set_qos() and Subscriber::set_qos(). (Calling Subscriber::set_qos() is part of the hack I describe in the "Additional context" section; this issue occurs without that call as well.)

I have the following test setup:

  • Publisher (pub executable) runs on computer1 (IP address 192.168.20.160)
  • Subscriber (sub executable) runs on computer2 (IP address 192.168.20.12)
  • The network cables of computer1 and computer2 are connected to the same dumb switch

Steps to reproduce:

  1. run pub executable on computer1
  2. disconnect network cable from computer2
  3. run sub executable on computer2
  4. reconnect network cable to computer2

Note that if pub and sub are switched to different computers so that pub is run on computer2 and sub on computer1 and similar steps are run (network cable is disconnected+reconnected on computer2), then the feature seems to work as documented without any modifications needed to Fast-DDS library (and no extra Subscriber::set_qos() call needed, either).

Fast DDS version/commit

Same problem exists with both v2.6.0 and master (c8a9f196517d4)

Platform/Architecture

Ubuntu Focal 20.04 amd64

Transport layer

UDPv4

Additional context

I captured network traffic while doing the steps listed in "steps to reproduce". This network capture is attached to this issue (tcpdump_without_fix.zip).

tcpdump_without_fix.zip

Based on that capture, it looks to me as if computer2 (subscriber, 192.168.20.12) sends participant info (DATA(p)) to computer1 (publisher, 192.168.20.160) that includes the IP address of the new interface (192.168.20.12) in the list of unicast locators (capture packet 2). However, when computer2 later sends its subscriptions (DATA(r)) to computer1, that IP address is missing from the list of unicast locators (capture packet 12). When I noticed this, I checked where those IP addresses might come from. It seems that e.g. ReaderProxyData::writeToCDRMessage() is called only at initialization time, so if computer2 has no interface with address 192.168.20.12 at that point, the address is not written to the CDR message.

Then I investigated what happens when DomainParticipant::set_qos() is called and noticed that, while RTPSParticipantImpl::update_attributes() updates a lot of things, it does not touch the objects in m_userWriterList and m_userReaderList. I added my own code that calls RTPSParticipantImpl::createAndAssociateReceiversWithEndPoint() for those objects, mimicking the way it is called when readers and writers are created. This seems to work if my test program (sub executable) also calls Subscriber::set_qos() in addition to DomainParticipant::set_qos(). I have attached my hack for the RTPSParticipantImpl class (eprosima-dynamic-networks-hack.zip) to this issue, as well as a new tcpdump capture taken with the hack applied (tcpdump_with_fix.zip). I am quite certain that my hack is not the correct way to fix this, but it seems to work at least with my simple test project.
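To make the suspected mechanism concrete, here is a minimal stdlib-only model of it. This is not Fast DDS code: ParticipantSketch, EndpointSketch, and the propagate flag are hypothetical names invented for illustration, and the model assumes that a reader copies the participant's unicast locators once, when it is created.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in types; none of these are Fast DDS classes.
using LocatorList = std::vector<std::string>;

struct EndpointSketch
{
    LocatorList unicast;  // what the endpoint would announce in DATA(r)
};

class ParticipantSketch
{
public:
    explicit ParticipantSketch(LocatorList initial)
        : unicast_(std::move(initial))
    {
    }

    // Assumption: the reader copies the participant's locators at creation.
    std::size_t create_reader()
    {
        readers_.push_back(EndpointSketch{unicast_});
        return readers_.size() - 1;
    }

    // Models the refresh triggered by set_qos(). With propagate == false only
    // the participant-level list changes (the behavior reported here); with
    // propagate == true existing readers are refreshed too (the hack's idea).
    void update_attributes(const LocatorList& current, bool propagate)
    {
        unicast_ = current;
        if (propagate)
        {
            for (EndpointSketch& reader : readers_)
            {
                reader.unicast = unicast_;
            }
        }
    }

    const LocatorList& participant_locators() const
    {
        return unicast_;
    }

    const LocatorList& reader_locators(std::size_t index) const
    {
        assert(index < readers_.size());
        return readers_[index].unicast;
    }

private:
    LocatorList unicast_;
    std::vector<EndpointSketch> readers_;
};
```

In this model, a reader created while no interface is up keeps an empty locator list after update_attributes(..., false), even though the participant list already contains 192.168.20.12. That matches what the capture shows for DATA(p) versus DATA(r). Only the propagate == true path refreshes the reader.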

eprosima-dynamic-networks-hack.zip tcpdump_with_fix.zip

XML configuration file

No response

Relevant log output

No response

Network traffic capture

No response

jsnykan, May 11 '22 12:05

When I tested my hack with our software which has one process using fast-dds on computer1, and multiple processes using fast-dds on computer2, I noticed the following issues:

  • When a new network is detected and DomainParticipant::set_qos() is called, RTPSParticipantImpl fetches new metatraffic and default unicast locators in the update_attributes() function. The problem is that all the new UDPv4 locators have default ports, which are not correct if another process is already using those ports, as happens on computer2. I added code that sets the correct ports on the new UDPv4 locators (the correct port can be queried from any of the existing UDPv4 locators).

  • When the network cable is connected, the participant on computer1 can announce itself to computer2 before the new network interface has an IP address, because fast-dds also sends participant info via multicast UDP messages. When computer1 then gets the IP address for the new interface and sends updated participant info to computer2, computer2 assigns remote endpoints for the new computer1 IP address only if static EDP is used. In our case static EDP is not an option, so I modified PDPListener.cpp so that remote endpoints are always assigned when a participant update is received.
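The port problem in the first item can be illustrated with the well-known port formulas from the RTPS specification (PB = 7400, DG = 250, PG = 2, d1 = 10, d3 = 11). The sketch below is stdlib-only and is not taken from the attached hack: UdpLocatorSketch and adopt_existing_port are hypothetical names, and it assumes that a second process on the same host ends up with participant ID 1.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Default RTPS well-known port parameters.
constexpr std::uint32_t kPortBase = 7400;           // PB
constexpr std::uint32_t kDomainGain = 250;          // DG
constexpr std::uint32_t kParticipantGain = 2;       // PG
constexpr std::uint32_t kOffsetMetaUnicast = 10;    // d1
constexpr std::uint32_t kOffsetUserUnicast = 11;    // d3

// Metatraffic (discovery) unicast port: PB + DG * domain + d1 + PG * participant.
std::uint32_t metatraffic_unicast_port(std::uint32_t domain_id, std::uint32_t participant_id)
{
    return kPortBase + kDomainGain * domain_id + kOffsetMetaUnicast
           + kParticipantGain * participant_id;
}

// User traffic unicast port: PB + DG * domain + d3 + PG * participant.
std::uint32_t user_unicast_port(std::uint32_t domain_id, std::uint32_t participant_id)
{
    return kPortBase + kDomainGain * domain_id + kOffsetUserUnicast
           + kParticipantGain * participant_id;
}

// Hypothetical locator type, not the Fast DDS Locator_t.
struct UdpLocatorSketch
{
    std::string address;
    std::uint32_t port;
};

// Idea of the workaround: new locators take the port that an existing
// locator of the same kind already uses, instead of the default port.
void adopt_existing_port(const std::vector<UdpLocatorSketch>& existing,
                         std::vector<UdpLocatorSketch>& fresh)
{
    assert(!existing.empty());
    for (UdpLocatorSketch& locator : fresh)
    {
        locator.port = existing.front().port;
    }
}
```

Under these assumptions, in domain 0 the first process on a host listens for user traffic on 7411 and the second on 7413. A new locator for the second process that is created with the default 7411 therefore points at the first process, which is why copying the port from an existing locator helps.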

jsnykan, May 23 '22 12:05

Thanks for your detailed report @jsnykan

We are aware that the dynamic network interfaces feature is not yet fully functional. #2421 already reported a similar issue. The testing we have done so far indicates that launching Fast DDS with no network interface behaves differently from launching it with another network interface enabled from the beginning (even if no communication is established over that interface). The first case does not work as of now, and the cause is probably one of those you mention in your report. We have this issue on our roadmap and we will try to solve it soon. We will keep you posted!

JLBuenoLopez, May 23 '22 12:05