
The subscriber cannot receive data normally [12226]

Open libfsw opened this issue 3 years ago • 13 comments

Expected Behavior

Current Behavior

In one case, the publisher had been sending data and the subscriber was receiving it. After the subscriber process was suddenly closed and then started again, the subscriber could no longer receive data normally.

Steps to Reproduce

System information

  • Fast-RTPS version:
  • OS:win7-x64
  • Network interfaces:
  • ROS2:

Additional context

Additional resources

  • Wireshark capture
  • XML profiles file

libfsw avatar Jul 18 '21 16:07 libfsw

In one case, the publisher had been sending data and the subscriber was receiving it. After the subscriber process was suddenly closed and then started again, the subscriber could no longer receive data normally.

Which case? How to reproduce?

With so little information there's not much we can do to help.

MiguelCompany avatar Jul 20 '21 05:07 MiguelCompany

  • Fast-RTPS version: 2.3.3
  • OS: win7-w32 / win10-win32
  • demo.zip (attached)

libfsw avatar Jul 25 '21 13:07 libfsw

When I suddenly stopped the subscriber while it was communicating, and then started the subscriber again, I found that it could not communicate properly.

libfsw avatar Jul 25 '21 13:07 libfsw

Sometimes the subscription matches but no data is communicated; sometimes the subscription does not match at all.

libfsw avatar Jul 25 '21 14:07 libfsw

When I publish multiple topics from one program, no problem appears the first time I start the subscription program. But after I close the subscription program and start it again, I may find that some topic subscriptions fail. After analysis, I found that if the subscriber program is suddenly closed while the publisher and subscriber are communicating, the subscriber will fail to subscribe the next time it is started.

libfsw avatar Jul 25 '21 14:07 libfsw

  1. Why does the above situation exist? What caused this?
  2. How can I successfully subscribe again after the program exits abnormally?

libfsw avatar Jul 25 '21 15:07 libfsw

IDL:

struct SimMessage
{
    unsigned long id;
    unsigned long long time_ms;
    string dest;
    string src;
    string type;
    string subtype;
    sequence<octet, 24000> data;
};

libfsw avatar Jul 25 '21 15:07 libfsw

I found that the following code significantly reduces the probability of this failure, but the failure still occurs after many attempts. I don't know the reason; can you help me?

pqos.wire_protocol().builtin.readerPayloadSize = 1024 * 1024 * 1;
pqos.wire_protocol().builtin.readerHistoryMemoryPolicy = DYNAMIC_REUSABLE_MEMORY_MODE;
pqos.wire_protocol().builtin.writerPayloadSize = 1024 * 5;
pqos.wire_protocol().builtin.writerHistoryMemoryPolicy = DYNAMIC_REUSABLE_MEMORY_MODE;
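For context, a minimal sketch of where these builtin settings would be applied when creating the participant (assuming Fast DDS 2.3.x headers and domain 0; the type registration, topics, and listeners are omitted and not part of this thread's code):

```cpp
#include <fastdds/dds/domain/DomainParticipant.hpp>
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>

using namespace eprosima::fastdds::dds;
using eprosima::fastrtps::rtps::DYNAMIC_REUSABLE_MEMORY_MODE;

int main()
{
    // Start from the default participant QoS and adjust the builtin
    // (discovery) endpoints' payload sizes and history memory policy.
    DomainParticipantQos pqos = PARTICIPANT_QOS_DEFAULT;
    pqos.wire_protocol().builtin.readerPayloadSize = 1024 * 1024 * 1;
    pqos.wire_protocol().builtin.readerHistoryMemoryPolicy = DYNAMIC_REUSABLE_MEMORY_MODE;
    pqos.wire_protocol().builtin.writerPayloadSize = 1024 * 5;
    pqos.wire_protocol().builtin.writerHistoryMemoryPolicy = DYNAMIC_REUSABLE_MEMORY_MODE;

    DomainParticipant* participant =
            DomainParticipantFactory::get_instance()->create_participant(0, pqos);
    if (participant == nullptr)
    {
        return 1;
    }

    // ... register types, create topics, publishers, subscribers here ...

    DomainParticipantFactory::get_instance()->delete_participant(participant);
    return 0;
}
```

This only shows the plumbing around the QoS fragment above; it is a configuration sketch, not a full reproducer.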

libfsw avatar Jul 27 '21 16:07 libfsw

@libfsw Thank you for the additional information and the demo code.

If modifying the configuration of the builtin protocols is making things work better, it should then be related to the discovery / matching of either participants or endpoints.

I will take a look and see if I can reproduce.

MiguelCompany avatar Jul 28 '21 05:07 MiguelCompany

Hey @libfsw and @MiguelCompany, the same problem also frequently occurs on my system (Win10 64-bit, Fast DDS 2.3.0 and 2.3.3, all nodes on the same machine). I have a bunch of publishers and subscribers running and want to regularly inspect the topics with my ImageViewer node. If I start all nodes at roughly the same time, the ImageViewer node is able to find all participants and can receive messages. After some random amount of time and a number of restarts of the viewer, it is still able to find the other participants, and sometimes even their topics, but it does not receive any messages.

I have used the DDS/HelloWorldExample project for testing, made it publish infinitely and added a custom DomainParticipantListener to be able to see what has been found by the Participant.

I did the following tests:

  1. Making the subscriber crash after n messages by printing the value behind a nullptr (directly after taking the sample from the reader), then restarting the subscriber. Both nodes found each other, found the publisher, and matched, but the subscriber node does not receive anything. The only solution is to also restart the publisher.
  2. Continuously starting the subscriber and closing it after the first messages have been received. It takes 10-150 attempts until I get the same effect as crashing the subscriber; sometimes it still works after 150 attempts, which makes it difficult to reproduce.
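The crash test in item 1 could be sketched roughly as follows (a hypothetical fragment, not the actual HelloWorldExample code: `HelloWorld`, `SubListener`, the member counter, and the crash threshold are all assumptions made for illustration):

```cpp
// Sketch of a DataReaderListener callback that deliberately crashes the
// subscriber process after a fixed number of samples, to simulate an
// abnormal exit while publisher and subscriber are communicating.
void SubListener::on_data_available(
        eprosima::fastdds::dds::DataReader* reader)
{
    HelloWorld sample;                         // assumed IDL-generated type
    eprosima::fastdds::dds::SampleInfo info;

    if (reader->take_next_sample(&sample, &info) == ReturnCode_t::RETCODE_OK
            && info.valid_data)
    {
        if (++samples_received_ >= 10)         // arbitrary crash threshold
        {
            int* crash = nullptr;
            std::cout << *crash;               // deliberate nullptr dereference
        }
    }
}
```

After the crash, restarting the subscriber binary and watching the match/unmatch callbacks is enough to observe whether data flow resumes.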

I have experimented with the discovery config settings, but that did not fix the issue and only reduced its occurrence.

PFrieling avatar Aug 11 '21 13:08 PFrieling

I also hit the same issue with 2.5.0: if the subscriber crashes or asserts and is then restarted, it can't receive any topic events anymore. Has anyone fixed it?

MikeChie avatar Feb 14 '22 10:02 MikeChie

My current solution is to disable shared memory and only allow the TCPv4 and UDPv4 transport descriptor:

fastdds::dds::DomainParticipantQos pqos;

// Set your other DomainParticipant QoS settings here...

// Disable the builtin transports (which include shared memory) and
// register TCPv4 and UDPv4 transports explicitly instead.
pqos.transport().use_builtin_transports = false;
auto tcp_descriptor = std::make_shared<fastdds::rtps::TCPv4TransportDescriptor>();
pqos.transport().user_transports.push_back(tcp_descriptor);
auto udp_transport = std::make_shared<fastdds::rtps::UDPv4TransportDescriptor>();
pqos.transport().user_transports.push_back(udp_transport);
participant = fastdds::dds::DomainParticipantFactory::get_instance()->create_participant(
        0, pqos, this, fastdds::dds::StatusMask::none());

With this solution, all my participants always reconnect and nothing gets corrupted. I did not see any performance degradation after disabling shared memory, so I'm happy with this setting.

PFrieling avatar Feb 14 '22 12:02 PFrieling

Same issue here, with ToT or the 2.6.0 version. I have to use shared memory; I tested with the suggestion above (many thanks!) and even shared memory works, though with some dropped messages/buffers.

keith4ever avatar May 12 '22 17:05 keith4ever

Recently, several improvements related to SHM reconnection have been merged (#3639, #3640, and #3642). @libfsw, @PFrieling and @keith4ever could you please check if the issue reported has been fixed/mitigated?

JLBuenoLopez avatar Jul 20 '23 07:07 JLBuenoLopez

According to our CONTRIBUTING.md guidelines, I am closing this issue due to inactivity. Please, feel free to reopen it if necessary.

Mario-DL avatar Aug 29 '23 11:08 Mario-DL