Fast-DDS
[Shared Memory] Subscriber won't reconnect after crash under specific circumstances.
Is there an already existing issue for this?
- [X] I have searched the existing issues
Expected behavior
Subscriber should receive data after unexpected restart.
Current behavior
Subscriber won't reconnect after crash.
Steps to reproduce
- Open two consoles.
- In the first console, launch ./BasicConfigurationExample publisher --transport=shm.
- In the second console, run ./BasicConfigurationExample subscriber --transport=shm under GDB, then quit GDB unexpectedly so that the subscriber is killed.
- Rerun the subscriber; no data is received.
Full log in the second console:
┌──(s1nh㉿s1nh-ThinkPad)-[~/…/examples/cpp/dds/BasicConfigurationExample]
└─$ ./BasicConfigurationExample  subscriber --transport=shm                                                                                                                                                                                                                130 ⨯
Subscriber running. Please press CTRL+C to stop the Subscriber.
Subscriber matched.
Message HelloWorld  246 RECEIVED
Message HelloWorld  247 RECEIVED
Message HelloWorld  248 RECEIVED
Message HelloWorld  249 RECEIVED
Message HelloWorld  250 RECEIVED
^CSIGINT received, stopping Subscriber execution.
                                                                                                                                                                                                                                                                                 
┌──(s1nh㉿s1nh-ThinkPad)-[~/…/examples/cpp/dds/BasicConfigurationExample]
└─$ gdb BasicConfigurationExample                           
GNU gdb (Ubuntu 12.0.90-0ubuntu1) 12.0.90
...
Reading symbols from BasicConfigurationExample...
(gdb) b BasicConfigurationSubscriber.cpp:99
Breakpoint 1 at 0x5f599: file /home/s1nh/project/Fast-DDS/examples/cpp/dds/BasicConfigurationExample/BasicConfigurationSubscriber.cpp, line 99.
(gdb) r subscriber --transport=shm
Starting program: /home/s1nh/project/Fast-DDS/cmake-build-debug/examples/cpp/dds/BasicConfigurationExample/BasicConfigurationExample subscriber --transport=shm
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff605f640 (LWP 51986)]
[New Thread 0x7ffff57d7640 (LWP 51987)]
[New Thread 0x7ffff4fd6640 (LWP 51988)]
[New Thread 0x7ffff47d5640 (LWP 51989)]
[New Thread 0x7ffff3fd4640 (LWP 51990)]
[New Thread 0x7ffff3733640 (LWP 51991)]
Thread 1 "BasicConfigurat" hit Breakpoint 1, HelloWorldSubscriber::init (this=0x7fffffffda50, topic_name="HelloWorldTopic", max_messages=0, domain=0, transport=SHM, reliable=false, transient=false) at /home/s1nh/project/Fast-DDS/examples/cpp/dds/BasicConfigurationExample/BasicConfigurationSubscriber.cpp:99
99	    if (participant_ == nullptr)
(gdb) q
A debugging session is active.
	Inferior 1 [process 51983] will be killed.
Quit anyway? (y or n) y
                                                                                                                                                                                                                                                                                 
┌──(s1nh㉿s1nh-ThinkPad)-[~/…/examples/cpp/dds/BasicConfigurationExample]
└─$ ./BasicConfigurationExample  subscriber --transport=shm
Subscriber running. Please press CTRL+C to stop the Subscriber.
<Nothing received>
Fast DDS version/commit
since https://github.com/eProsima/Fast-DDS/commit/e58dcb17e6732b827b18056520c1137b807d800f
Platform/Architecture
Ubuntu Focal 20.04 / 22.04 amd64
Transport layer
Shared Memory
https://github.com/eProsima/Fast-DDS/issues/2003#issuecomment-1160245640
https://github.com/boostorg/interprocess/commit/7c8893788d3823615d6ddbfd125d2ad153817111
More information below; I hope this gets some attention.
The problem appears since https://github.com/eProsima/Fast-DDS/commit/e58dcb17e6732b827b18056520c1137b807d800f (in the squashed commit "Refs #8250. Do not reuse zombie ports structures.", authored by @adolfomarver and committed by @MiguelCompany on May 12, 2020).
While the subscriber is paused, the publisher shows:
2022-09-06 14:10:47.049 [RTPS_TRANSPORT_SHM Warning] SHM Port 7413 failure: the port is marked as not ok! -> Function try_push
2022-09-06 14:10:47.049 [RTPS_TRANSPORT_SHM Warning] (ID:140091162883648) Existing Port 7413 (5f4eeb5613a33705) NOT Healthy. -> Function open_port_internal
2022-09-06 14:10:47.049 [RTPS_TRANSPORT_SHM Warning] (ID:140091162883648) Port 7413 (5f4eeb5613a33705) Removed. -> Function open_port_internal
Message: HelloWorld with index: 157 SENT
2022-09-06 14:10:47.246 [RTPS_TRANSPORT_SHM Warning] SHM Port 7412 failure: the port is marked as not ok! -> Function try_push
2022-09-06 14:10:47.248 [RTPS_TRANSPORT_SHM Warning] (ID:140091402036800) Existing Port 7412 (3e1fa9f9eb2b0ade) NOT Healthy. -> Function open_port_internal
2022-09-06 14:10:47.248 [RTPS_TRANSPORT_SHM Warning] (ID:140091402036800) Port 7412 (3e1fa9f9eb2b0ade) Removed. -> Function open_port_internal
After quitting and restarting the subscriber, no data is received. The subscriber shows:
2022-09-07 14:50:27.441 [RTPS_TRANSPORT_SHM Warning] (ID:140737325877056) Port 7414 Zombie. Reset the port -> Function open_port_internal
2022-09-07 14:50:27.442 [RTPS_TRANSPORT_SHM Warning] (ID:140737325877056) Port 7415 Zombie. Reset the port -> Function open_port_internal
Before that commit, the subscriber worked well.
Modifying this code from
https://github.com/eProsima/Fast-DDS/blob/293ff25b2233344bd7f1ec4600674c73af731787/src/cpp/utils/shared_memory/RobustExclusiveLock.hpp#L170-L184
to
//        if (fd != -1)
//        {
//            *was_lock_created = false;
//        }
//        else
//        {
            *was_lock_created = true;
            fd = open(file_path.c_str(), O_CREAT | O_RDONLY, 0666);
//        }
will temporarily avoid the problem
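To make the effect of that change concrete, here is a minimal sketch that only reads the snippet above literally. The names `open_original` and `open_patched` are hypothetical, and the initial `open(..., O_RDONLY, 0)` probe is an assumption about the lines just before the quoted range; this is not the Fast DDS implementation. It shows that, with the branch commented out, `was_lock_created` is reported as true even when the file already existed.

```cpp
#include <cassert>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Hypothetical stand-in for the logic before the change: probe for an
// existing file first (assumed step), and create it only if the probe fails.
int open_original(const std::string& file_path, bool* was_lock_created)
{
    assert(was_lock_created != nullptr);
    int fd = open(file_path.c_str(), O_RDONLY, 0);  // assumed probe step
    if (fd != -1)
    {
        *was_lock_created = false;  // file was already there
    }
    else
    {
        *was_lock_created = true;   // file had to be created
        fd = open(file_path.c_str(), O_CREAT | O_RDONLY, 0666);
    }
    return fd;
}

// Hypothetical stand-in for the logic after the change: the "already exists"
// branch is gone, so the caller is always told that the lock file is new.
int open_patched(const std::string& file_path, bool* was_lock_created)
{
    assert(was_lock_created != nullptr);
    *was_lock_created = true;
    return open(file_path.c_str(), O_CREAT | O_RDONLY, 0666);
}
```

Under these assumptions, any caller that relies on `was_lock_created == false` to detect a file left by another process no longer takes that path after the change.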
@duchengyao Thank you very much for taking the time to investigate this ...
I think the issue will arise if the process crashes between opening and locking the file.
We would need to think of an atomic way to perform both things at the same time, as is done on Windows above in the code.
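As a rough illustration of the open-then-lock sequence being discussed, the following sketch uses plain POSIX calls (`open` with `O_CREAT | O_EXCL`, `flock`, `fstat`/`stat`). The function name `open_and_lock` and the retry limit are hypothetical, and this is not the code from `RobustExclusiveLock.hpp`. It makes the "did I create the file?" answer atomic and re-checks that the locked file is still the one on disk, which is a common lock-file pattern rather than a truly atomic open-and-lock.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <string>
#include <sys/file.h>
#include <sys/stat.h>
#include <unistd.h>

// Returns a locked descriptor, or -1 if the lock is held elsewhere or on error.
int open_and_lock(const std::string& file_path, bool* was_lock_created)
{
    assert(was_lock_created != nullptr);
    for (int attempt = 0; attempt < 8; ++attempt)  // hypothetical retry limit
    {
        // O_EXCL lets the kernel decide atomically whether we created the file.
        int fd = open(file_path.c_str(), O_CREAT | O_EXCL | O_RDONLY, 0666);
        *was_lock_created = (fd != -1);
        if (fd == -1)
        {
            if (errno != EEXIST)
            {
                return -1;
            }
            fd = open(file_path.c_str(), O_RDONLY);
            if (fd == -1)
            {
                if (errno == ENOENT)
                {
                    continue;  // removed between the two opens; try again
                }
                return -1;
            }
        }
        // Non-blocking exclusive lock; the kernel drops it if the holder dies.
        if (flock(fd, LOCK_EX | LOCK_NB) != 0)
        {
            close(fd);
            return -1;
        }
        // Make sure the path still names the file we actually locked.
        struct stat ours {};
        struct stat on_disk {};
        if (fstat(fd, &ours) == 0 && stat(file_path.c_str(), &on_disk) == 0 &&
                ours.st_ino == on_disk.st_ino && ours.st_dev == on_disk.st_dev)
        {
            return fd;
        }
        close(fd);  // file was unlinked or replaced meanwhile; try again
    }
    return -1;
}
```

This sketch does not remove the window itself: a file left behind by a process that died after creating it but before locking it is still reported as pre-existing, so the caller would still have to decide how to treat an existing file whose lock turned out to be free.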
Thanks for the reply. I'd like to integrate Fast DDS into my project. Will there be any side effects if I comment out the code above? Regards,
@MiguelCompany
We would need to think of an atomic way to perform both things at the same time, as is done on Windows above in the code.
I have tried on Windows 11; unfortunately, the same issue occurs. Even worse, the subscriber won't reconnect after the publisher restarts.
@MiguelCompany
Is there a plan for which version will resolve this issue? I have a similar problem.
@duchengyao @zyqhhhd Would you mind checking with the latest release v2.9.1?
Additionally, could you check if using fastdds shm clean before restarting the application that crashed solves the issue?
I have tested running fastdds shm clean beforehand. It doesn't work.
The latest version does not work either. I recorded a video: at 00:54 the subscriber is not working, and it stays that way until the publisher is restarted.
Modifying this code from
https://github.com/eProsima/Fast-DDS/blob/master/src/cpp/utils/shared_memory/RobustExclusiveLock.hpp#L170-L184
to
//        if (fd != -1)
//        {
//            *was_lock_created = false;
//        }
//        else
//        {
            *was_lock_created = true;
            fd = open(file_path.c_str(), O_CREAT | O_RDONLY, 0666);
//        }
will temporarily avoid the problem
I also made a video.
In my branch: https://github.com/duchengyao/Fast-DDS/tree/temporarily-avoid-deadlock
And this method works in the latest version.
Hope it helps you.
Hi @duchengyao
The latest Fast DDS release includes some fixes to improve the behavior of SHM reconnections (#3639, #3640, and #3642). Could you check again against Fast DDS v2.11.1? These fixes are being backported to the Fast DDS alive branches.
This issue has been closed, but it does NOT mean that the problem has been solved.
I'm no longer using the SHM transport and don't have time to test it right now. The test procedure is described clearly above, and anyone can run it themselves.