
[Shared Memory] Subscriber won't reconnect after crash under specific circumstances.

Open duchengyao opened this issue 3 years ago • 7 comments

Is there an already existing issue for this?

  • [X] I have searched the existing issues

Expected behavior

The subscriber should receive data again when it is restarted after an unexpected exit.

Current behavior

Subscriber won't reconnect after crash.

Steps to reproduce

  1. Open two consoles.
  2. In the first console, launch ./BasicConfigurationExample publisher --transport=shm.
  3. In the second console, run ./BasicConfigurationExample subscriber --transport=shm under GDB, then quit GDB so that the subscriber is killed unexpectedly.
  4. Rerun the subscriber: no data is received.

Full log in the second console:

┌──(s1nh㉿s1nh-ThinkPad)-[~/…/examples/cpp/dds/BasicConfigurationExample]
└─$ ./BasicConfigurationExample  subscriber --transport=shm
Subscriber running. Please press CTRL+C to stop the Subscriber.
Subscriber matched.
Message HelloWorld  246 RECEIVED
Message HelloWorld  247 RECEIVED
Message HelloWorld  248 RECEIVED
Message HelloWorld  249 RECEIVED
Message HelloWorld  250 RECEIVED
^CSIGINT received, stopping Subscriber execution.
                                                                                                                                                                                                                                                                                 
┌──(s1nh㉿s1nh-ThinkPad)-[~/…/examples/cpp/dds/BasicConfigurationExample]
└─$ gdb BasicConfigurationExample                           
GNU gdb (Ubuntu 12.0.90-0ubuntu1) 12.0.90
...
Reading symbols from BasicConfigurationExample...
(gdb) b BasicConfigurationSubscriber.cpp:99
Breakpoint 1 at 0x5f599: file /home/s1nh/project/Fast-DDS/examples/cpp/dds/BasicConfigurationExample/BasicConfigurationSubscriber.cpp, line 99.
(gdb) r subscriber --transport=shm
Starting program: /home/s1nh/project/Fast-DDS/cmake-build-debug/examples/cpp/dds/BasicConfigurationExample/BasicConfigurationExample subscriber --transport=shm
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff605f640 (LWP 51986)]
[New Thread 0x7ffff57d7640 (LWP 51987)]
[New Thread 0x7ffff4fd6640 (LWP 51988)]
[New Thread 0x7ffff47d5640 (LWP 51989)]
[New Thread 0x7ffff3fd4640 (LWP 51990)]
[New Thread 0x7ffff3733640 (LWP 51991)]

Thread 1 "BasicConfigurat" hit Breakpoint 1, HelloWorldSubscriber::init (this=0x7fffffffda50, topic_name="HelloWorldTopic", max_messages=0, domain=0, transport=SHM, reliable=false, transient=false) at /home/s1nh/project/Fast-DDS/examples/cpp/dds/BasicConfigurationExample/BasicConfigurationSubscriber.cpp:99
99	    if (participant_ == nullptr)
(gdb) q
A debugging session is active.

	Inferior 1 [process 51983] will be killed.

Quit anyway? (y or n) y
                                                                                                                                                                                                                                                                                 
┌──(s1nh㉿s1nh-ThinkPad)-[~/…/examples/cpp/dds/BasicConfigurationExample]
└─$ ./BasicConfigurationExample  subscriber --transport=shm
Subscriber running. Please press CTRL+C to stop the Subscriber.

<Nothing received>

Fast DDS version/commit

Since commit https://github.com/eProsima/Fast-DDS/commit/e58dcb17e6732b827b18056520c1137b807d800f

Platform/Architecture

Ubuntu Focal 20.04 / 22.04 amd64

Transport layer

Shared Memory

duchengyao avatar Jul 06 '22 03:07 duchengyao

https://github.com/eProsima/Fast-DDS/issues/2003#issuecomment-1160245640

https://github.com/boostorg/interprocess/commit/7c8893788d3823615d6ddbfd125d2ad153817111

duchengyao avatar Jul 07 '22 07:07 duchengyao

More information provided, hope to get some attention.

duchengyao avatar Sep 01 '22 07:09 duchengyao

The problem appears since https://github.com/eProsima/Fast-DDS/commit/e58dcb17e6732b827b18056520c1137b807d800f (squashed commit "Refs #8250. Do not reuse zombie ports structures.", authored by @adolfomarver and committed by @MiguelCompany on May 12, 2020).

While the subscriber is paused, the publisher logs:

2022-09-06 14:10:47.049 [RTPS_TRANSPORT_SHM Warning] SHM Port 7413 failure: the port is marked as not ok! -> Function try_push
2022-09-06 14:10:47.049 [RTPS_TRANSPORT_SHM Warning] (ID:140091162883648) Existing Port 7413 (5f4eeb5613a33705) NOT Healthy. -> Function open_port_internal
2022-09-06 14:10:47.049 [RTPS_TRANSPORT_SHM Warning] (ID:140091162883648) Port 7413 (5f4eeb5613a33705) Removed. -> Function open_port_internal
Message: HelloWorld with index: 157 SENT
2022-09-06 14:10:47.246 [RTPS_TRANSPORT_SHM Warning] SHM Port 7412 failure: the port is marked as not ok! -> Function try_push
2022-09-06 14:10:47.248 [RTPS_TRANSPORT_SHM Warning] (ID:140091402036800) Existing Port 7412 (3e1fa9f9eb2b0ade) NOT Healthy. -> Function open_port_internal
2022-09-06 14:10:47.248 [RTPS_TRANSPORT_SHM Warning] (ID:140091402036800) Port 7412 (3e1fa9f9eb2b0ade) Removed. -> Function open_port_internal

After quitting and restarting the subscriber, no data is received. The subscriber logs:

2022-09-07 14:50:27.441 [RTPS_TRANSPORT_SHM Warning] (ID:140737325877056) Port 7414 Zombie. Reset the port -> Function open_port_internal
2022-09-07 14:50:27.442 [RTPS_TRANSPORT_SHM Warning] (ID:140737325877056) Port 7415 Zombie. Reset the port -> Function open_port_internal

Before that commit, the subscriber worked well.

duchengyao avatar Sep 06 '22 07:09 duchengyao

Modifying this code from

https://github.com/eProsima/Fast-DDS/blob/293ff25b2233344bd7f1ec4600674c73af731787/src/cpp/utils/shared_memory/RobustExclusiveLock.hpp#L170-L184

to

//        if (fd != -1)
//        {
//            *was_lock_created = false;
//        }
//        else
//        {
            *was_lock_created = true;
            fd = open(file_path.c_str(), O_CREAT | O_RDONLY, 0666);
//        }

will temporarily avoid the problem
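
For readers without the source open, the effect of this change can be illustrated with a standalone POSIX sketch. The helper name `open_lock_file` and the `apply_workaround` flag are hypothetical and are not part of Fast DDS; only the `open()` flags are taken from the snippet above. The sketch shows that with the first branch commented out, an already existing file is still opened (`O_CREAT` without `O_EXCL` does not fail or truncate), but the call always reports the file as created.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <string>

// Hypothetical helper for illustration only; this is not the Fast DDS function.
// Returns an open descriptor (or -1 on failure) and reports through
// was_created whether this call treats itself as the creator of the file.
int open_lock_file(
        const std::string& path,
        bool* was_created,
        bool apply_workaround)
{
    if (!apply_workaround)
    {
        // Shape of the code before the change: try an existing file first.
        int fd = open(path.c_str(), O_RDONLY, 0);
        if (fd != -1)
        {
            *was_created = false;
            return fd;
        }
    }

    // Shape of the code after the change: always take this branch.
    // O_CREAT without O_EXCL opens the file if it already exists and
    // creates it otherwise, so a stale file left by a crashed process is
    // reported as newly created.
    *was_created = true;
    return open(path.c_str(), O_CREAT | O_RDONLY, 0666);
}
```

Under these assumptions, calling the helper on an existing file yields `was_created == false` without the workaround and `was_created == true` with it. Whether treating a stale file as new has side effects elsewhere in the transport is not something this sketch can answer.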

duchengyao avatar Sep 07 '22 08:09 duchengyao

@duchengyao Thank you very much for taking the time in investigating this ...

I think the issue will arise if the process crashes between opening and locking the file.

We would need to think of an atomic way to perform both things at the same time, as is done on Windows above in the code.
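
The window between opening and locking can be made concrete with a small POSIX sketch. It assumes the lock is taken with `flock()` on a descriptor obtained from a separate `open()` call; the enum and function names are hypothetical and are not Fast DDS APIs. A process that dies after `open()` but before `flock()` leaves a file that exists but is not locked, which a later process cannot tell apart from a file whose owner simply has not locked it yet.

```cpp
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>
#include <cerrno>
#include <string>

// Hypothetical names for illustration only; not part of Fast DDS.
enum class LockFileState
{
    Missing,         // no file on disk
    ExistsUnlocked,  // file exists, nobody holds the flock
    ExistsLocked,    // file exists and another descriptor holds the flock
    Error
};

LockFileState probe_lock_file(
        const std::string& path)
{
    int fd = open(path.c_str(), O_RDONLY, 0);
    if (fd == -1)
    {
        return errno == ENOENT ? LockFileState::Missing : LockFileState::Error;
    }

    LockFileState state;
    if (flock(fd, LOCK_EX | LOCK_NB) == 0)
    {
        // The lock was free. Either the creator has not locked it yet, or it
        // died between open() and flock(). Release the lock taken by the probe.
        flock(fd, LOCK_UN);
        state = LockFileState::ExistsUnlocked;
    }
    else
    {
        // EWOULDBLOCK means another open file description holds the lock.
        state = (errno == EWOULDBLOCK) ? LockFileState::ExistsLocked : LockFileState::Error;
    }

    close(fd);
    return state;
}
```

Because the kernel drops an `flock()` lock when its holder exits, a crash never leaves `ExistsLocked` behind; the ambiguous case is `ExistsUnlocked`, which is the state an atomic open-and-lock would rule out.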

MiguelCompany avatar Sep 07 '22 14:09 MiguelCompany

Thanks for the reply. I'd like to integrate Fast DDS into my project. Will there be any side effects if I comment out the code above? Regards,

duchengyao avatar Sep 09 '22 01:09 duchengyao

@MiguelCompany

"We would need to think of an atomic way to perform both things at the same time, as it is done on windows above in the code."

I have tried on Windows 11; unfortunately, the same issue occurs. Even worse, the subscriber won't reconnect after the publisher restarts.

duchengyao avatar Sep 14 '22 07:09 duchengyao

@MiguelCompany

Is there a planned version in which this issue will be resolved? I have run into a similar problem.

zyqhhhd avatar Nov 07 '22 08:11 zyqhhhd

@duchengyao @zyqhhhd Would you mind checking with the latest release v2.9.1?

Additionally, could you check if using fastdds shm clean before restarting the application that crashed solves the issue?

MiguelCompany avatar Jan 31 '23 15:01 MiguelCompany

I have tested running fastdds shm clean before restarting. It doesn't work.

duchengyao avatar Feb 02 '23 01:02 duchengyao

The latest version does not work either. I recorded a video: at 00:54 the subscriber stops working, and it stays that way until the publisher is restarted.

screencast.webm

duchengyao avatar Feb 02 '23 02:02 duchengyao

Modifying this code from

https://github.com/eProsima/Fast-DDS/blob/master/src/cpp/utils/shared_memory/RobustExclusiveLock.hpp#L170-L184

to

//        if (fd != -1)
//        {
//            *was_lock_created = false;
//        }
//        else
//        {
            *was_lock_created = true;
            fd = open(file_path.c_str(), O_CREAT | O_RDONLY, 0666);
//        }

will temporarily avoid the problem

I also made a video of this.

The change is in my branch: https://github.com/duchengyao/Fast-DDS/tree/temporarily-avoid-deadlock

This method also works with the latest version.

Screencast from 02-02-2023 10:31:54 AM.webm

duchengyao avatar Feb 02 '23 02:02 duchengyao

Hope it helps you.

duchengyao avatar Feb 02 '23 02:02 duchengyao

Hi @duchengyao

The latest Fast DDS release includes some fixes that improve the behavior of SHM reconnections (#3639, #3640, and #3642). Could you check again against Fast DDS v2.11.1? These fixes are being backported to the maintained Fast DDS branches.

JLBuenoLopez avatar Jul 21 '23 05:07 JLBuenoLopez

This issue has been closed, but it does NOT mean that the problem has been solved.

I'm currently no longer using SHM transport and don't have time to test it yet. The testing method has been clearly written above, and anyone can test it by themselves.

duchengyao avatar Dec 11 '23 09:12 duchengyao