ros_gz
ROS -> IGN stops working after 5 minutes of inactivity
I have noticed this problem in the SubT environment, but it can be easily reproduced as follows:
- Start roscore.
- In one terminal (Terminal 1), run the parameter bridge:
  rosrun ros1_ign_bridge parameter_bridge "test@std_msgs/String]ignition.msgs.StringMsg"
- In another terminal (Terminal 2), echo the ign-transport topic:
  ign topic -e -t /test
- In another terminal (Terminal 3), publish on the ROS topic:
  rostopic pub -1 /test std_msgs/String "data: 'Hello1'"
- On Terminal 2, you should see:
  data: "Hello1"
- Now wait 5 minutes without publishing anything and then, on Terminal 3, run:
  rostopic pub -1 /test std_msgs/String "data: 'Hello2'"
- You expect to see:
  data: "Hello2"
  but instead, you'll get nothing.
This is strange behavior, since neither ROS nor ign-transport has this problem when used directly to send messages to other ROS and ign-transport nodes, respectively.
The issue seems to be related to having docker installed/running. Others may not experience this problem. We can put this issue on hold for now.
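For anyone who wants to retest, the manual steps above can be collapsed into one shell function. This is only a sketch: the function name, the variable names, the delays, and the log path are made up for illustration. It assumes roscore, rosrun, rostopic, and ign are on PATH, and that ign topic -e writes each message to the log as it arrives.

```shell
#!/bin/sh
# Sketch of the manual reproduction steps as a single function.
# All names below (repro_bridge_idle, IDLE_SECONDS, ...) are illustrative.

IDLE_SECONDS="${IDLE_SECONDS:-300}"      # idle time between the two messages
STARTUP_SECONDS="${STARTUP_SECONDS:-5}"  # time to let each process come up
SETTLE_SECONDS="${SETTLE_SECONDS:-2}"    # time to let a message reach the listener
ECHO_LOG="${ECHO_LOG:-/tmp/ign_echo.log}"

repro_bridge_idle() {
  : > "$ECHO_LOG"                        # start with an empty log

  roscore >/dev/null 2>&1 &
  core_pid=$!
  sleep "$STARTUP_SECONDS"

  # Terminal 1: the ROS -> IGN bridge
  rosrun ros1_ign_bridge parameter_bridge \
    "test@std_msgs/String]ignition.msgs.StringMsg" >/dev/null 2>&1 &
  bridge_pid=$!

  # Terminal 2: the ign-transport listener, captured to a log file
  ign topic -e -t /test >> "$ECHO_LOG" 2>&1 &
  echo_pid=$!
  sleep "$STARTUP_SECONDS"

  # Terminal 3: first message, idle period, second message
  rostopic pub -1 /test std_msgs/String "data: 'Hello1'" >/dev/null
  sleep "$IDLE_SECONDS"
  rostopic pub -1 /test std_msgs/String "data: 'Hello2'" >/dev/null
  sleep "$SETTLE_SECONDS"

  kill "$echo_pid" "$bridge_pid" "$core_pid" 2>/dev/null
  wait 2>/dev/null

  if grep -q 'Hello2' "$ECHO_LOG"; then
    echo "Hello2 received: problem not reproduced"
    return 0
  fi
  echo "Hello2 missing: problem reproduced"
  return 1
}
```

Sourcing the file and calling repro_bridge_idle runs the whole sequence; setting IDLE_SECONDS=600 before the call would match the 10-minute wait mentioned in the replies.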
I just tested this. I don't have Docker running, so I was still able to get the Hello2 message after 5 minutes.
I just tested this inside a docker container and the 2nd message went through even after 10 minutes.
Okay. This is really strange. I tested it inside the latest osrf/subt-virtual-testbed yesterday and I was still getting the problem. At this point, I'm inclined to think it must be something wrong with my machine/network setup.
For the record, Addisu and I tested a couple of scenarios:
- I ran it on my desktop inside a Docker container (Docker version 18.09) and I wasn't able to reproduce the problem.
- I ran it on my laptop, directly on the host, and I wasn't able to reproduce the problem.
- On both my laptop and Addisu's computer, Ignition Transport binds to 172.17.0.1, which belongs to the Docker network interface. You can check this by running the bridge or the listener with IGN_VERBOSE=1.
- If Addisu forces Ignition Transport to bind to another network interface (e.g., IGN_IP=127.0.0.1), the problem disappears.
- It doesn't make any difference if the computer is connected to a LAN or isolated.
- It doesn't make any difference if you run another listener after the 5-6 minute mark. The first listener still misses the message (in Addisu's computer).
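The IGN_IP workaround from the list above can be applied per command instead of exporting it for the whole shell. A minimal sketch, where run_with_ign_ip is a made-up helper name and IGN_VERBOSE=1 is added only so the bound address shows up in the output:

```shell
#!/bin/sh
# Run one command with Ignition Transport pinned to a given address.
# run_with_ign_ip is an illustrative helper, not part of any package.
run_with_ign_ip() {
  bind_addr="$1"
  shift
  # The variables apply only to the command being launched.
  IGN_IP="$bind_addr" IGN_VERBOSE=1 "$@"
}
```

For example, run_with_ign_ip 127.0.0.1 ign topic -e -t /test would start the listener bound to the loopback interface, and the same prefix can be used for the rosrun line that starts the bridge.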
Thanks for summarizing our tests @caguero
This is a capture from Wireshark showing only the ign-transport side of things. It shows that the first message was sent from the IP 172.17.0.1 and received an ACK. The second message, 305 seconds later, was sent from 192.168.1.74, but it didn't get an ACK. I'm not sure why the second message is sent from a different IP.
This might be a related issue: https://github.com/zeromq/libzmq/issues/2763
Based on the comments there, I tried setting ZMQ_HEARTBEAT_IVL to 30 seconds and that seems to fix the problem for me. Again, since I'm the only one experiencing it and since I don't know exactly what's causing the problem, we can wait on making any changes. This is my diff on ign-transport/src/NodeShared.cc, FYI:
diff --git a/src/NodeShared.cc b/src/NodeShared.cc
--- a/src/NodeShared.cc
+++ b/src/NodeShared.cc
@@ -939,17 +939,23 @@ void NodeShared::OnNewConnection(const M
 {
   try
   {
     // Handle security
     this->dataPtr->SecurityOnNewConnection();
     // I am not connected to the process.
     if (!this->connections.HasPublisher(addr))
+    {
+      // Heartbeat every 30 seconds
+      int heartBeatVal = 30000;
+      this->dataPtr->subscriber->setsockopt(ZMQ_HEARTBEAT_IVL,
+          &heartBeatVal, sizeof(heartBeatVal));
       this->dataPtr->subscriber->connect(addr.c_str());
+    }
     // Add a new filter for the topic.
     this->dataPtr->subscriber->setsockopt(ZMQ_SUBSCRIBE,
         topic.data(), topic.size());
     // Register the new connection with the publisher.
     this->connections.AddPublisher(_pub);