zenoh-plugin-dds

no DDS Writer after 3s - drop incoming data (broker topology - client mode)

Open JEnoch opened this issue 2 years ago • 3 comments

Discussed in https://github.com/eclipse-zenoh/roadmap/discussions/94

Originally posted by gtoff on October 4, 2023

Hi,

we are running a brokered topology with one router in a K8S cluster, one remote robot, and one or more K8S pod clients (running rviz and zenoh-bridge-dds) in the same cluster. We noticed something we did not expect: if the pods running rviz connect in "client" mode (-m client), only one client at a time seems to be able to connect. The others get a series of the following warning, one per topic:

WARN zenoh_plugin_dds::route_zenoh_dds] Route Zenoh->DDS (rt/tf -> rt/tf): still no DDS Writer after 3s - drop incoming data!

If we switch the mode to peer (-m peer), which we don't want since we'd like all communication to go through the router with no peer discovery, then messages come through.

Any idea what this could be? I am not even sure I understand the warning: is the bridge complaining that it cannot create a local DDS writer?

JEnoch avatar Oct 06 '23 06:10 JEnoch

Hi @gtoff,

I transferred your question as an issue here in eclipse-zenoh/zenoh-plugin-dds, since it's related to this bridge.

The "still no DDS Writer after 3s" message occurs when a bridge runs in forwarding mode and has discovered a DDS Reader. In this mode, it prepares a route for that Reader: it creates a Zenoh subscriber and forwards the discovery info to the remote bridges (so they can declare a DDS Reader that will be discovered by remote ROS Nodes). But it doesn't yet create the DDS Writer that will route data coming via Zenoh to this discovered DDS Reader. Only when a remote bridge forwards the discovery information of a DDS Writer is this route completed, with the creation of a DDS Writer that has the same QoS as the DDS Writer announced by the remote bridge.

What happens in your case is that a local DDS Reader has been discovered, but no remote bridge forwarded the discovery info of a DDS Writer. Still, some data are received via Zenoh. They are kept on hold for 3 seconds (waiting for a discovery info message, in case of order inversion), but are eventually dropped.

Are all your bridges configured with the -f (--fwd-discovery) option?
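To make the hold-then-drop behaviour concrete, here is a minimal toy model in Python. It is only a sketch of the logic described above, not the plugin's actual implementation (which is written in Rust): the class `ZenohToDdsRoute`, its methods, and the per-sample timestamps are all hypothetical names chosen for illustration, and the only value taken from the warning is the 3-second hold time.

```python
from dataclasses import dataclass, field

# Hold time taken from the "after 3s" in the warning message.
HOLD_TIMEOUT_S = 3.0


@dataclass
class ZenohToDdsRoute:
    """Toy model of one Zenoh->DDS route (hypothetical, not the plugin's code)."""
    topic: str
    writer_ready: bool = False                      # True once a remote DDS Writer was announced
    pending: list = field(default_factory=list)     # (arrival_time, payload) samples on hold
    delivered: list = field(default_factory=list)   # payloads handed to the DDS Writer
    dropped: int = 0                                # samples discarded after the hold time

    def expire(self, now: float) -> None:
        """Drop every held sample that has waited HOLD_TIMEOUT_S or longer."""
        kept = [(t, p) for t, p in self.pending if now - t < HOLD_TIMEOUT_S]
        self.dropped += len(self.pending) - len(kept)
        self.pending = kept

    def on_zenoh_data(self, now: float, payload: bytes) -> None:
        """Data arrives via Zenoh: forward it if a Writer exists, else hold it."""
        if self.writer_ready:
            self.delivered.append(payload)
        else:
            self.pending.append((now, payload))
            self.expire(now)

    def on_remote_writer_discovered(self, now: float) -> None:
        """A remote bridge forwarded a DDS Writer's discovery info: complete the route."""
        self.expire(now)
        self.writer_ready = True
        self.delivered.extend(p for _, p in self.pending)
        self.pending.clear()


route = ZenohToDdsRoute("rt/tf")
route.on_zenoh_data(0.0, b"a")           # held: no Writer yet
route.on_zenoh_data(1.0, b"b")           # held
route.on_zenoh_data(3.5, b"c")           # held; b"a" has now waited 3.5s and is dropped
route.on_remote_writer_discovered(3.6)   # route completed: b"b" and b"c" are flushed
```

In this model, if `on_remote_writer_discovered` is never called (no remote bridge forwards a DDS Writer), every sample ends up counted in `dropped`, which corresponds to the repeated warning seen in the logs.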

JEnoch avatar Oct 06 '23 07:10 JEnoch

Thank you @JEnoch,

so the "no DDS Writer" warning must be unrelated to the issue. Indeed, we run the bridge with the -f option because we still want to be able to build a complete ROS graph with rqt (for teaching purposes).

To give more context, we are currently just running rqt / rviz in the k8s pods and all pods have the same hostname and will start ROS nodes with the same name. Could this be the reason why messages don't go through? We also don't see this happening with lightweight applications, but once we start with more heavyweight topics (e.g., images) only one of the clients gets the data. We are talking about peaks of 15MBps, so I think we're far from saturating the infrastructure...

gtoff avatar Oct 06 '23 07:10 gtoff

Another user also reported some issues within a K8S environment on our Discord. I reproduced his deployment and saw strange behaviour in his Gazebo pod: the bridge was discovering DDS entities only after a few seconds, while in other pods it took on the order of milliseconds. This made me think that something is going wrong in the network traffic, possibly some congestion with messages being delayed in some queue or buffer.

As far as I understand, the K8S network is virtualized. This can cause different behaviour than with an Ethernet or loopback network. I'm not sure how to investigate this, but I will try to after ROSCon.

JEnoch avatar Oct 09 '23 09:10 JEnoch