Responses to a query are sometimes not received

Open robojan opened this issue 11 months ago • 4 comments

Describe the bug

Sometimes the responses to queries are not received. When this happens, the following message is logged in the zenoh log:

rx-1 ThreadId(07) zenoh::net::routing::dispatcher::queries: Route reply Face{1, 2b2c6100603ba8e1238a98fda14685ce}:12 from Face{1, 2b2c6100603ba8e1238a98fda14685ce}: Query not found!
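To measure how often this happens across runs, the log can be scanned for this message. The following is a minimal sketch in Python; the regular expression is modeled only on the single log line quoted above, and the field meanings (face id, zenoh id, query id) as well as the helper name `find_dropped_replies` are assumptions for illustration, not part of zenoh.

```python
import re

# Pattern modeled on the one log line quoted in this issue.
# Assumption: the number after the first Face{...} is the query id.
DROPPED_REPLY = re.compile(
    r"Route reply Face\{(?P<face>\d+), (?P<zid>[0-9a-f]+)\}:(?P<qid>\d+) "
    r"from Face\{\d+, [0-9a-f]+\}: Query not found!"
)


def find_dropped_replies(log_lines):
    """Return (face, zid, qid) for every 'Query not found!' reply line."""
    hits = []
    for line in log_lines:
        m = DROPPED_REPLY.search(line)
        if m:
            hits.append((int(m["face"]), m["zid"], int(m["qid"])))
    return hits


sample = [
    "rx-1 ThreadId(07) zenoh::net::routing::dispatcher::queries: "
    "Route reply Face{1, 2b2c6100603ba8e1238a98fda14685ce}:12 "
    "from Face{1, 2b2c6100603ba8e1238a98fda14685ce}: Query not found!",
    "some unrelated log line",
]
print(find_dropped_replies(sample))
```

Counting the hits per run gives a rough failure rate that can be compared between machines.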

After some digging, it seems to be related to packet reordering. Initially I was using UDP, where I had this problem much more often, but it is still present when using TCP, only less often. Normally the order of network packets is request -> response -> responseFinal, but when it goes wrong the order is request -> responseFinal -> response.
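A toy model of why this ordering would produce the "Query not found!" message, assuming the router forgets a query as soon as it sees responseFinal. This is only a sketch of the suspected mechanism, not zenoh's actual routing code; `PendingQueries` and `run` are hypothetical names, and the query id 12 is taken from the log line above.

```python
class PendingQueries:
    """Toy stand-in for a router's table of in-flight queries (hypothetical)."""

    def __init__(self):
        self._pending = {}  # query id -> replies delivered so far

    def on_request(self, qid):
        self._pending[qid] = []

    def on_response(self, qid, payload):
        if qid not in self._pending:
            return "Query not found!"  # the reply is dropped
        self._pending[qid].append(payload)
        return "delivered"

    def on_response_final(self, qid):
        # Closing the query forgets it and returns the replies seen so far.
        return self._pending.pop(qid, None)


def run(order):
    """Feed messages for query 12 to the table in the given order."""
    table = PendingQueries()
    outcomes = []
    for msg in order:
        if msg == "request":
            table.on_request(12)
        elif msg == "response":
            outcomes.append(table.on_response(12, "data"))
        elif msg == "responseFinal":
            outcomes.append(table.on_response_final(12))
    return outcomes


print(run(["request", "response", "responseFinal"]))
print(run(["request", "responseFinal", "response"]))
```

In the first ordering the reply is delivered before the query is closed. In the second, the query is closed with no replies and the late response no longer matches anything, which matches the symptom described here.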

Attached are a Wireshark trace and the corresponding log. Note that I am using zenoh here with TCP and UDP transports on port 7200. Attachments: zenoh_reorder_filtered.txt, zenoh_reorder_filtered.pcapng.gz

To reproduce

I have a reliable way to trigger this, but I cannot easily share it. My situation is as follows:

  • I have a zenoh peer that has 16 publishers with publication cache.
  • Then I have another zenoh peer that subscribes to all 16 publishers with a querying subscriber.
  • When the second peer joins, it makes a lot of requests to get the latest data, and sometimes one of these queries fails. On my x86 machine I almost never see it fail, but on the Jetson it fails in about 50% of the runs.

System info

  • Platform: NVIDIA Jetson Orin NX
  • CPU: 8 core ARM64
  • Zenoh version: 1.1.0

robojan avatar Jan 21 '25 08:01 robojan

Out of curiosity, are you testing rmw_zenoh or a pure Zenoh system? We have observed something similar in rmw_zenoh that was caused by a misconfiguration of the congestion control parameter for replies.

Mallets avatar Mar 05 '25 11:03 Mallets

What binding (programming language) are you using? If not Rust, it might be solved by some recent fixes (zenoh-c https://github.com/eclipse-zenoh/zenoh-c/commit/261493682c7dc54db3a07079315e009a2e7c1573)

OlivierHecart avatar Mar 05 '25 18:03 OlivierHecart

> Out of curiosity, are you testing rmw_zenoh or a pure Zenoh system? We have observed something similar in rmw_zenoh that was caused by a misconfiguration of the congestion control parameter for replies.

I'm using a pure zenoh system.

> What binding (programming language) are you using? If not Rust, it might be solved by some recent fixes (zenoh-c eclipse-zenoh/zenoh-c@2614936)

I'm writing a C++ library that exposes a ROS network on one side to a C++ API on the other side, so I'm using the C++ wrapper around the C bindings. I will soon check whether the problem is gone with the latest version.

robojan avatar Mar 06 '25 09:03 robojan

Hello @robojan, could you please tell me whether the issue is still present?

sashacmc avatar Oct 03 '25 15:10 sashacmc