Responses to a query are sometimes not received
Describe the bug
Sometimes the responses to queries are not received. When this happens, the following message is logged in the zenoh log:
rx-1 ThreadId(07) zenoh::net::routing::dispatcher::queries: Route reply Face{1, 2b2c6100603ba8e1238a98fda14685ce}:12 from Face{1, 2b2c6100603ba8e1238a98fda14685ce}: Query not found!
After some digging, it seems to be related to reordering of packets. Initially I was using UDP, where I had this problem a lot more often, but the problem is still present when using TCP, only less often. Normally the order of network packets is request -> response -> responseFinal, but when it goes wrong the order is request -> responseFinal -> response.
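To illustrate why that ordering would produce the "Query not found!" message, here is a toy model of a table of in-flight queries. It is only a sketch and not zenoh's actual dispatcher code: the `QueryTable` class, the `run` helper, and the assumption that responseFinal removes the query entry are all hypothetical, inferred from the log message above. The query ID 12 is arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class QueryTable:
    """Hypothetical stand-in for a router's table of in-flight queries."""
    pending: set = field(default_factory=set)
    delivered: list = field(default_factory=list)
    dropped: list = field(default_factory=list)

    def handle(self, kind, query_id, payload=None):
        if kind == "request":
            # Remember the query so that replies can be routed back.
            self.pending.add(query_id)
        elif kind == "response":
            if query_id in self.pending:
                self.delivered.append(payload)
            else:
                # Assumed to correspond to the "Query not found!" log line.
                self.dropped.append(payload)
        elif kind == "responseFinal":
            # Assumption: the final marker closes the query and forgets it.
            self.pending.discard(query_id)


def run(messages):
    """Feed a sequence of (kind, query_id[, payload]) tuples to a fresh table."""
    table = QueryTable()
    for msg in messages:
        table.handle(*msg)
    return table


in_order = run([("request", 12), ("response", 12, "data"), ("responseFinal", 12)])
reordered = run([("request", 12), ("responseFinal", 12), ("response", 12, "data")])

print("in order :", in_order.delivered, in_order.dropped)
print("reordered:", reordered.delivered, reordered.dropped)
```

In this model the in-order run delivers the reply, while the reordered run drops it because the query entry is already gone when the reply arrives.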
Attached are a Wireshark trace and the corresponding log. Note that I am using zenoh here with TCP and UDP transports on port 7200. Attachments: zenoh_reorder_filtered.txt, zenoh_reorder_filtered.pcapng.gz
To reproduce
I have a reliable way to trigger this, but I cannot easily share it. My setup is as follows:
- I have a zenoh peer that has 16 publishers with publication cache.
- Then I have another zenoh peer that subscribes to all 16 publishers with a querying subscriber.
- When the second peer joins, it makes a lot of requests to get the latest data, and sometimes one of these queries fails. On my x86 machine I almost never see it fail, but on the Jetson it fails in about 50% of the runs.
System info
- Platform: NVIDIA Jetson Orin NX
- CPU: 8 core ARM64
- Zenoh version: 1.1.0
Out of curiosity are you testing rmw_zenoh or a pure Zenoh system? We have observed something similar in rmw_zenoh that was caused by a misconfiguration of the congestion control parameter for replies.
Which binding (programming language) are you using? If not Rust, it might be solved by some recent fixes (zenoh-c https://github.com/eclipse-zenoh/zenoh-c/commit/261493682c7dc54db3a07079315e009a2e7c1573).
Out of curiosity are you testing rmw_zenoh or a pure Zenoh system? We have observed something similar in rmw_zenoh that was caused by a misconfiguration of the congestion control parameter for replies.
I'm using a pure zenoh system.
Which binding (programming language) are you using? If not Rust, it might be solved by some recent fixes (zenoh-c eclipse-zenoh/zenoh-c@2614936).
I'm writing a C++ library that exposes a ROS network on one side to a C++ API on the other side, so I'm using the C++ wrapper around the C bindings. I will soon check whether the problems are gone with the latest version.
Hello @robojan, could you please tell me if the issue is still relevant?