UCX ignores Linux policy routing inside netns and always uses `main` routing table
Background
We run RDMA workloads inside Kubernetes pods.
On the host, we have several bonded RoCE devices (`bond0`…`bondN`).
Inside each pod's network namespace we expose these via ipvlan/macvlan, so the pod sees multiple RDMA-capable interfaces (e.g. `bond0`/`bond1` plus `mlx5_bond_0`/`mlx5_bond_1`, etc.).
To steer pod-to-pod RDMA traffic over the correct underlay path, we use Linux policy routing (multiple routing tables + `ip rule`) in the pod netns.
A simplified `ip rule` listing inside the pod looks like this (IPs anonymized):

```
# ip rule
0: from all lookup local
3000: from xxx.xxx.xxx.30 lookup 129
3000: from all oif bond0 lookup 129
32766: from all lookup main
32767: from all lookup default
```
- Table `129` contains the correct routes for traffic that should go out via `bond0` (a setup sketch follows below).
- In practice we have similar rules/tables for multiple bonds, but this one pair is enough to illustrate the problem.
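For context, a minimal sketch of how such a per-bond table/rule pair can be created; the gateway and route contents are anonymized placeholders, not taken from our real setup:

```sh
# Hypothetical sketch: populate table 129 and steer bond0 traffic into it.
# Addresses are placeholders matching the anonymized listing above.
ip route add default via xxx.xxx.xxx.1 dev bond0 table 129
ip rule add from xxx.xxx.xxx.30 lookup 129 priority 3000
ip rule add oif bond0 lookup 129 priority 3000
```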
What works (verbs perftest + NCCL)
With the above setup:
- `ib_write_bw`/`ib_read_bw` between two pods on different nodes honor the policy routing and work correctly.
- NCCL tests (also using verbs) behave the same way: traffic follows the `ip rule` + table `129` path and pod-to-pod RDMA connectivity is stable.
So policy routing in the pod netns is working as expected for other RDMA stacks.
How we run UCX
We then run `ucx_perftest` between two pods on different nodes, using a specific RoCE device and GID index:
Server pod:

```sh
UCX_NET_DEVICES=mlx5_bond_1:1 \
UCX_IB_GID_INDEX=7 \
ucx_perftest -c 0
```

Client pod:

```sh
UCX_NET_DEVICES=mlx5_bond_1:1 \
UCX_IB_GID_INDEX=7 \
ucx_perftest <server_pod_ip> -t tag_lat -c 1
```

(`mlx5_bond_1` corresponds to the RoCE device attached to `bond1` in this pod.)
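For reference, the device-to-netdev mapping and the GID in use can be confirmed inside the pod; the output line below is illustrative, not captured from our environment:

```sh
# Confirm which netdev backs the RDMA device (iproute2 "rdma" tool);
# the output shown is an illustrative example:
rdma link show
# link mlx5_bond_1/1 state ACTIVE physical_state LINK_UP netdev bond1

# Inspect the GID selected via UCX_IB_GID_INDEX=7:
cat /sys/class/infiniband/mlx5_bond_1/ports/1/gids/7
```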
Problem / observed behavior
With UCX:
- Traffic appears to be routed according to the `main` routing table, not according to `ip rule` + table `129`/other per-bond tables.
- If we intentionally break the route in `main` but keep a valid route in table `129`, then:
  - `ib_write_bw` still succeeds (policy routing is honored),
  - but `ucx_perftest` fails or chooses the wrong path.
- Removing the `main` table route makes `ucx_perftest` fail even when a correct route exists in the per-bond table selected by `ip rule`.
From the pod's point of view, UCX behaves as if Linux policy routing does not exist and only the `main` table is consulted, while other tools in the same netns do respect the policy routing.
Hypothesis
Our working hypothesis is:
- UCX correctly selects the HCA/device (`mlx5_bond_1:1`) internally,
- but when creating the underlying TCP or rdma_cm sockets, it does not:
  - bind a specific source IP, nor
  - use `SO_BINDTODEVICE` (or otherwise set `oif` in the flow).
As a result, from the Linux kernel's FIB / policy routing point of view the flow looks like:

```
from 0.0.0.0, oif=0, to <server_pod_ip>
```

This does not match either:

```
from xxx.xxx.xxx.30 lookup 129
from all oif bond0 lookup 129
```

so the kernel falls through to:

```
32766: from all lookup main
```
This would explain why:
- verbs perftest and NCCL respect `ip rule` (they either bind the source or operate at a different layer),
- whereas UCX traffic inside the same netns always behaves as if only the `main` table exists (this can be checked with `ip route get`, as sketched below).
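The hypothesis can be tested from inside the pod netns with `ip route get`, which performs the same FIB lookup the kernel would do for a socket (using the anonymized `bond0`/table `129` pair from the listing above):

```sh
# Unbound lookup, i.e. what an unbound UCX socket would trigger:
# no "from"/"oif" rule matches, so the main table is consulted.
ip route get <server_pod_ip>

# Lookups with the source address or output device filled in, i.e. what
# a properly bound socket would trigger: the priority-3000 rules match
# and the route is resolved from table 129.
ip route get <server_pod_ip> from xxx.xxx.xxx.30
ip route get <server_pod_ip> oif bond0
```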
Expected behavior
From a container / Kubernetes perspective, the expectation is:
- When UCX runs inside a network namespace, it should allow the Linux kernel to apply the full routing policy of that netns (including `ip rule` and non-`main` tables), or
- At least offer an option to integrate with policy routing, e.g.:
  - bind sockets to the chosen device (`SO_BINDTODEVICE`), and/or
  - bind sockets to an IP address associated with the selected UCX device,

so that `from`/`oif`-based `ip rule` entries can be triggered correctly and UCX traffic follows the same path as other RDMA users in that namespace.
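As a side note, what `ucx_perftest` currently does at the socket level can be observed from inside the pod; a diagnostic sketch, assuming `strace` is available in the container image:

```sh
# Trace bind() and setsockopt() calls made by the client to see whether
# a source IP or SO_BINDTODEVICE is ever applied before connecting:
strace -f -e trace=bind,setsockopt \
  ucx_perftest <server_pod_ip> -t tag_lat -c 1 2>&1 |
  grep -E 'SO_BINDTODEVICE|bind\('
```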
Questions
- Is UCX currently expected to work with Linux policy routing (`ip rule` + multiple routing tables) inside a container network namespace?
- Does UCX intentionally only consult the `main` table for its reachability checks and endpoint creation, ignoring non-`main` tables?
- Is there any recommended configuration today to make UCX respect per-device policy routing in containerized, multi-bond environments?
Hi @Xunzhuo,
Currently (since version 1.19.1) UCX pulls in all the routing tables, not only `main`. However, it doesn't follow `ip rule` at all; instead it treats all routes as equal, regardless of which table they came from. The PR mentioned above is going to add prioritization for routes with longer netmasks (more specific ones).
If we assume that the other tables (such as `129` in your case) contain more specific routes, this PR should solve the issue, even if only implicitly. But if that is not the case, and there might be less specific routes in tables that should be prioritized by `ip rule`, we should consider adding support for policy routing.
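A quick way to check which case applies in your environment is to compare the tables directly inside the pod netns:

```sh
# If table 129 holds more specific prefixes than main, the netmask-based
# prioritization alone would already pick the intended routes:
ip route show table main
ip route show table 129
```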
Another thing: even if UCX chooses the correct device, it currently only binds the socket to that device's IP address, without using `SO_BINDTODEVICE`.
Thank you @amastbaum !
Yes, I think the issue we are facing here is mainly the lack of policy routing support. Are there any plans to support this? In container/Kubernetes environments, policy routing is widely adopted in production.
@yosefe we might need to add policy routing support. What do you think?
Per offline discussion with @amastbaum, we'll check the possibility of some workaround (WA) here.