
UCX ignores Linux policy routing inside netns and always uses `main` routing table

Open Xunzhuo opened this issue 1 month ago • 4 comments

Background

We run RDMA workloads inside Kubernetes pods. On the host, we have several bonded RoCE devices (bond0 through bondN). Inside each pod’s network namespace we expose these via ipvlan/macvlan, so the pod sees multiple RDMA-capable interfaces (e.g. bond0/bond1 plus mlx5_bond_0/mlx5_bond_1).

To steer pod-to-pod RDMA traffic over the correct underlay path, we use Linux policy routing (multiple routing tables + ip rule) in the pod netns.

A simplified ip rule inside the pod looks like this (IPs anonymized):

# ip rule
0:      from all lookup local
3000:   from xxx.xxx.xxx.30 lookup 129
3000:   from all oif bond0 lookup 129
32766:  from all lookup main
32767:  from all lookup default
  • Table 129 contains the correct routes for traffic that should go out via bond0.
  • In practice we have similar rules/tables for multiple bonds, but this one pair is enough to illustrate the problem.
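To make the rule evaluation concrete, here is a small Python sketch of how the kernel walks these rules in priority order (illustrative only, not kernel code; it omits the `local` table and uses the anonymized addresses from the `ip rule` output above):

```python
# Illustrative sketch of Linux policy-rule evaluation (not real kernel code).
# Rules are walked in ascending priority; the first rule whose selectors
# match the flow decides which routing table is consulted.

def lookup_table(rules, src, oif=None):
    """Return the table chosen for a flow with source `src` and output
    interface `oif` (None means no oif has been fixed for the flow)."""
    for prio, (want_src, want_oif), table in sorted(rules, key=lambda r: r[0]):
        if want_src is not None and want_src != src:
            continue
        if want_oif is not None and want_oif != oif:
            continue
        return table
    return None

# Rules mirroring the pod's `ip rule` output (IPs anonymized, local table omitted).
RULES = [
    (3000,  ("xxx.xxx.xxx.30", None), 129),     # from <pod ip> lookup 129
    (3000,  (None, "bond0"),          129),     # from all oif bond0 lookup 129
    (32766, (None, None),             "main"),  # from all lookup main
]

# A flow that binds its source IP matches the first rule and hits table 129.
print(lookup_table(RULES, "xxx.xxx.xxx.30"))  # 129
# An unbound flow (src 0.0.0.0, no oif) falls through to main.
print(lookup_table(RULES, "0.0.0.0"))         # main
```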

What works (verbs perftest + NCCL)

With the above setup:

  • ib_write_bw / ib_read_bw between two pods on different nodes honor the policy routing and work correctly.
  • NCCL tests (also using verbs) behave the same way: traffic follows the ip rule + table 129 path and pod-to-pod RDMA connectivity is stable.

So policy routing in the pod netns is working as expected for other RDMA stacks.

How we run UCX

We then run ucx_perftest between two pods on different nodes, using a specific RoCE device and GID index:

Server pod:

UCX_NET_DEVICES=mlx5_bond_1:1 \
UCX_IB_GID_INDEX=7 \
ucx_perftest -c 0

Client pod:

UCX_NET_DEVICES=mlx5_bond_1:1 \
UCX_IB_GID_INDEX=7 \
ucx_perftest <server_pod_ip> -t tag_lat -c 1

(mlx5_bond_1 corresponds to the RoCE device attached to bond1 in this pod.)

Problem / observed behavior

With UCX:

  • Traffic appears to be routed according to the main routing table, not according to ip rule + table 129/other per-bond tables.

  • If we intentionally break the route in main but keep a valid route in table 129, then:

    • ib_write_bw still succeeds (policy routing is honored),
    • but ucx_perftest fails or chooses the wrong path.
  • Removing the main table route makes ucx_perftest fail even when a correct route exists in the per-bond table selected by ip rule.

From the pod’s point of view UCX behaves as if Linux policy routing does not exist and only the main table is consulted, while other tools in the same netns do respect the policy routing.

Hypothesis

Our working hypothesis is:

  • UCX correctly selects the HCA/device (mlx5_bond_1:1) internally,

  • but when creating the underlying TCP or rdma_cm sockets, it does not:

    • bind a specific source IP, nor
    • use SO_BINDTODEVICE (or otherwise set oif in the flow).

As a result, from the Linux kernel’s FIB / policy routing point of view the flow looks like:

from 0.0.0.0, oif=0, to <server_pod_ip>

This does not match either:

from xxx.xxx.xxx.30 lookup 129
from all oif bond0 lookup 129

so the kernel falls through to:

32766: from all lookup main

This would explain why:

  • verbs perftest and NCCL respect ip rule (they either bind the source or operate at a different layer),
  • whereas UCX traffic inside the same netns always behaves as if only the main table exists.
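The timing point in the hypothesis can be shown with plain sockets: an explicit bind() before connect() fixes the flow's source address, which is exactly what `from <ip>`-style rules match on. This sketch uses loopback so it runs unprivileged; on a pod you would bind the IP of the interface selected by UCX_NET_DEVICES instead:

```python
import socket

# A tiny loopback server so the clients have something to connect to.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

# Unbound client: the kernel performs the route lookup with src 0.0.0.0,
# so `from <ip>` policy rules cannot match; a source is filled in only
# after the route (and thus the table) has been chosen.
c1 = socket.socket()
c1.connect(("127.0.0.1", port))
unbound_src = c1.getsockname()[0]

# Bound client: the source address is fixed *before* connect(), so the
# route lookup sees `from 127.0.0.1` and `from`-based rules can match.
c2 = socket.socket()
c2.bind(("127.0.0.1", 0))
c2.connect(("127.0.0.1", port))
bound_src = c2.getsockname()[0]

print(unbound_src, bound_src)

for s in (c1, c2, server):
    s.close()
```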

Expected behavior

From a container / Kubernetes perspective, the expectation is:

  • When UCX runs inside a network namespace, it should allow the Linux kernel to apply the full routing policy of that netns (including ip rule and non-main tables), or

  • At least offer an option to integrate with policy routing, e.g.:

    • bind sockets to the chosen device (SO_BINDTODEVICE), and/or
    • bind sockets to an IP address associated with the selected UCX device,

so that from-/oif-based ip rules can match correctly and UCX traffic follows the same path as other RDMA users in that namespace.
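The two mitigations suggested above can be sketched in a few lines of Python. This is a best-effort illustration, not UCX code: SO_BINDTODEVICE needs CAP_NET_RAW and an existing interface (so it may fail in an unprivileged environment), while binding a source IP works unprivileged. The device name "bond0" is just the example interface from this issue:

```python
import socket

# SO_BINDTODEVICE is Linux-specific; fall back to its numeric value (25)
# in case the running Python build does not expose the constant.
SO_BINDTODEVICE = getattr(socket, "SO_BINDTODEVICE", 25)

def pin_socket(sock, ifname=None, src_ip=None):
    """Best-effort: return (device_pinned, ip_bound) for the flags applied."""
    device_pinned = ip_bound = False
    if ifname is not None:
        try:
            sock.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE,
                            ifname.encode() + b"\0")
            device_pinned = True
        except OSError:
            pass  # no CAP_NET_RAW, or the interface does not exist here
    if src_ip is not None:
        try:
            sock.bind((src_ip, 0))
            ip_bound = True
        except OSError:
            pass  # address not present on this host
    return device_pinned, ip_bound

s = socket.socket()
result = pin_socket(s, ifname="bond0", src_ip="127.0.0.1")
s.close()
print(result)
```

Either measure would give the kernel enough information (a fixed source and/or oif) for the pod's ip rules to select the per-bond table.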

Questions

  1. Is UCX currently expected to work with Linux policy routing (ip rule + multiple routing tables) inside a container network namespace?
  2. Does UCX intentionally only consult the main table for its reachability checks and endpoint creation, ignoring non-main tables?
  3. Is there any recommended configuration today to make UCX respect per-device policy routing in containerized, multi-bond environments?

Xunzhuo avatar Dec 03 '25 06:12 Xunzhuo

Hi @Xunzhuo,

Currently (since version 1.19.1) UCX pulls all the tables, not only main. However, it doesn't follow ip rule at all, but instead treats all routes as equal regardless of which table they came from. The PR mentioned above is going to add prioritization for routes with longer netmasks (more specific ones).

If we assume that other tables (such as 129 in your case) include more specific routes, this PR should solve the issue, even if implicitly. But if this is not the case, and there might be less specific routes in tables that should be prioritized by ip rule, we should consider adding support for policy routing.
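As an illustration of that longest-netmask prioritization, here is a sketch of the selection logic (not UCX source code) when routes from all tables are pooled without ip-rule priorities:

```python
# Sketch: with routes from all tables pooled and ip-rule priorities ignored,
# picking the most specific (longest-prefix) match is the tiebreak described
# above. The table labels in the next-hop strings are illustrative.
import ipaddress

def best_route(routes, dst):
    """routes: list of (cidr, nexthop). Return the nexthop of the most
    specific route containing dst, or None if no route matches."""
    addr = ipaddress.ip_address(dst)
    best = None
    for cidr, nh in routes:
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, nh)
    return best[1] if best else None

ROUTES = [
    ("10.0.0.0/8",  "via-main"),       # e.g. a broad route from table main
    ("10.1.2.0/24", "via-table-129"),  # a more specific route from table 129
]
print(best_route(ROUTES, "10.1.2.5"))  # via-table-129
print(best_route(ROUTES, "10.9.9.9"))  # via-main
```

This implicitly prefers the per-bond table only when its routes are more specific, which is why it would not help if an ip rule is supposed to select a *less* specific route.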

One more note: even if UCX chooses the correct device, it currently only binds the socket to that device's IP address, without using SO_BINDTODEVICE.

amastbaum avatar Dec 07 '25 15:12 amastbaum

Thank you @amastbaum !

Yes, I think the issue we are facing here is mainly the lack of policy routing support. Are there any plans to support this? In container/Kubernetes environments, policy routing is widely adopted in production.

Xunzhuo avatar Dec 07 '25 16:12 Xunzhuo

@yosefe we might need to add policy routing support. What do you think?

amastbaum avatar Dec 08 '25 10:12 amastbaum

Per offline discussion with @amastbaum , we'll check the possibility of some workaround here.

gleon99 avatar Dec 09 '25 14:12 gleon99