antrea icon indicating copy to clipboard operation
antrea copied to clipboard

windows: node add route logic broken after antrea-agent restart

Open shettyg opened this issue 2 years ago • 5 comments

Describe the bug

On Windows, with noEncap node, if you restart antrea-agent, it will keep complaining about failure to add host route to a peer k8s node. The routes had already been added in a previous run. Errors are of the form:

I0411 21:07:54.045780   12976 node_route_controller.go:509] PeerNode IP is 21.0.90.105
W0411 21:07:55.528708   12976 net_windows.go:520] Failed to run command New-NetRoute -InterfaceIndex 9 -DestinationPrefix 192.168.5.0/24 -NextHop 21.0.89.252 -RouteMetric 256 -Verbose: : failed to run command 'New-NetRoute -InterfaceIndex 9 -DestinationPrefix 192.168.5.0/24 -NextHop 21.0.89.252 -RouteMetric 256 -Verbose': output '', exit status 1
E0411 21:07:55.529223   12976 node_route_controller.go:344] Error syncing Node 676db6c6b-w8n8k, requeuing. Error: failed to run command 'New-NetRoute -InterfaceIndex 9 -DestinationPrefix 192.168.5.0/24 -NextHop 21.0.89.252 -RouteMetric 256 -Verbose': output '', exit status 1
W0411 21:07:55.529223   12976 net_windows.go:520] Failed to run command New-NetRoute -InterfaceIndex 9 -DestinationPrefix 192.168.2.0/24 -NextHop 21.0.90.20 -RouteMetric 256 -Verbose: : failed to run command 'New-NetRoute -InterfaceIndex 9 -DestinationPrefix 192.168.2.0/24 -NextHop 21.0.90.20 -RouteMetric 256 -Verbose': output '', exit status 1
E0411 21:07:55.542144   12976 node_route_controller.go:344] Error syncing Node pool-1-566bf9d5fc-7b9w9, requeuing. Error: failed to run command 'New-NetRoute -InterfaceIndex 9 -DestinationPrefix 192.168.2.0/24 -NextHop 21.0.90.20 -RouteMetric 256 -Verbose': output '', exit status 1

Looking through the code (and adding some debug statements), the following happens.

  1. pkg/agent/route/route_windows.go:Reconcile() is called -> listRoutes() is called -> Only adds routes reachable via antrea-gw0. This skips all the routes reachable via host interface (noEncap mode)

  2. pkg/agent/route/route_windows.go:AddRoutes() is called for peer nodes. Here 'found' is false as you only have routes reachable via 'antrea-gw0'. So there is continuous tries to re-add the routes.

To Reproduce Restart antrea-agent on a windows node in noEncap mode.

Versions: Antrea 1.2 (actual reproduction) Antrea 1.6 (based on code perusal)

shettyg avatar Apr 11 '22 21:04 shettyg

For now, antrea doesn't list route entries on br-int interface (used in noEncap mode for accross Node routing) because user might have created other routes on this interface, and Antrea is possible to remove thoses entries by mistake in a reconcile. But antrea-gw0 is created by Antrea, so Antrea could operate it as desired.

To fix the issue, I am thinking maybe we should use a different metric on the routing entries created by Antrea, then we could read all these entries in reconcile (Currently, we only check destination and linkIndex to get routes). Maybe we could use a metric value as the default (256) -10 for the Pod routes in cross-Node scenario. What do you think @antoninbas @tnqn @jianjuns ?

wenyingd avatar Apr 18 '22 08:04 wenyingd

I check the code with main branch, and I didn't see the issue, and I didn't reproduce it on my testbed. I think it should be resolved by this change https://github.com/antrea-io/antrea/pull/2235/files#diff-7bcfac48fad8417e532efd53e1885724b2a3537025e64ab11030aa72a6893cfdR181

Although the current code still only lists routes on antrea-gw0, but when adding a route entry, it first checks the existence entry, and only add if not existing or update (remove-add) with the desired routes. For an installed entry as with the desired configuration, it should return directly. So we should not see the errors in logs. It is first introduced in Antrea v1.5, and back port to release-1.4. But for older versions, the code is not included.

wenyingd avatar Apr 18 '22 08:04 wenyingd

Thanks much for the pointer. Looks like I just need to get the following change to 1.2?

diff --git a/pkg/agent/route/route_windows.go b/pkg/agent/route/route_windows.go
index a889bd79..4734306b 100644
--- a/pkg/agent/route/route_windows.go
+++ b/pkg/agent/route/route_windows.go
@@ -138,7 +138,7 @@ func (c *Client) AddRoutes(podCIDR *net.IPNet, nodeName string, peerNodeIP, peer
                return nil
        }

-       if err := util.NewNetRoute(route); err != nil {
+       if err := util.ReplaceNetRoute(route); err != nil {
                return err
        }

diff --git a/pkg/agent/util/net_windows.go b/pkg/agent/util/net_windows.go
index da1abb0b..aa30de06 100644
--- a/pkg/agent/util/net_windows.go
+++ b/pkg/agent/util/net_windows.go
@@ -529,6 +529,37 @@ func RemoveNetRoute(route *Route) error {
        return err
 }

+func ReplaceNetRoute(route *Route) error {
+       rs, err := GetNetRoutes(route.LinkIndex, route.DestinationSubnet)
+       if err != nil {
+               return err
+       }
+
+       if len(rs) == 0 {
+               if err := NewNetRoute(route); err != nil {
+                       return err
+               }
+               return nil
+       }
+       found := false
+       for _, r := range rs {
+               if r.GatewayAddress.Equal(route.GatewayAddress) {
+                       found = true
+                       break
+               }
+       }
+       if found {
+               return nil
+       }
+       if err := RemoveNetRoute(route); err != nil {
+               return err
+       }
+       if err := NewNetRoute(route); err != nil {
+               return err
+       }
+       return nil
+}
+

shettyg avatar Apr 18 '22 17:04 shettyg

Yes, using "replaceRoute" should relove the issue.

wenyingd avatar May 13 '22 05:05 wenyingd

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

github-actions[bot] avatar Aug 12 '22 00:08 github-actions[bot]