whereabouts icon indicating copy to clipboard operation
whereabouts copied to clipboard

Handle pod delete with non-existent IPs

Open ashish-billore opened this issue 4 years ago • 15 comments

This patch makes whereabouts pods deployed as ds to use hostNetwork by default. This is helpful as CNI and core infra components use hostnetwork and hence avoid issues such as fault in CNI config etc and also making debugging much easier.

ashish-billore avatar Jul 08 '20 10:07 ashish-billore

@dougbtv @crandles @nicklesimba @aneeshkp Please have a look and give your opinion. Thanks!

ashish-billore avatar Jul 09 '20 01:07 ashish-billore

ok, so ds deployment related change for using hostNetwork is now moved out this pull request: https://github.com/dougbtv/whereabouts/pull/45 I'll fix this pull request to include @crandles comments.

ashish-billore avatar Jul 10 '20 08:07 ashish-billore

@dougbtv @crandles @nicklesimba @aneeshkp Please have a look and give your opinion. Thanks!

ashish-billore avatar Jul 14 '20 14:07 ashish-billore

+1 LGTM, but curious when this scenario occurs

Thanks @crandles for the review! In my case, I used an IP range with whereabouts and macvlan cni, created few pods and realized the range I used had conflict of IPs with another cluster, so I changed the IP range for whereabouts and then problem happened with following kubelet logs, which repeated endlessly.

As seen below, calico CNI and its IPAM was able to handle this situation gracefully with a warning. I felt whereabouts can also take a similar approach without erroring out on non-existent IPs. This will prevent endless attempts on kubelet to clean-up the pod sandbox.


Logs:

Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.727 [WARNING][12388] k8s.go 450: WorkloadEndpoint does not exist in the datastore, moving forward with the clean up ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" WorkloadEndpoint="node2.sl1--macvlan-k8s-mynginx--5777785f55--jrpfs-eth0" Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.727 [INFO][12388] k8s.go 485: Cleaning up netns ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.727 [INFO][12388] k8s.go 492: Releasing IP address(es) ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.727 [INFO][12388] utils.go 168: Calico CNI releasing IP address ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.762 [INFO][12399] ipam_plugin.go 302: Releasing address using handleID ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" HandleID="k8s-pod-network.4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" Workload="node2.sl1--macvlan-k8s-mynginx--5777785f55--jrpfs-eth0" Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.762 [INFO][12399] ipam.go 1172: Releasing all IPs with handle 'k8s-pod-network.4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276' Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.775 [WARNING][12399] ipam_plugin.go 312: Asked to release address but it doesn't exist. Ignoring ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" HandleID="k8s-pod-network.4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" Workload="node2.sl1--macvlan-k8s-mynginx--5777785f55--jrpfs-eth0" Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.775 [INFO][12399] ipam_plugin.go 323: Releasing address using workloadID ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" HandleID="k8s-pod-network.4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" Workload="node2.sl1--macvlan-k8s-mynginx--5777785f55--jrpfs-eth0" Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.775 [INFO][12399] ipam.go 1172: Releasing all IPs with handle 'default.mynginx-5777785f55-jrpfs' Jul 9 02:31:39 node2 kubelet: 2020-07-09 02:31:39.781 [INFO][12388] k8s.go 498: Teardown processing complete. ContainerID="4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" Jul 9 02:31:39 node2 kubelet: E0709 02:31:39.785678 11876 cni.go:385] Error deleting default_mynginx-5777785f55-jrpfs/4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276 from network multus/multus-cni-network: delegateDel: error invoking DelegateDel - "macvlan": error in getting result from DelNetwork: Error deallocating IP: Did not find reserved IP for container 4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276 / delegateDel: error invoking DelegateDel - "macvlan": error in getting result from DelNetwork: Error deallocating IP: Did not find reserved IP for container 4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276 Jul 9 02:31:39 node2 kubelet: E0709 02:31:39.790173 11876 remote_runtime.go:128] StopPodSandbox "4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" from runtime service failed: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "mynginx-5777785f55-jrpfs_default" network: delegateDel: error invoking DelegateDel - "macvlan": error in getting result from DelNetwork: Error deallocating IP: Did not find reserved IP for container 4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276 / delegateDel: error invoking DelegateDel - "macvlan": error in getting result from DelNetwork: Error deallocating IP: Did not find reserved IP for container 4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276 Jul 9 02:31:39 node2 kubelet: E0709 02:31:39.790199 11876 kuberuntime_gc.go:170] Failed to stop sandbox "4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276" before removing: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "mynginx-5777785f55-jrpfs_default" network: delegateDel: error invoking DelegateDel - "macvlan": error in getting result from DelNetwork: Error deallocating IP: Did not find reserved IP for container 4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276 / delegateDel: error invoking DelegateDel - "macvlan": error in getting result from DelNetwork: Error deallocating IP: Did not find reserved IP for container 4cf6d1483120a31df2eb3afd2507a6275fb97694af427df1a0a5761b55191276 Jul 9 02:31:39 node2 kubelet: W0709 02:31:39.791959 11876 cni.go:331] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "13f2e9878d40c61bce680f526df23ed0fd4362580bb100d62e41a3d66efeda0e" Jul 9 02:31:39 node2 kubelet: Whereabouts logging: cannot open /var/log/whereabouts/whereabouts.log2020-07-09T02:31:39Z [debug] DEL - IPAM configuration successfully read: {Name:acn-lb-net Type:whereabouts Routes:[] Datastore:kubernetes Addresses:[] OmitRanges:[] DNS:{Nameservers:[] Domain: Search:[] Options:[]} Range:172.168.0.0/24 RangeStart:172.168.0.2 RangeEnd:172.168.0.254 GatewayStr:172.168.0.1 EtcdHost: EtcdUsername: EtcdPassword:********* EtcdKeyFile: EtcdCertFile: EtcdCACertFile: LogFile:/var/log/whereabouts/whereabouts.log LogLevel:debug Gateway:172.168.0.1 Kubernetes:{KubeConfigPath:/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig K8sAPIRoot:} ConfigurationPath:} Jul 9 02:31:39 node2 kubelet: 2020-07-09T02:31:39Z [debug] ContainerID: 13f2e9878d40c61bce680f526df23ed0fd4362580bb100d62e41a3d66efeda0e Jul 9 02:31:39 node2 kubelet: 2020-07-09T02:31:39Z [debug] IPManagement -- mode: 1 / host: / containerID: 13f2e9878d40c61bce680f526df23ed0fd4362580bb100d62e41a3d66efeda0e Jul 9 02:31:39 node2 kubelet: 2020-07-09T02:31:39Z [error] Error deallocating IP: Did not find reserved IP for container 13f2e9878d40c61bce680f526df23ed0fd4362580bb100d62e41a3d66efeda0e Jul 9 02:31:39 node2 kubelet: 2020-07-09T02:31:39Z [error] Error deallocating IP: Did not find reserved IP for container 13f2e9878d40c61bce680f526df23ed0fd4362580bb100d62e41a3d66efeda0e Jul 9 02:31:39 node2 kubelet: 2020-07-09T02:31:39Z [debug] DEL - IPAM configuration successfully read: {Name:macvlan-1 Type:whereabouts Routes:[] Datastore:kubernetes Addresses:[] OmitRanges:[] DNS:{Nameservers:[] Domain: Search:[] Options:[]} Range:12.10.0.0/16 RangeStart:12.10.0.2 RangeEnd:12.10.255.254 GatewayStr:12.10.0.1 EtcdHost: EtcdUsername: EtcdPassword:********* EtcdKeyFile: EtcdCertFile: EtcdCACertFile: LogFile:/tmp/whereabouts.log LogLevel:debug Gateway:12.10.0.1 Kubernetes:{KubeConfigPath:/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig K8sAPIRoot:} ConfigurationPath:}

ashish-billore avatar Jul 22 '20 06:07 ashish-billore

Actually, this is also the expected behavior for IPAM as mentioned in the CNI spec here:

Plugins should generally complete a DEL action without error even if some resources are missing. For example, an IPAM plugin should generally release an IP allocation and return success even if the container network namespace no longer exists, unless that network namespace is critical for IPAM management.

ashish-billore avatar Jul 24 '20 05:07 ashish-billore

@crandles @dougbtv please suggest if anything is needed towards closing this pull request. Thanks!

ashish-billore avatar Aug 12 '20 02:08 ashish-billore

@ashish-billore ashish-billore @dougbtv , we are facing same issues when running cluster went down due to power outage . (simulated using forcefully power down server)

Error

E1006 05:52:39.220542 5964 remote_runtime.go:128] StopPodSandbox "869c45444de0b540e1babd380ae6e642d6c446b0 1fc603adb16ee1115a475bc9" from runtime service failed: rpc error: code = Unknown desc = failed to destroy network for sandbox "869c45444de0b540e1ba bd380ae6e642d6c446b01fc603adb16ee1115a475bc9": delegateDel: error invoking DelegateDel - "macvlan": error in getting result from DelNetwork: Error deallocating IP: Did not find reserved IP for container 869c45444de0b540e1babd380ae6e642d6c446b01fc603adb16ee1115a475bc9

Saw that same error log was thr and @crandles suggested some code changes.

Let us know if this is checked in or what is the plan to do that.

Appreciated

mkt-mohit avatar Oct 06 '20 09:10 mkt-mohit

Saw that same error log was thr and @crandles suggested some code changes.

Let us know if this is checked in or what is the plan to do that.

@mkt-mohit Changes suggested in @crandles comments have been incorporated and I see it has been approved after that.

We have been using whereabouts patched with changes as mentioned in this PR, in our setup to handle such cases. So, in case you want to check if it helps with your scenario, please feel free to use this patch, it should be simple to just patch whereabouts and build it locally.

ashish-billore avatar Oct 16 '20 08:10 ashish-billore

Saw that same error log was thr and @crandles suggested some code changes. Let us know if this is checked in or what is the plan to do that.

@mkt-mohit Changes suggested in @crandles comments have been incorporated and I see it has been approved after that.

We have been using whereabouts patched with changes as mentioned in this PR, in our setup to handle such cases. So, in case you want to check if it helps with your scenario, please feel free to use this patch, it should be simple to just patch whereabouts and build it locally.

I have taken latest checked docker image, but still getting same error wherein it stuck in loop to delete.

is there anything we can work on this from here? Suggestion appreciated

mkt-mohit avatar Oct 16 '20 10:10 mkt-mohit

I have taken latest checked docker image, but still getting same error wherein it stuck in loop to delete.

The latest docker image does not have this patch included, as this PR is reviewed and approved but not yet merged.

is there anything we can work on this from here?

You need to build your own docker image of whereabouts with this patch included to try this fix. Follow these instructions to make a locally built docker image of whereabouts from code with this patch applied.

ashish-billore avatar Oct 22 '20 02:10 ashish-billore

Sorry this got lagged behind! LGTM. Merging.

dougbtv avatar Nov 17 '20 15:11 dougbtv

Hrmmm, I also realize this conflicts with a recent change, so I need to rebase it. This solution is more ideal than what's currently there, I'll have to return to it.

dougbtv avatar Nov 17 '20 15:11 dougbtv

Seems problem with CI/CD system:

$ test -z $(gofmt -l pkg/. cmd/.)

/home/travis/.travis/functions: line 109: test: too many arguments

The command "test -z $(gofmt -l pkg/. cmd/.)" failed and exited with 2 during .

Your build has been stopped.

ashish-billore avatar Nov 23 '20 02:11 ashish-billore

recheck

ashish-billore avatar Nov 23 '20 02:11 ashish-billore

Sorry this got lost. If it's still a concern, can you add a test and we can see if it still exists, then we can rebase, thanks!

dougbtv avatar Jun 23 '22 13:06 dougbtv