
CI: Cilium E2E Upgrade - no-errors-in-logs - retrieving device lxc_health: Link not found

Open giorio94 opened this issue 1 year ago • 16 comments

CI failure

Hit on https://github.com/cilium/cilium/pull/32688 Link: https://github.com/cilium/cilium/actions/runs/9209728129/job/25335026937 Sysdump: cilium-sysdump-7-final.zip

❌ Found 2 logs in kind-kind/kube-system/cilium-d2588 (cilium-agent) matching list of errors that must be investigated:
time="2024-05-23T14:34:10Z" level=error msg="Error while reloading endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2584 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.0.12 ipv6="fd00:10:244::1cd5" k8sPodName=/ subsys=endpoint (1 occurrences)
time="2024-05-23T14:34:10Z" level=error msg="endpoint regeneration failed" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2584 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.0.12 ipv6="fd00:10:244::1cd5" k8sPodName=/ subsys=endpoint (1 occurrences)

giorio94 avatar May 23 '24 16:05 giorio94

Happening very frequently on https://github.com/cilium/cilium/pull/32725 too

gandro avatar May 27 '24 12:05 gandro

It looks like this has been consistently failing since May 22nd, with no failures occurring before then. Here are the numbers of failures by matrix since then:

| Job Matrix Entry | Kernel | Num Failures | Link |
|---|---|---|---|
| 7 | bpf-next | 25 | link |
| 6 | 6.1 | 1 | link |
| 14 | 5.14 | 1 | link |


learnitall avatar May 29 '24 15:05 learnitall

I checked through the scheduled tests on main and the first one I found was https://github.com/cilium/cilium/actions/runs/9195482429/job/25291405008. This coincides with the merge of https://github.com/cilium/cilium/pull/32518, which was the last change to the code the "retrieving device" log message comes from. @ti-mo or @lmb, could you check whether this might be the source of this issue?

joamaki avatar May 30 '24 08:05 joamaki

@rgo3 I heard you were taking a look so I added you as an assignee, just a heads up

learnitall avatar May 30 '24 14:05 learnitall

Hit on #32744 Link: https://github.com/cilium/cilium/actions/runs/9298554153/job/25590680316 Sysdump: cilium-sysdump-7-final.zip

 ❌ Found 2 logs in kind-kind/kube-system/cilium-ddg8l (cilium-agent) matching list of errors that must be investigated:
time="2024-05-30T07:29:45Z" level=error msg="Error while reloading endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1552 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.3.210 ipv6="fd00:10:244:3::920d" k8sPodName=/ subsys=endpoint (1 occurrences)
time="2024-05-30T07:29:45Z" level=error msg="endpoint regeneration failed" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1552 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.3.210 ipv6="fd00:10:244:3::920d" k8sPodName=/ subsys=endpoint (1 occurrences)

youngnick avatar May 31 '24 00:05 youngnick

Hit this here I think: https://github.com/cilium/cilium/actions/runs/9307792492/job/25619833558

tommyp1ckles avatar May 31 '24 02:05 tommyp1ckles

So, from my initial investigation yesterday, I would say https://github.com/cilium/cilium/pull/32518 is what caused this error to show up in the error logs, because the log level changed from Warn to Error, but it is not the underlying cause of why we can't find lxc_health. Jussi already suspected that in the offline discussion.

From looking at the logs of a sysdump where it happened, I can see some sort of race between the loader and the initialization of the health endpoint:

6103:2024-05-30T19:06:25.246590173Z time="2024-05-30T19:06:25Z" level=debug msg="Killing old health endpoint process" pidfile=/var/run/cilium/state/health-endpoint.pid subsys=cilium-health-launcher
6104:2024-05-30T19:06:25.247099202Z time="2024-05-30T19:06:25Z" level=debug msg="Killed endpoint process" pid=1037 pidfile=/var/run/cilium/state/health-endpoint.pid subsys=cilium-health-launcher
6368:2024-05-30T19:06:26.497522773Z time="2024-05-30T19:06:26Z" level=error msg="Error while reloading endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2033 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.3.192 ipv6="fd00:10:244:3::fa95" k8sPodName=/ subsys=endpoint
6380:2024-05-30T19:06:26.497697378Z time="2024-05-30T19:06:26Z" level=info msg="generating BPF for endpoint failed, keeping stale directory" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2033 error="retrieving device lxc_health: Link not found" file-path=2033_next_fail identity=4 ipv4=10.244.3.192 ipv6="fd00:10:244:3::fa95" k8sPodName=/ subsys=endpoint
6384:2024-05-30T19:06:26.497977680Z time="2024-05-30T19:06:26Z" level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=899.216455ms bpfWaitForELF="26.008µs" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2033 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.3.192 ipv6="fd00:10:244:3::fa95" k8sPodName=/ mapSync=9.941627ms policyCalculation="900.128µs" prepareBuild=7.318288ms proxyConfiguration="11.18µs" proxyPolicyCalculation="167.333µs" proxyWaitForAck=0s reason="updated security labels" subsys=endpoint total=1m51.825101738s waitingForCTClean=59.918287ms waitingForLock=1m50.690587536s
6385:2024-05-30T19:06:26.497982489Z time="2024-05-30T19:06:26Z" level=debug msg="Error regenerating endpoint: retrieving device lxc_health: Link not found" ciliumEndpointName=/ code=Failure containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2033 endpointState=ready identity=4 ipv4=10.244.3.192 ipv6="fd00:10:244:3::fa95" k8sPodName=/ policyRevision=0 subsys=endpoint type=200
6400:2024-05-30T19:06:26.510255562Z time="2024-05-30T19:06:26Z" level=error msg="endpoint regeneration failed" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2033 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.3.192 ipv6="fd00:10:244:3::fa95" k8sPodName=/ subsys=endpoint
6444:2024-05-30T19:06:26.585168967Z time="2024-05-30T19:06:26Z" level=debug msg="Didn't find existing device" error="Link not found" subsys=cilium-health-launcher veth=cilium_health
6445:2024-05-30T19:06:26.585171281Z time="2024-05-30T19:06:26Z" level=debug msg="Didn't find existing device" error="Link not found" subsys=cilium-health-launcher veth=lxc_health
6446:2024-05-30T19:06:26.601161281Z time="2024-05-30T19:06:26Z" level=debug msg="Created veth pair" subsys=endpoint-connector vethPair="[cilium lxc_health]"
6463:2024-05-30T19:06:26.633890277Z time="2024-05-30T19:06:26Z" level=debug msg="Spawning health endpoint with command \"cilium-health-responder\" [\"--listen\" \"4240\" \"--pidfile\" \"/var/run/cilium/state/health-endpoint.pid\"]" subsys=cilium-health-launcher
6507:2024-05-30T19:06:27.687842061Z time="2024-05-30T19:06:27Z" level=debug msg="cilium-health agent running" pidfile=/var/run/cilium/state/health-endpoint.pid subsys=cilium-health-launcher
6508:2024-05-30T19:06:27.687921610Z time="2024-05-30T19:06:27Z" level=debug msg="Running \"ip [route add 10.244.3.45/32 dev cilium]\"" subsys=cilium-health-launcher
6509:2024-05-30T19:06:27.720737217Z time="2024-05-30T19:06:27Z" level=debug msg="Running \"ip [route add 0.0.0.0/0 via 10.244.3.45 mtu 1450 dev cilium]\"" subsys=cilium-health-launcher
6515:2024-05-30T19:06:27.812147656Z time="2024-05-30T19:06:27Z" level=debug msg="Running \"ip [-6 route add fd00:10:244:3::29a9/128 dev cilium]\"" subsys=cilium-health-launcher
6519:2024-05-30T19:06:27.848067069Z time="2024-05-30T19:06:27Z" level=debug msg="Running \"ip [-6 route add ::/0 via fd00:10:244:3::29a9 mtu 1450 dev cilium]\"" subsys=cilium-health-launcher
6726:2024-05-30T19:06:29.196394945Z time="2024-05-30T19:06:29Z" level=info msg="Program cil_from_container attached to device lxc_health using tcx" subsys=datapath-loader

The first two log lines are from the health endpoint (re)initialization, cleaning up the old endpoint after the upgrade. Then we see logs from the loader trying to load the datapath for the health endpoint but failing because the netlink device has been deleted. After that, we successfully recreate the health endpoint (see "Created veth pair", "Spawning health endpoint ..."), after which the loader successfully attaches the datapath ("... attached to device lxc_health using tcx").

What's not entirely clear to me is why the loader is trying to attach to the health endpoint in the first place. My guess is that the cleanup code for the health endpoint does not remove the endpoint from the endpoint manager (or something along those lines) on shutdown, although I think that should happen when the agent shuts down.
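For context on where the error string comes from, here is a minimal, hypothetical sketch of the failing lookup, assuming the vishvananda/netlink package; the function name and error wrapping are illustrative, not the actual loader code:

```go
// Hypothetical sketch, not Cilium's actual loader code: how a
// "retrieving device lxc_health: Link not found" error can surface when the
// veth is deleted between regeneration being queued and the reload running.
package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

// retrieveDevice looks the endpoint's interface up by name right before the
// datapath is (re)attached. If the health launcher has already deleted
// lxc_health, netlink reports the link as not found and the caller surfaces
// this as an endpoint regeneration failure.
func retrieveDevice(name string) (netlink.Link, error) {
	link, err := netlink.LinkByName(name)
	if err != nil {
		return nil, fmt.Errorf("retrieving device %s: %w", name, err)
	}
	return link, nil
}

func main() {
	if _, err := retrieveDevice("lxc_health"); err != nil {
		fmt.Println(err) // e.g. "retrieving device lxc_health: Link not found"
	}
}
```

If lxc_health is deleted in the window between the regeneration being queued and this lookup running, the reload fails exactly as in the logs above.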

rgo3 avatar May 31 '24 07:05 rgo3

I'm also wondering why the test fails disproportionately with the following configuration of Cilium:

https://github.com/cilium/cilium/blob/8de7a903400aa237600c1f49d0d4ef16503c2ee3/.github/workflows/tests-e2e-upgrade.yaml#L154-L164

joestringer avatar Jun 04 '24 23:06 joestringer

I'm also wondering why the test fails disproportionately with the following configuration of Cilium:

Presumably because with tcx we switched all attachments to BPF links, which is not the case for kernels older than roughly 6.4. We can make this a release blocker for the final 1.16.0, and I'll help Robin with the investigation and fix in the meantime.

borkmann avatar Jun 06 '24 13:06 borkmann

Another hit on https://github.com/cilium/cilium/pull/33240 Workflow: https://github.com/cilium/cilium/actions/runs/9572595235/job/26392256829 Sysdump: cilium-sysdump-6-final.zip

thorn3r avatar Jun 19 '24 13:06 thorn3r

Hm, to follow up on @rgo3's line of investigation, this one is potentially interesting: https://github.com/cilium/cilium/blob/main/daemon/cmd/health.go#L119

In initHealth() we have:

	// Make sure to clean up the endpoint namespace when cilium-agent terminates
	cleaner.cleanupFuncs.Add(health.KillEndpoint)
	cleaner.cleanupFuncs.Add(health.CleanupEndpoint)

However, cleanupHealthEndpoint() also deletes the endpoint itself (removing it from the endpoint manager), which the above does not seem to do ...

func (d *Daemon) cleanupHealthEndpoint() {
	// Delete the process
	health.KillEndpoint()

	// Clean up agent resources
	var ep *endpoint.Endpoint
	healthIPv4 := node.GetEndpointHealthIPv4()
	healthIPv6 := node.GetEndpointHealthIPv6()
	if healthIPv4 != nil {
		ep = d.endpointManager.LookupIPv4(healthIPv4.String())
	}
	if ep == nil && healthIPv6 != nil {
		ep = d.endpointManager.LookupIPv6(healthIPv6.String())
	}
	if ep == nil {
		log.Debug("Didn't find existing cilium-health endpoint to delete")
	} else {
		log.Debug("Removing existing cilium-health endpoint")
		errs := d.deleteEndpointQuiet(ep, endpoint.DeleteConfig{
			NoIPRelease: true,
		})
		for _, err := range errs {
			log.WithError(err).Debug("Error occurred while deleting cilium-health endpoint")
		}
	}
	health.CleanupEndpoint()
}

borkmann avatar Jun 26 '24 13:06 borkmann

Do we have any guarantees that the StopFunc on a controller is run when the agent shuts down? If not, we could try adding func (d *Daemon) cleanupHealthEndpoint() to the cleanup funcs of the daemon cell and see if that makes a difference.
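For illustration only, a self-contained sketch of that suggestion, assuming a cleanup-func registry with the same Add(func()) shape as the cleaner.cleanupFuncs snippet quoted above; the types and function bodies are stand-ins, not Cilium's actual daemon code:

```go
// Hypothetical stand-in for the daemon's cleanup-func registry; only the
// registration pattern is the point here.
package main

import "fmt"

type cleanupFuncs struct {
	funcs []func()
}

// Add mirrors the cleaner.cleanupFuncs.Add(...) calls shown in initHealth().
func (c *cleanupFuncs) Add(f func()) { c.funcs = append(c.funcs, f) }

// Run would be invoked on agent shutdown (e.g. on SIGTERM).
func (c *cleanupFuncs) Run() {
	for _, f := range c.funcs {
		f()
	}
}

// cleanupHealthEndpoint stands in for (*Daemon).cleanupHealthEndpoint: kill
// the responder, delete the endpoint from the endpoint manager, release the
// health devices.
func cleanupHealthEndpoint() {
	fmt.Println("killing responder, deleting health endpoint, removing devices")
}

func main() {
	var cleaner cleanupFuncs
	// Register the full cleanup instead of only KillEndpoint/CleanupEndpoint.
	cleaner.Add(cleanupHealthEndpoint)
	cleaner.Run()
}
```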

rgo3 avatar Jun 26 '24 14:06 rgo3

Do we have any guarantees that the StopFunc on a controller is run when agent shuts down?

No, controllers are not stopped when the agent shuts down. I think the StopFunc for the health EP controller is actually never ever called.

I think your solution of adding the cleanup call to the cleanup callbacks makes sense

gandro avatar Jun 26 '24 14:06 gandro

(potential fix https://github.com/cilium/cilium/pull/33409 )

borkmann avatar Jun 26 '24 15:06 borkmann

So my PR unfortunately didn't change anything :disappointed:; running setup 7 with my "fix" applied still failed in the same way. Now I am wondering if we might have the same problem as with the controller. Is it guaranteed that the cleanup funcs of the daemon will be run?

rgo3 avatar Jun 27 '24 14:06 rgo3

I fixed an issue a while back that was similar (but not specifically for the lxc_health endpoint). My solution at the time was to be more strict about waiting for the cilium uninstall to complete in cilium-cli: https://github.com/cilium/cilium-cli/pull/2344

From seeing several of these failures recently, they also only occur on upgrade tests and happen in the post-upgrade part of the test.

tommyp1ckles avatar Jun 27 '24 18:06 tommyp1ckles

Is it guaranteed that the cleanup funcs of the daemon will be run?

If they don't complete within 1 second, they might not run to completion, since we have a grace period of only one second. Also, in general, if the cilium pods are SIGKILLed, they will of course not run either.

It should be fairly easy to detect whether they are running by issuing a log line and seeing if it shows up. However, instead of relying on cleanup (which is never fully reliable and always best effort), it might be more robust to clean up old state during startup.
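A minimal sketch of what such startup cleanup could look like, assuming the vishvananda/netlink package; the helper name and the idea of running it before the health endpoint is (re)created are assumptions, not the actual health launcher code:

```go
// Hypothetical startup-time cleanup: remove leftover health veths from a
// previous agent instance so correctness does not depend on best-effort
// shutdown hooks. Device names are taken from the logs above.
package main

import (
	"errors"
	"log"

	"github.com/vishvananda/netlink"
)

// removeStaleHealthDevices deletes stale health devices, if any, before the
// health endpoint is (re)created.
func removeStaleHealthDevices() error {
	for _, name := range []string{"lxc_health", "cilium_health"} {
		link, err := netlink.LinkByName(name)
		if err != nil {
			var notFound netlink.LinkNotFoundError
			if errors.As(err, &notFound) {
				continue // nothing stale to clean up
			}
			return err
		}
		if err := netlink.LinkDel(link); err != nil {
			return err
		}
		log.Printf("removed stale health device %s", name)
	}
	return nil
}

func main() {
	if err := removeStaleHealthDevices(); err != nil {
		log.Fatal(err)
	}
}
```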

gandro avatar Jul 01 '24 08:07 gandro

Is there a rationale for the 1s? Maybe we could i) bump it to 5s and ii) add the log message to see if this has an effect, as Sebastian mentioned.

borkmann avatar Jul 01 '24 09:07 borkmann

IMO, it would be cleaner to defer the cleanup to the startup logic, rather than playing with timeouts. The shutdown hooks are always best effort

gandro avatar Jul 01 '24 09:07 gandro

Some additional findings from last Friday:

  1. That my PR doesn't fix anything is easy to explain: we are going from a 1.15 cluster to main, and my "potential" fix could only help if it were applied in 1.15.
  2. I was able to verify with a kind cluster that without my PR we don't clean up the state directory of the health endpoint, meaning that after a restart we potentially try to restore it. With my PR applied, the state directory of the health endpoint is removed. A simple reproducer is to install Cilium, run kubectl -n kube-system patch daemonset cilium -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}' to shut down all agents, ssh into one of the k8s nodes, and look for the state directory of the health endpoint in /run/cilium/state.
  3. Unfortunately I found the following code, which is called when we restore endpoints from disk. We already special-case the health endpoint during endpoint restore, meaning that even if the state directory is left on disk we shouldn't restore the health endpoint into the endpoint manager. So we already have cleanup logic on startup, which indicates to me that something else is going on.

We still might want to merge my PR, without the CI changes, since it improves our cleanup "hygiene", but I'm not convinced that it will fix this issue :disappointed:. I'm unassigning myself from this issue now, as I unfortunately don't have cycles to continue the investigation. cc @msune

rgo3 avatar Jul 01 '24 09:07 rgo3

(Ok, np, I'll trigger a debug PR tonight with first adding the log lines to see where we are.)

borkmann avatar Jul 01 '24 12:07 borkmann

This issue is at risk for the v1.16 timeline per its release-blocker status. Do we continue to believe the severity of this is medium or high, and that we have a path to resolve it in the next couple of weeks?

joestringer avatar Jul 08 '24 20:07 joestringer

@rgo3 Were you able to reproduce by hard-killing the Cilium pods, or just show that the cleanup doesn't happen?

tommyp1ckles avatar Jul 17 '24 22:07 tommyp1ckles

Took a deeper look today; I added a bunch of logging to my branch to trace this, so I'll roughly outline what I think is happening:

  • 46:48: Health endpoint created: 2024-07-18T22:46:48.562722733Z time="2024-07-18T22:46:48Z" level=info msg="New endpoint" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=470 ipv4=10.244.2.29 ipv6="fd00:10:244:2::a07b" k8sPodName=/ subsys=endpoint

  • At this point, other endpoints are regenerating; "regenerate-all-endpoints" is called due to the {eth0,eth1} devices being detected.

  • 48:16: Finally, after a bunch of other endpoints have been regenerated, the health endpoint is dequeued near the end (probably because restored endpoints are added to the endpoint queue prior to initHealth).

    • Note: At this point, it's been nearly 30 seconds, maybe giving us an indication of how quickly endpoint regen is progressing.
  • 49:49: At almost 3 minutes after the health endpoint was created, there is a call to remove the endpoint; I believe this happens because of the health endpoint controller ping timeout (which is 3 minutes): 2024-07-18T22:49:49.376847687Z time="2024-07-18T22:49:49Z" level=debug msg="Removing existing cilium-health endpoint" subsys=daemon

    • Question is why is this taking so long?
  • We can see that the endpoint is deleted from this log: 2024-07-18T22:49:49.376986235Z time="2024-07-18T22:49:49Z" level=debug msg="exiting retrying regeneration goroutine due to endpoint being deleted" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=470 identity=4 ipv4=10.244.2.29 ipv6="fd00:10:244:2::a07b" k8sPodName=/ subsys=endpoint

  • At this point, I see some of my own logs showing that the device_controller (i.e. watching for netlink events) sees lxc_health being deleted (msg_type=0x11 is a DELLINK msg from netlink).

2024-07-18T22:49:49.398767304Z [tom-debug] [device] name=lxc_health index=23 msg_type=0x10
2024-07-18T22:49:49.399654068Z [tom-debug] [device] name=lxc_health index=23 msg_type=0x11
  • This is picked up by the device-reloader as a device change and kicks off a "regenerate-all-endpoints": 2024-07-18T22:50:46.937230285Z time="2024-07-18T22:50:46Z" level=info msg="regenerating all endpoints" reason="devices changed" subsys=endpoint-manager

Then we see all endpoints being regenerated again, at which point it appears as though the lxc_health endpoint continues to be regenerated:

2024-07-18T22:50:46.937230285Z time="2024-07-18T22:50:46Z" level=info msg="regenerating all endpoints" reason="devices changed" subsys=endpoint-manager
2024-07-18T22:50:46.947171818Z time="2024-07-18T22:50:46Z" level=debug msg="Starting policy recalculation..." ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=470 identity=4 ipv4=10.244.2.29 ipv6="fd00:10:244:2::a07b" k8sPodName=/ subsys=endpoint
2024-07-18T22:50:46.947188219Z time="2024-07-18T22:50:46Z" level=debug msg="Storing desiredRedirects: map[]" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=470 identity=4 ipv4=10.244.2.29 ipv6="fd00:10:244:2::a07b" k8sPodName=/ subsys=endpoint
2024-07-18T22:50:46.947310738Z time="2024-07-18T22:50:46Z" level=debug msg="BPF header file hashed (was: \"\")" bpfHeaderfileHash=bb8940f30264d1eee79e7c81ea928120c3a7608f66e42c728a55916e6be7d7eb ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=470 identity=4 ipv4=10.244.2.29 ipv6="fd00:10:244:2::a07b" k8sPodName=/ subsys=endpoint
2024-07-18T22:50:46.947327108Z time="2024-07-18T22:50:46Z" level=debug msg="writing header file" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=470 file-path=470_next/ep_config.h identity=4 ipv4=10.244.2.29 ipv6="fd00:10:244:2::a07b" k8sPodName=/ subsys=endpoint
2024-07-18T22:50:46.947568789Z time="2024-07-18T22:50:46Z" level=debug msg="writing header file with DNSRules" DNSRulesV2="map[]" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=470 file-path=470_next/ep_config.h identity=4 ipv4=10.244.2.29 ipv6="fd00:10:244:2::a07b" k8sPodName=/ subsys=endpoint
2024-07-18T22:50:47.078830092Z time="2024-07-18T22:50:47Z" level=debug msg="Preparing to compile BPF" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=470 identity=4 ipv4=10.244.2.29 ipv6="fd00:10:244:2::a07b" k8sPodName=/ regeneration-level=rewrite+load subsys=endpoint
2024-07-18T22:50:47.079505040Z time="2024-07-18T22:50:47Z" level=debug msg="Load 1-min: 27.95 5-min: 22.57 15min: 11.53" endpointID=470 subsys=endpoint
2024-07-18T22:50:47.079876413Z time="2024-07-18T22:50:47Z" level=debug msg="Memory: Total: 11969 Used: 3529 (29.49%) Free: 308 Buffers: 139 Cached: 7773" endpointID=470 subsys=endpoint
2024-07-18T22:50:47.080116611Z time="2024-07-18T22:50:47Z" level=debug msg="Swap: Total: 0 Used: 0 (NaN%) Free: 0" endpointID=470 subsys=endpoint

But because the interface has been deleted, we get the error:

2024-07-18T22:50:48.573823921Z time="2024-07-18T22:50:48Z" level=error msg="Error while reloading endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=470 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.2.29 ipv6="fd00:10:244:2::a07b" k8sPodName=/ subsys=endpoint

End of day conclusions

  • Why is there such a long gap between the health endpoint creation starting and the endpoint being ready (i.e. responding to pings)? You can see this in the tests as well: in particular, in test tuple 7 with bpf-next, Cilium takes quite a while to become ready after the upgrade because of health endpoints failing. So it seems like there is some kind of problem with setting up the health endpoint that is exacerbated by the configuration in test tuple 7 (i.e. bpf-next, ...).
    • This might be contributing to the underlying issue as it's more likely to cause the health endpoint to be recreated due to the 3-minute timeout

Other thoughts/questions

  • Does the device manager doing a regenerate-all-endpoints as a result of lxc_health make sense? Maybe we should exclude those interfaces from causing endpoint regeneration, if only initially (see the sketch after this list).
  • As @rgo3 noted, there is logic to clean up endpoint resources for restored lxc_health endpoints; this seems redundant, as we will also remove such interfaces here.
  • I don't fully understand how an endpoint that's been removed from the endpointmanager gets regenerated (or maybe is just continuing to be regenerated). There is no second call to create another endpoint yet, so I wonder if the regenerate logic is just taking a while and is already past the point of being cancelled (i.e. dequeued and running).
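As referenced in the first bullet above, here is a hypothetical sketch of filtering Cilium-owned devices out of the netlink link-update stream before treating an event as a "devices changed" trigger, assuming the vishvananda/netlink package; the prefix check and everything else here are assumptions, not the actual device-reloader implementation:

```go
// Hypothetical device watcher: ignore Cilium-owned endpoint devices such as
// lxc_health, so that deleting/recreating the health veth does not count as
// an external device change that regenerates all endpoints.
package main

import (
	"log"
	"strings"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// isCiliumOwned filters out devices that Cilium itself creates and manages.
func isCiliumOwned(name string) bool {
	return strings.HasPrefix(name, "lxc") || strings.HasPrefix(name, "cilium_")
}

func main() {
	updates := make(chan netlink.LinkUpdate)
	done := make(chan struct{})
	defer close(done)

	if err := netlink.LinkSubscribe(updates, done); err != nil {
		log.Fatal(err)
	}

	for upd := range updates {
		name := upd.Link.Attrs().Name
		if isCiliumOwned(name) {
			continue // e.g. lxc_health DELLINK/NEWLINK churn
		}
		switch upd.Header.Type {
		case unix.RTM_NEWLINK, unix.RTM_DELLINK:
			log.Printf("external device change on %s, consider regenerating endpoints", name)
		}
	}
}
```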

tommyp1ckles avatar Jul 19 '24 04:07 tommyp1ckles

Excellent debugging!

Does the device manager doing a regenerate-all-endpoints as a result of lxc_health make sense? Maybe we should exclude those interfaces from causing endpoint regeneration, if only initially.

This does not look right; lxc_health needs to be excluded. I also saw this being triggered via device-change detection.

As @rgo3 noted, there is logic to clean up endpoint resources for restored lxc_health endpoints; this seems redundant, as we will also remove such interfaces here.

I presume the endpoint deletion itself might also delete the device in addition to cleanupHealthEndpoint(). I'd think that other than the agent teardown + startup we should only need KillEndpoint() and deleteEndpointQuiet() to avoid interference with the regular machinery.

I don't fully understand how an endpoint that's been removed from the endpointmanager gets regenerated (or maybe is just continuing to be regenerated). There is no second call to create another endpoint yet, so I wonder if the regenerate logic is just taking a while and is already past the point of being cancelled (i.e. dequeued and running).

This I also don't understand yet: we have endpoint locks scattered throughout the endpointmanager code, and it does check for the Disconnecting / Disconnected state and bail out, but it feels like we somehow manage to get past this point when attempting to attach.

borkmann avatar Jul 19 '24 08:07 borkmann

I presume the endpoint deletion itself might also delete the device in addition to cleanupHealthEndpoint(). I'd think that other than the agent teardown + startup we should only need KillEndpoint() and deleteEndpointQuiet() to avoid interference with the regular machinery.

I wonder, maybe that is the issue... that we're front-running the actual endpoint lifecycle here (as we also don't see this with regular endpoints). I'll give this a try in my test branch.

borkmann avatar Jul 19 '24 08:07 borkmann

Excellent debugging!

Does the device manager doing a regenerate-all-endpoints as a result of lxc_health make sense? Maybe we should exclude those interfaces from causing endpoint regeneration, if only initially.

This does not look right; lxc_health needs to be excluded. I also saw this being triggered via device-change detection.

Yep, this might be due to device-reloader.go not being conservative enough in checking whether the node addresses really changed: we might have had some device state changes that triggered a refresh of the node address table without actually changing its contents (but still updating the objects). It's unfortunate that https://github.com/cilium/cilium/pull/33023 didn't make it into v1.16, as it does this in a much more principled way.

A fix for v1.16, if we don't backport 33023, could be to add a check for the addresses themselves in the same style it checks the device names: https://github.com/cilium/cilium/blob/v1.16/daemon/cmd/device-reloader.go#L110. Alternatively (or additionally), we could make sure NodeAddress objects are not updated when the contents do not change, to avoid signalling changes to watchers unnecessarily.
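A rough sketch of that address comparison, with the NodeAddress shape simplified to netip.Addr; only the comparison idea is taken from the suggestion above, so this is not the actual device-reloader.go code:

```go
// Hypothetical helper: only signal "devices changed" when the set of node
// addresses actually differs, not merely because the objects were rewritten.
package main

import (
	"fmt"
	"net/netip"
	"slices"
)

// addressesChanged reports whether the two address sets differ, ignoring order.
func addressesChanged(old, updated []netip.Addr) bool {
	o := slices.Clone(old)
	u := slices.Clone(updated)
	slices.SortFunc(o, func(a, b netip.Addr) int { return a.Compare(b) })
	slices.SortFunc(u, func(a, b netip.Addr) int { return a.Compare(b) })
	return !slices.Equal(o, u)
}

func main() {
	old := []netip.Addr{netip.MustParseAddr("10.244.3.45")}
	updated := []netip.Addr{netip.MustParseAddr("10.244.3.45")}
	// The NodeAddress objects were rewritten, but the contents are identical:
	// no need to trigger a regenerate-all-endpoints.
	fmt.Println(addressesChanged(old, updated)) // false
}
```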

@dylandreimerink perhaps you could consider backporting it to v1.16 if it's not too many changes as the new way of doing this is much better than the device-reloader.go thing?

As @rgo3 noted, there is logic to clean up endpoint resources for restored lxc_health endpoints; this seems redundant, as we will also remove such interfaces here.

I presume the endpoint deletion itself might also delete the device in addition to cleanupHealthEndpoint(). I'd think that other than the agent teardown + startup we should only need KillEndpoint() and deleteEndpointQuiet() to avoid interference with the regular machinery.

I don't fully understand how an endpoint that's been removed from the endpointmanager gets regenerated (or maybe is just continuing to be regenerated). There is no second call to create another endpoint yet, so I wonder if the regenerate logic is just taking a while and is already past the point of being cancelled (i.e. dequeued and running).

This I also don't understand yet: we have endpoint locks scattered throughout the endpointmanager code, and it does check for the Disconnecting / Disconnected state and bail out, but it feels like we somehow manage to get past this point when attempting to attach.

joamaki avatar Jul 19 '24 08:07 joamaki

Excellent debugging!

Does the device manager doing a regenerate-all-endpoints as a result of lxc_health make sense? Maybe we should exclude those interfaces from causing endpoint regeneration, if only initially.

This does not look right; lxc_health needs to be excluded. I also saw this being triggered via device-change detection.

Yep, this might be due to device-reloader.go not being conservative enough in checking whether the node addresses really changed: we might have had some device state changes that triggered a refresh of the node address table without actually changing its contents (but still updating the objects). It's unfortunate that #33023 didn't make it into v1.16, as it does this in a much more principled way.

A fix for v1.16, if we don't backport 33023, could be to add a check for the addresses themselves in the same style it checks the device names: https://github.com/cilium/cilium/blob/v1.16/daemon/cmd/device-reloader.go#L110. Alternatively (or additionally), we could make sure NodeAddress objects are not updated when the contents do not change, to avoid signalling changes to watchers unnecessarily.

@dylandreimerink perhaps you could consider backporting it to v1.16 if it's not too many changes as the new way of doing this is much better than the device-reloader.go thing?

Sounds good, it could certainly reduce quite a bit of CPU load for users if the health endpoint is often recreated and, subsequently, all endpoints are regenerated.

I just pushed a change to https://github.com/cilium/cilium/pull/33700 to let the endpoint lifecycle handle device removal (except for agent start and teardown) to see if this removes the error of lxc_health being gone. Although I suspect there is no mechanism in the endpoint regeneration logic to wait a grace period if an endpoint regen is still in flight. Dirty hack might be to wait until the lxc_health is gone before we relaunch the health endpoint.
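For illustration, the "dirty hack" mentioned at the end could look roughly like the following, assuming the vishvananda/netlink package; the helper is hypothetical and not part of the linked PR:

```go
// Hypothetical helper: before relaunching the health endpoint, wait until the
// old lxc_health veth has actually disappeared, or give up after a grace
// period.
package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/vishvananda/netlink"
)

// waitForLinkGone polls for the named device until it no longer exists.
func waitForLinkGone(name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		_, err := netlink.LinkByName(name)
		var notFound netlink.LinkNotFoundError
		if errors.As(err, &notFound) {
			return nil // device is gone, safe to recreate
		}
		if err != nil {
			return err
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("device %s still present after %s", name, timeout)
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	if err := waitForLinkGone("lxc_health", 5*time.Second); err != nil {
		fmt.Println(err)
	}
}
```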

borkmann avatar Jul 19 '24 09:07 borkmann

Another interesting log (note the endpoints 791 vs 1289):

2024-07-19T09:53:24.645444393Z time=2024-07-19T09:53:24Z level=debug msg="upserting health status" module=health lastLevel=none reporter-id="agent.controlplane.endpoint-manager.cilium-endpoint-791 (cilium-test/test-conn-disrupt-client-5ccc6f6497-m52rv).datapath-regenerate" status="agent.controlplane.endpoint-manager.cilium-endpoint-791 (cilium-test/test-conn-disrupt-client-5ccc6f6497-m52rv).datapath-regenerate: [OK] Endpoint regeneration successful"
2024-07-19T09:53:24.651530130Z time=2024-07-19T09:53:24Z level=debug msg="stopping health reporter" module=health reporter-id=agent.controlplane.stale-endpoint-cleanup.job-endpoint-cleanup
[... few unrelated logs in between ...]
2024-07-19T09:53:24.697965383Z time="2024-07-19T09:53:24Z" level=debug msg="Inserted program handle_policy into cilium_call_policy slot 1289" subsys=datapath-loader
2024-07-19T09:53:24.708205909Z time="2024-07-19T09:53:24Z" level=error msg="Error while reloading endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1289 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.1.151 ipv6="fd00:10:244:1::2ac9" k8sPodName=/ subsys=endpoint

Interestingly, tracing further up in the log for endpoint 1289:

2024-07-19T09:53:23.339195502Z time="2024-07-19T09:53:23Z" level=debug msg="Preparing to compile BPF" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1289 identity=4 ipv4=10.244.1.151 ipv6="fd00:10:244:1::2ac9" k8sPodName=/ regeneration-level=rewrite+load subsys=endpoint
2024-07-19T09:53:23.341090692Z time="2024-07-19T09:53:23Z" level=debug msg="NAME cilium-agent STATUS S PID 1 CPU: 33.41% MEM: 4.49% CMDLINE: [cilium-agent --config-dir=/tmp/cilium/config-map] MEM-EXT: RSS: 537 VMS: 1707 Data: 497 Stack: 0 Locked: 0 Swap: 0" endpointID=3488 subsys=endpoint
2024-07-19T09:53:23.342812294Z time="2024-07-19T09:53:23Z" level=debug msg="Load 1-min: 23.53 5-min: 15.78 15min: 9.74" endpointID=1289 subsys=endpoint
2024-07-19T09:53:23.343293060Z time="2024-07-19T09:53:23Z" level=debug msg="Memory: Total: 11969 Used: 3572 (29.85%) Free: 204 Buffers: 148 Cached: 7819" endpointID=1289 subsys=endpoint
2024-07-19T09:53:23.343759669Z time="2024-07-19T09:53:23Z" level=debug msg="Swap: Total: 0 Used: 0 (NaN%) Free: 0" endpointID=1289 subsys=endpoint
[...]
2024-07-19T09:53:24.645272293Z time="2024-07-19T09:53:24Z" level=info msg="Restored endpoint" endpointID=791 ipAddr="[10.244.1.160 fd00:10:244:1::69eb]" subsys=endpoint
2024-07-19T09:53:24.645275629Z time="2024-07-19T09:53:24Z" level=info msg="Finished regenerating restored endpoints" regenerated=7 subsys=daemon total=7
2024-07-19T09:53:24.645279436Z time="2024-07-19T09:53:24Z" level=debug msg="Loading Collection into kernel" subsys=datapath-loader
2024-07-19T09:53:24.645282492Z time="2024-07-19T09:53:24Z" level=debug msg="Registered BPF map" path=/sys/fs/bpf/tc/globals/cilium_call_policy subsys=bpf
2024-07-19T09:53:24.645285578Z time="2024-07-19T09:53:24Z" level=debug msg="Unregistered BPF map" path=/sys/fs/bpf/tc/globals/cilium_call_policy subsys=bpf
2024-07-19T09:53:24.645292430Z time="2024-07-19T09:53:24Z" level=debug msg="Failed to remove ID 1289 from global policy map" error="unable to delete element Endpoint: 1289 from map cilium_call_policy: delete: key does not exist" subsys=datapath-maps
2024-07-19T09:53:24.645295657Z time="2024-07-19T09:53:24Z" level=info msg="Removed stale bpf map" file-path=/sys/fs/bpf/tc/globals/cilium_policy_01289 subsys=daemon

logs-cilium-4v2l9-cilium-agent-20240719-101402.log

borkmann avatar Jul 19 '24 11:07 borkmann

[...]

As @rgo3 noted, there is logic to cleanup endpoint resources for restored lxc_health endpoints, this seems redundant as we also will remove such interfaces here.

I presume the endpoint deletion itself might also delete the device in addition to cleanupHealthEndpoint(). I'd think that other than the agent teardown + startup we should only need KillEndpoint() and deleteEndpointQuiet() to avoid interference with the regular machinery.

From reading the code, it looks like the endpoint termination just unloads tc(x) and puts the device down so no traffic can leak. For normal Pods, the CNI plugin later does the final link.DeleteByName(). Given that the health endpoint is "special", iiuc there won't be a CNI-triggered removal event.

borkmann avatar Jul 19 '24 12:07 borkmann