alertmanager icon indicating copy to clipboard operation
alertmanager copied to clipboard

fix: set context timeout for resolvePeers

Open eyazici90 opened this issue 9 months ago • 2 comments

Context

Given resolvePeers is context aware, yet, it is used by background context that may lead to blocking call. This is due to the fact that LookupIPAddr is called under the hood that does dns look up which may be blocked in some platforms.

In case of dns lookups are blocked, it may lead to hanging goroutine due to the fact that main goroutine is being used for the call that leads to application hanging & being unresponsive as well as keep getting restarted by kube with no info in the app.

Below logs are taken from one of the alertmanager kube/pods.

│ ts=2025-04-03T11:12:38.668Z caller=main.go:181 level=info msg="Starting Alertmanager" version="(version=0.27.0, branch=HEAD, revision=x)" │ │ ts=2025-04-03T11:12:38.668Z caller=main.go:182 level=info build_context="(go=go1.21.7, platform=linux/amd64, user=root@x, date=20240228-11:51:20, tags=netgo)"

Cillumpolicy was not applied to allow to dns resolve yet. This led to app hangs with no logs & keep getting restarted due to liveness probes(/healthy, /ready) were not registered. Which was pretty painful to troubleshoot.

This was due to LookupIPAddr hanging on main goroutine.

select {
	case <-ctx.Done():
		// Our context was canceled. If we are the only
		// goroutine looking up this key, then drop the key
		// from the lookupGroup and cancel the lookup.
		// If there are other goroutines looking up this key,
		// let the lookup continue uncanceled, and let later
		// lookups with the same key share the result.
		// See issues 8602, 20703, 22724.
		if r.getLookupGroup().ForgetUnshared(lookupKey) {
			lookupGroupCancel()
			go dnsWaitGroupDone(ch, func() {})
		} else {
			go dnsWaitGroupDone(ch, lookupGroupCancel)
		}
		err := newDNSError(mapErr(ctx.Err()), host, "")
		if trace != nil && trace.DNSDone != nil {
			trace.DNSDone(nil, false, err)
		}
		return nil, err
	case r := <-ch:
		dnsWaitGroup.Done()
		lookupGroupCancel()
		err := r.Err
		if err != nil {
			if _, ok := err.(*DNSError); !ok {
				err = newDNSError(mapErr(err), host, "")
			}
		}
		if trace != nil && trace.DNSDone != nil {
			addrs, _ := r.Val.([]IPAddr)
			trace.DNSDone(ipAddrsEface(addrs), r.Shared, err)
		}
		return lookupIPReturn(r.Val, err, r.Shared)
	}

Letting the control flow continues will lead to l.Debug("resolved peers to following addresses", "peers", strings.Join(resolvedPeers, ",")) that will not block the app to log as well as registering liveness probes(/healthy, /ready).

eyazici90 avatar Apr 03 '25 15:04 eyazici90

Hey @grobinson-grafana 👋 Could u take a quick glance at that, lmk what u think? Based on ur response, we may push it further or not.

eyazici90 avatar Apr 03 '25 15:04 eyazici90

Could u help me with review @SuperQ ?

eyazici90 avatar Oct 18 '25 10:10 eyazici90

Looks like a couple tests need fixing.

SuperQ avatar Nov 06 '25 17:11 SuperQ

I guess if other people are also concerned, we could bump it to 30s by default, but that seems extremely long.

I am fine to change it as default: 30s, however, imo 15s is already enough as default.

eyazici90 avatar Nov 06 '25 18:11 eyazici90

Looks like a couple tests need fixing.

Fixed @SuperQ

eyazici90 avatar Nov 06 '25 18:11 eyazici90