Error "connect: cannot assign requested address" due to goroutine / socket leak
We recently upgraded to Drone 1.6.1 running in Kubernetes and just had an issue where all builds were failing with the following error:
Post http://localhost:3001: dial tcp 127.0.0.1:3001: connect: cannot assign requested address
That endpoint is the DRONE_CONVERT_PLUGIN_ENDPOINT, which in our case points to the drone-convert-starlark extension running as a sidecar in the pod spec. Restarting the drone-ci server pod fixed the immediate issue, but having seen that error before (it usually indicates exhaustion of the ephemeral port range), I started looking into metrics and found what looks like a resource leak related to how the server calls the converter.
process_open_fds{kubernetes_namespace="drone"} grew to 28267 over ~22 days
go_goroutines{kubernetes_namespace="drone"} grew to 56547 over the same period, suggesting goroutines are not returning/finishing
Investigating the open connections inside the new pod, we saw that connections from localhost to the converter service on 127.0.0.1:3001 were often (but not always) staying in the ESTABLISHED state. Established connections to the secret endpoint (drone-vault in our case) may also be leaking, but at a much slower pace.
Below are four samples, each taken 10 minutes apart, showing source IPs and the local listening port they are connected to. All the sockets in this data are in the ESTABLISHED state, and you can also see the go_goroutines gauge metric steadily increasing.
/ # for run in `seq 1 10`; do
      uptime
      netstat -nat > /dev/shm/netstat
      echo "Totals: $(curl -s 127.0.0.1:80/metrics | grep '^go_goroutines') Established $(grep 'ESTABLISHED' /dev/shm/netstat | wc -l)"
      gawk 'BEGIN {OFS="\t"} $1 ~ /tcp/ && $6 !~ /LISTEN/ && $4 ~ /:(3000|3001)$/ {ndst=split($4,dst,":"); nsrc=split($5,src,":"); groups[(src[nsrc-1] "-> :" dst[ndst])]++; states[(src[nsrc-1] "-> :" dst[ndst])][$6]++} END {PROCINFO["sorted_in"]="@val_num_desc"; for (g in groups) if (groups[g]>5) {statelist=" "; for (s in states[g]) statelist=statelist s "=" states[g][s] " "; print groups[g], g, statelist}}' /dev/shm/netstat
      sleep 600
    done
23:50:40 up 6 days, 6:07, load average: 0.04, 0.75, 2.24
Totals: go_goroutines 2295 Established 2545
1088 127.0.0.1-> :3001 ESTABLISHED=1088
75 10.97.82.95-> :3000 ESTABLISHED=75
69 10.102.153.135-> :3000 ESTABLISHED=69
57 10.101.239.149-> :3000 ESTABLISHED=57
48 10.96.21.19-> :3000 ESTABLISHED=48
37 10.97.82.93-> :3000 ESTABLISHED=37
17 10.97.111.6-> :3000 ESTABLISHED=17
16 10.107.202.127-> :3000 ESTABLISHED=16
7 10.102.229.196-> :3000 ESTABLISHED=7
00:00:41 up 6 days, 6:17, load average: 0.00, 0.08, 1.15
Totals: go_goroutines 2350 Established 2637
1136 127.0.0.1-> :3001 ESTABLISHED=1136
77 10.97.82.95-> :3000 ESTABLISHED=77
70 10.102.153.135-> :3000 ESTABLISHED=70
55 10.101.239.149-> :3000 ESTABLISHED=55
51 10.96.21.19-> :3000 ESTABLISHED=51
38 10.97.82.93-> :3000 ESTABLISHED=38
22 10.97.111.6-> :3000 ESTABLISHED=22
18 10.107.202.127-> :3000 ESTABLISHED=18
9 10.102.229.196-> :3000 ESTABLISHED=9
00:10:41 up 6 days, 6:27, load average: 0.20, 0.10, 0.63
Totals: go_goroutines 2459 Established 2746
1167 127.0.0.1-> :3001 ESTABLISHED=1167
76 10.97.82.95-> :3000 ESTABLISHED=76
69 10.102.153.135-> :3000 ESTABLISHED=69
59 10.101.239.149-> :3000 ESTABLISHED=59
52 10.96.21.19-> :3000 ESTABLISHED=52
42 10.97.82.93-> :3000 ESTABLISHED=42
24 10.97.111.6-> :3000 ESTABLISHED=24
21 10.107.202.127-> :3000 ESTABLISHED=21
13 10.102.229.196-> :3000 ESTABLISHED=13
00:20:42 up 6 days, 6:37, load average: 0.19, 0.32, 0.47
Totals: go_goroutines 2490 Established 2785
1184 127.0.0.1-> :3001 ESTABLISHED=1184
75 10.97.82.95-> :3000 ESTABLISHED=75
69 10.102.153.135-> :3000 ESTABLISHED=69
62 10.101.239.149-> :3000 ESTABLISHED=62
55 10.96.21.19-> :3000 ESTABLISHED=55
42 10.97.82.93-> :3000 ESTABLISHED=42
24 10.97.111.6-> :3000 ESTABLISHED=24
23 10.107.202.127-> :3000 ESTABLISHED=23
15 10.102.229.196-> :3000 ESTABLISHED=15
Checking the code, I do see a 60 second timeout in a few places (the remote converter method, the http client's Do call), so I'd expect connections to get cleaned up fairly quickly, not the continued growth we're seeing.
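For context, here is roughly how such a request timeout behaves (an illustrative sketch against a throwaway test server, not the actual Drone code; `timeoutDemo` is a name made up for this example). Note that `http.Client.Timeout` bounds a single request/response exchange; it says nothing about how long an idle keep-alive connection may afterwards sit in the transport's pool, so a request-level timeout alone would not explain sockets getting cleaned up.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// timeoutDemo shows that http.Client.Timeout bounds the whole request
// (dial + headers + body) and returns an error when exceeded. It does
// NOT limit how long an idle keep-alive connection lives in the
// transport's pool after the response is consumed.
func timeoutDemo() error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(200 * time.Millisecond) // slower than the client allows
		fmt.Fprint(w, "too late")
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 50 * time.Millisecond}
	_, err := client.Get(srv.URL)
	return err // non-nil: the client gave up waiting for headers
}

func main() {
	fmt.Println(timeoutDemo() != nil)
}
```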
I don't see any pprof endpoints in the 1.6 server, but I hope to get some stack traces by triggering a core dump with pkill -SIGABRT $(pidof drone-server). Hopefully that will help track down what all the extra goroutines and sockets are doing.
We tagged 1.6.4, which includes the pprof endpoints when you pass DRONE_PPROF_ENABLED=true to the server. I agree this is strange since, as you mentioned, we do place a timeout on requests and we always close the response body. I look forward to seeing what you uncover with pprof enabled. Thanks!
Excellent. I will try to get that set up in a test environment to see if I can repro the issue and track down where the leak is occurring. I was able to use SIGABRT to get a core dump and found 14425 pairs of goroutines (one in select, one in IO wait). But both are in Go runtime code, so it doesn't point directly to where in Drone the issue is occurring.
I made a gist with the deduped output at https://gist.github.com/gbrayut/1e699947a65bbb850492ffb7629060b8 if anyone else wants to try to hunt things down, but pprof will likely be easier to use since it can do more advanced analysis.
I've been racking my brain trying to think of what could result in a buildup of open file descriptors. The only thing I can think of is the fact that we create a new http.Client per request. It may be possible that the transport is respecting some sort of keep-alive and holding connections open, which build up over time?
I will patch the code to use a single http.Client instead of re-creating one on each request. Even if this does not solve the problem, it will lead to less garbage and lower memory usage, so it can't hurt.
https://github.com/drone/drone/blob/c42273084f62309ef3cdce35a985e7f404335558/plugin/converter/remote.go#L60
I pushed a patch to :latest to reuse the same client for all requests. I am not sure this will make any difference, but it is an improvement nevertheless and worth a shot. See https://github.com/drone/drone/commit/f1dd01e50c4eae3e7d5151f4f81a32e90d584ff9
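The intended effect of sharing one client can be sketched like this (illustrative code, not the actual Drone patch; `get` and `demo` are names invented for the example). With a single shared client, its one Transport pools the keep-alive connection, so the second request reuses the existing socket instead of opening a new one, which net/http/httptrace can confirm.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"time"
)

// sharedClient is created once, as in the patch; its single Transport
// pools keep-alive connections so later requests can reuse them.
var sharedClient = &http.Client{Timeout: 60 * time.Second}

// get issues a GET and reports, via httptrace, whether an existing
// pooled connection was reused instead of dialing a new socket.
func get(url string) (reused bool, err error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return false, err
	}
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) { reused = info.Reused },
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := sharedClient.Do(req)
	if err != nil {
		return false, err
	}
	io.Copy(io.Discard, resp.Body) // drain so the connection returns to the pool
	resp.Body.Close()
	return reused, nil
}

// demo makes two requests and returns whether each reused a connection.
func demo() (bool, bool, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()
	first, err := get(srv.URL)
	if err != nil {
		return false, false, err
	}
	second, err := get(srv.URL)
	return first, second, err
}

func main() {
	first, second, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(first, second) // false true: the second request reused the socket
}
```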
Is there any progress?
Our service encountered the same problem when connecting to the database:
x.x.x.x:3306: connect: cannot assign requested address