Error "connect: cannot assign requested address" due to goroutine / socket leak
We recently upgraded to Drone 1.6.1 running in Kubernetes and just had an issue where all builds were failing with the following error:
Post http://localhost:3001: dial tcp 127.0.0.1:3001: connect: cannot assign requested address
That endpoint is the DRONE_CONVERT_PLUGIN_ENDPOINT, which in our case points to the drone-convert-starlark extension running as a sidecar in the pod spec. Restarting the drone-ci server pod fixed the immediate issue, but having seen that error before (it usually indicates exhaustion of the ephemeral port range), I started looking into metrics and found what looks like a resource leak related to how the server calls the converter.
process_open_fds{kubernetes_namespace="drone"} grew to 28267 over ~22 days
go_goroutines{kubernetes_namespace="drone"} grew to 56547 over the same period, suggesting goroutines are not returning/finishing
Investigating the open connections inside the new pod, we saw that connections from localhost to the converter service on 127.0.0.1:3001 were often (but not always) staying in the ESTABLISHED state. Established connections to the secret endpoint (drone-vault in our case) may also be leaking, but at a much slower pace.
Below are four samples, each taken 10 minutes apart, showing source IPs and the local listening port they are connected to. All the sockets in this data are in the ESTABLISHED state, and you can also see the go_goroutines gauge metric steadily increasing.
/ # for run in `seq 1 10`; do
      uptime
      netstat -nat > /dev/shm/netstat
      echo "Totals: $(curl -s 127.0.0.1:80/metrics | grep '^go_goroutines') Established $(grep 'ESTABLISHED' /dev/shm/netstat | wc -l)"
      gawk 'BEGIN {OFS="\t"} $1 ~ /tcp/ && $6 !~ /LISTEN/ && $4 ~ /:(3000|3001)$/ {ndst=split($4,dst,":"); nsrc=split($5,src,":"); groups[(src[nsrc-1] "-> :" dst[ndst])]++; states[(src[nsrc-1] "-> :" dst[ndst])][$6]++} END {PROCINFO["sorted_in"]="@val_num_desc"; for (g in groups) if (groups[g]>5) {statelist=" "; for (s in states[g]) statelist=statelist s "=" states[g][s] " "; print groups[g], g, statelist}}' /dev/shm/netstat
      sleep 600
    done
23:50:40 up 6 days, 6:07, load average: 0.04, 0.75, 2.24
Totals: go_goroutines 2295 Established 2545
1088 127.0.0.1-> :3001 ESTABLISHED=1088
75 10.97.82.95-> :3000 ESTABLISHED=75
69 10.102.153.135-> :3000 ESTABLISHED=69
57 10.101.239.149-> :3000 ESTABLISHED=57
48 10.96.21.19-> :3000 ESTABLISHED=48
37 10.97.82.93-> :3000 ESTABLISHED=37
17 10.97.111.6-> :3000 ESTABLISHED=17
16 10.107.202.127-> :3000 ESTABLISHED=16
7 10.102.229.196-> :3000 ESTABLISHED=7
00:00:41 up 6 days, 6:17, load average: 0.00, 0.08, 1.15
Totals: go_goroutines 2350 Established 2637
1136 127.0.0.1-> :3001 ESTABLISHED=1136
77 10.97.82.95-> :3000 ESTABLISHED=77
70 10.102.153.135-> :3000 ESTABLISHED=70
55 10.101.239.149-> :3000 ESTABLISHED=55
51 10.96.21.19-> :3000 ESTABLISHED=51
38 10.97.82.93-> :3000 ESTABLISHED=38
22 10.97.111.6-> :3000 ESTABLISHED=22
18 10.107.202.127-> :3000 ESTABLISHED=18
9 10.102.229.196-> :3000 ESTABLISHED=9
00:10:41 up 6 days, 6:27, load average: 0.20, 0.10, 0.63
Totals: go_goroutines 2459 Established 2746
1167 127.0.0.1-> :3001 ESTABLISHED=1167
76 10.97.82.95-> :3000 ESTABLISHED=76
69 10.102.153.135-> :3000 ESTABLISHED=69
59 10.101.239.149-> :3000 ESTABLISHED=59
52 10.96.21.19-> :3000 ESTABLISHED=52
42 10.97.82.93-> :3000 ESTABLISHED=42
24 10.97.111.6-> :3000 ESTABLISHED=24
21 10.107.202.127-> :3000 ESTABLISHED=21
13 10.102.229.196-> :3000 ESTABLISHED=13
00:20:42 up 6 days, 6:37, load average: 0.19, 0.32, 0.47
Totals: go_goroutines 2490 Established 2785
1184 127.0.0.1-> :3001 ESTABLISHED=1184
75 10.97.82.95-> :3000 ESTABLISHED=75
69 10.102.153.135-> :3000 ESTABLISHED=69
62 10.101.239.149-> :3000 ESTABLISHED=62
55 10.96.21.19-> :3000 ESTABLISHED=55
42 10.97.82.93-> :3000 ESTABLISHED=42
24 10.97.111.6-> :3000 ESTABLISHED=24
23 10.107.202.127-> :3000 ESTABLISHED=23
15 10.102.229.196-> :3000 ESTABLISHED=15
Checking the code, I do see a 60 second timeout in a few places (the remote converter method, the http client's Do call), so I'd expect connections to get cleaned up fairly quickly, not the continued growth we're seeing.
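For context, here is roughly how such a request timeout behaves (an illustrative sketch against a throwaway test server, not the actual Drone code; `timeoutDemo` is a name made up for this example). Note that `http.Client.Timeout` bounds a single request/response exchange; it says nothing about how long an idle keep-alive connection may afterwards sit in the transport's pool, so a request-level timeout alone would not explain sockets getting cleaned up.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// timeoutDemo shows that http.Client.Timeout bounds the whole request
// (dial + headers + body) and returns an error when exceeded. It does
// NOT limit how long an idle keep-alive connection lives in the
// transport's pool after the response is consumed.
func timeoutDemo() error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(200 * time.Millisecond) // slower than the client allows
		fmt.Fprint(w, "too late")
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 50 * time.Millisecond}
	_, err := client.Get(srv.URL)
	return err // non-nil: the client gave up waiting for headers
}

func main() {
	fmt.Println(timeoutDemo() != nil)
}
```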
I don't see any pprof endpoints in the 1.6 server, but I hope to get some stack traces by triggering a core dump with pkill -SIGABRT $(pidof drone-server). Hopefully that will help track down what all the extra goroutines and sockets are doing.
We tagged 1.6.4, which includes the pprof endpoints when you pass DRONE_PPROF_ENABLED=true to the server. I agree this is strange since, as you mentioned, we do place a timeout on requests and we always close the response body. I look forward to seeing what you uncover with pprof enabled. Thanks!
Excellent. I will try to get that set up in a test environment to see if I can repro the issue and track down where the leak is occurring. I was able to use SIGABRT to get a core dump and found 14425 pairs of goroutines (one in select, one in IO wait). But both are in Go runtime code, so it doesn't point directly to where in Drone the issue is occurring.
I made a gist with the deduped output at https://gist.github.com/gbrayut/1e699947a65bbb850492ffb7629060b8 if anyone else wants to try to hunt things down, but pprof will likely be easier to use since it can do more advanced analysis.
I've been racking my brain trying to think of what could result in a buildup of open file descriptors. The only thing I can think of is the fact that we create a new http.Client per request. It may be possible that the transport is respecting some sort of keep-alive and holding connections open, which build up over time?
I will patch the code to use a single http.Client instead of re-creating one on each request. Even if this does not solve the problem, it will lead to less garbage and lower memory usage, so it can't hurt.
https://github.com/drone/drone/blob/c42273084f62309ef3cdce35a985e7f404335558/plugin/converter/remote.go#L60
I pushed a patch to :latest to reuse the same client for all requests. I am not sure this will make any difference, but it is an improvement nevertheless and worth a shot. See https://github.com/drone/drone/commit/f1dd01e50c4eae3e7d5151f4f81a32e90d584ff9
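The intended effect of sharing one client can be sketched like this (illustrative code, not the actual Drone patch; `get` and `demo` are names invented for the example). With a single shared client, its one Transport pools the keep-alive connection, so the second request reuses the existing socket instead of opening a new one, which net/http/httptrace can confirm.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"time"
)

// sharedClient is created once, as in the patch; its single Transport
// pools keep-alive connections so later requests can reuse them.
var sharedClient = &http.Client{Timeout: 60 * time.Second}

// get issues a GET and reports, via httptrace, whether an existing
// pooled connection was reused instead of dialing a new socket.
func get(url string) (reused bool, err error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return false, err
	}
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) { reused = info.Reused },
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := sharedClient.Do(req)
	if err != nil {
		return false, err
	}
	io.Copy(io.Discard, resp.Body) // drain so the connection returns to the pool
	resp.Body.Close()
	return reused, nil
}

// demo makes two requests and returns whether each reused a connection.
func demo() (bool, bool, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()
	first, err := get(srv.URL)
	if err != nil {
		return false, false, err
	}
	second, err := get(srv.URL)
	return first, second, err
}

func main() {
	first, second, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(first, second) // false true: the second request reused the socket
}
```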
Is there any progress?
Our service encountered the same problem when connecting to the database:
x.x.x.x:3306: connect: cannot assign requested address