etcd-mesos
etcd-mesos scheduler doesn't clean up health check TCP sockets
etcd version 2.2.3, mesosphere/etcd-mesos:0.1.3
The etcd-mesos container leaves sockets open (we saw over 42,000). This eats up ports and bleeds into other port ranges, eventually killing Spartan and, with it, the health of the worker nodes.
Ports 1026 and 1027 are the ports etcd is running on:
{
"name": "_etcd-server._client.etcd-ptx.mesos.",
"host": "etcd-server-8hw4j-s4.etcd-ptx.mesos.:1026",
"rtype": "SRV"
}
{
"name": "_etcd-server._client.etcd-ptx.mesos.",
"host": "etcd-server-yemwe-s0.etcd-ptx.mesos.:1027",
"rtype": "SRV"
}
{
"name": "_etcd-server._client.etcd-ptx.mesos.",
"host": "etcd-server-6rfx8-s6.etcd-ptx.mesos.:1026",
"rtype": "SRV"
}
$ sudo lsof -i -P -n | grep -oc :1026
10527
$ sudo lsof -i -P -n | grep -oc :1027
3905
You can see about 10k connections open here; we saw Spartan killed at around 40k.
$ sudo lsof -i -P -n | grep 1026
px 131653 root 5u IPv4 73405621 0t0 TCP 10.251.206.22:36338->10.251.206.17:1026 (ESTABLISHED)
px 131653 root 7u IPv4 73394523 0t0 TCP 10.251.206.22:36340->10.251.206.17:1026 (ESTABLISHED)
px 131653 root 24u IPv4 74751557 0t0 TCP 10.251.206.22:37012->10.251.206.17:1026 (ESTABLISHED)
px 131653 root 26u IPv4 73414087 0t0 TCP 10.251.206.22:36470->10.251.206.17:1026 (ESTABLISHED)
px 131653 root 30u IPv4 74757811 0t0 TCP 10.251.206.22:37640->10.251.206.17:1026 (ESTABLISHED)
px 131653 root 31u IPv4 74740409 0t0 TCP 10.251.206.22:36816->10.251.206.17:1026 (ESTABLISHED)
etcd-meso 168715 root 13u IPv4 74566478 0t0 TCP 10.251.206.22:37786->10.251.206.17:1026 (ESTABLISHED)
etcd-meso 168715 root 15u IPv4 74693374 0t0 TCP 10.251.206.22:37873->10.251.206.17:1026 (ESTABLISHED)
etcd-meso 168715 root 17u IPv4 74693376 0t0 TCP 10.251.206.22:38010->10.251.206.17:1026 (ESTABLISHED)
etcd-meso 168715 root 18u IPv4 74566510 0t0 TCP 10.251.206.22:58952->10.251.206.13:1026 (ESTABLISHED)
etcd-meso 168715 root 20u IPv4 74783939 0t0 TCP 10.251.206.22:59134->10.251.206.13:1026 (ESTABLISHED)
etcd-meso 168715 root 21u IPv4 74786210 0t0 TCP 10.251.206.22:59136->10.251.206.13:1026 (ESTABLISHED)
etcd-meso 168715 root 22u IPv4 74768044 0t0 TCP 10.251.206.22:38198->10.251.206.17:1026 (ESTABLISHED)
etcd-meso 168715 root 24u IPv4 74759147 0t0 TCP 10.251.206.22:38514->10.251.206.17:1026 (ESTABLISHED)
etcd-meso 168715 root 27u IPv4 74755262 0t0 TCP 10.251.206.22:59608->10.251.206.13:1026 (ESTABLISHED)
etcd-meso 168715 root 30u IPv4 74793585 0t0 TCP 10.251.206.22:38952->10.251.206.17:1026 (ESTABLISHED)
etcd-meso 168715 root 31u IPv4 74800159 0t0 TCP 10.251.206.22:39084->10.251.206.17:1026 (ESTABLISHED)
etcd-meso 168715 root 32u IPv4 74800160 0t0 TCP 10.251.206.22:60026->10.251.206.13:1026 (ESTABLISHED)
etcd-meso 168715 root 33u IPv4 74800195 0t0 TCP 10.251.206.22:39206->10.251.206.17:1026 (ESTABLISHED)
etcd-meso 168715 root 34u IPv4 74800196 0t0 TCP 10.251.206.22:60150->10.251.206.13:1026 (ESTABLISHED)
etcd-meso 168715 root 35u IPv4 74800228 0t0 TCP 10.251.206.22:39346->10.251.206.17:1026 (ESTABLISHED)
etcd-meso 168715 root 36u IPv4 74800229 0t0 TCP 10.251.206.22:60288->10.251.206.13:1026 (ESTABLISHED)
etcd-meso 168715 root 37u IPv4 74800195 0t0 TCP 10.251.206.22:39488->10.251.206.17:1026 (ESTABLISHED)
I wrote a quick fix: https://github.com/minyk/etcd-mesos/commit/2b54e65119aed4c8ea5112c8f3927fab80194672 It just adds client.Close(). With that change, the number of connections has been very stable (~20) so far.