Statsd-proxy missing metrics
Hi,
We're running statsd-proxy on a c4.4xlarge AWS EC2 instance (Xeon E5-2666 v3, Haswell, 16 cores), relaying traffic to 12 t2.medium statsd instances.
At peak we're processing just over 100K metrics every 10 seconds, but we're losing between 20% and 70% of them for a few hours every night.
We've been playing around with the cache size and the number of statsd instances and statsd-proxy forks. This all works fine during the day without much load, but as soon as we start passing any real volume of stats to it overnight we start dropping metrics.
Are we doing anything wrong?
We've tuned UDP on the statsd-proxy host by setting the following values:
sysctl -w net.core.netdev_max_backlog=2000000
sysctl -w net.ipv4.udp_wmem_min=67108864
sysctl -w net.ipv4.udp_rmem_min=67108864
sysctl -w net.core.wmem_max=134217728
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.rmem_default=134217728
sysctl -w net.core.wmem_default=134217728
sysctl -w net.core.somaxconn=32768
sysctl -w net.core.optmem_max=25165824
sysctl -w net.ipv4.udp_mem='1443354 1924472 12582912'
After doing this we're no longer seeing any "packet receive errors":
IcmpMsg:
InType3: 6865
OutType3: 270670
Udp:
11278356669 packets received
12556230 packets to unknown port received.
0 packet receive errors
12847117493 packets sent
SndbufErrors: 254275
UdpLite:
IpExt:
InOctets: 1197631547451
OutOctets: 1256880216870
InNoECTPkts: 11809254448
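Even with the receive errors gone, SndbufErrors is still non-zero, which looks like drops on the send side towards the backends rather than on the receive side. Below is a small sketch we can run next to the proxy to watch the UDP error counters move in real time; it's plain Node and Linux-only (it just parses /proc/net/snmp, so the exact field names depend on the kernel version):

// Sketch: poll /proc/net/snmp and print deltas of the UDP error counters,
// so buffer-related drops show up as they happen. Not part of statsd;
// requires Node 12+ for Object.fromEntries.
const fs = require('fs');

function readUdpCounters() {
  const udpLines = fs.readFileSync('/proc/net/snmp', 'utf8')
    .split('\n')
    .filter(l => l.startsWith('Udp:'));           // header row, then value row
  const keys = udpLines[0].split(/\s+/).slice(1); // field names
  const vals = udpLines[1].split(/\s+/).slice(1).map(Number);
  return Object.fromEntries(keys.map((k, i) => [k, vals[i]]));
}

let prev = readUdpCounters();
setInterval(() => {
  const cur = readUdpCounters();
  ['InErrors', 'RcvbufErrors', 'SndbufErrors'].forEach(k => {
    if (k in cur && cur[k] !== prev[k]) {
      console.log(`${k} +${cur[k] - prev[k]} (total ${cur[k]})`);
    }
  });
  prev = cur;
}, 10000); // every 10s, matching our flush interval

If SndbufErrors only jumps during the nightly peak, that would narrow the loss to the proxy-to-backend leg.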
This is our ProxyConfig.js:
{
    nodes: [
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126}
    ],
    udp_version: 'udp4',
    host: '0.0.0.0',
    port: 8125,
    forkCount: 12,
    checkInterval: 1000,
    cacheSize: 500000
}
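For reference, here's a minimal sketch of the kind of load generator we can use to reproduce the nightly volume during the day. It's hypothetical (plain Node dgram, not part of statsd); the metric names and rates are made up, and 127.0.0.1 assumes the proxy runs locally:

// Hypothetical load generator: floods the proxy with counter metrics over
// UDP so the drop rate can be measured on demand instead of waiting for
// the nightly peak. All names and rates here are placeholders.
const dgram = require('dgram');

const sock = dgram.createSocket('udp4');
const PROXY_HOST = '127.0.0.1'; // assumption: proxy on the same host
const PROXY_PORT = 8125;

let sent = 0;
const BATCH = 1000;             // metrics per tick
setInterval(() => {
  for (let i = 0; i < BATCH; i++) {
    const msg = Buffer.from(`test.metric.${i % 100}:1|c`);
    sock.send(msg, PROXY_PORT, PROXY_HOST);
    sent++;
  }
}, 100); // ~10K metrics/sec, roughly our peak of 100K per 10s

setInterval(() => console.log(`sent ${sent} metrics`), 10000);

Comparing the sent counter here against what the 12 backends actually flush makes the loss rate directly visible.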
This is an extract of the top output while the server is under load; it doesn't look like the statsd-proxy workers are all doing the same amount of work:
top - 19:07:40 up 7 days, 5:59, 1 user, load average: 4.12, 3.62, 3.49
Tasks: 163 total, 6 running, 157 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.9%us, 6.9%sy, 0.0%ni, 74.7%id, 0.0%wa, 0.0%hi, 10.9%si, 2.6%st
Mem: 30882320k total, 2803960k used, 28078360k free, 153476k buffers
Swap: 0k total, 0k used, 0k free, 1253640k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13222 root 20 0 790m 97m 8464 S 72.2 0.3 5:37.71 node
13209 root 20 0 789m 96m 8460 S 70.2 0.3 5:32.77 node
13211 root 20 0 791m 97m 8460 R 69.8 0.3 5:25.33 node
13205 root 20 0 790m 96m 8460 S 65.8 0.3 5:16.17 node
27 root 20 0 0 0 0 R 64.2 0.0 159:41.29 ksoftirqd/5
13216 root 20 0 787m 94m 8460 S 56.9 0.3 4:41.13 node
43 root 20 0 0 0 0 R 55.9 0.0 572:51.15 ksoftirqd/9
13219 root 20 0 785m 91m 8464 R 54.5 0.3 3:46.58 node
13214 root 20 0 762m 68m 8460 S 53.5 0.2 4:08.05 node
13212 root 20 0 787m 94m 8460 R 51.5 0.3 4:14.51 node
13208 root 20 0 768m 74m 8460 S 45.2 0.2 3:46.18 node
13207 root 20 0 765m 72m 8460 S 40.9 0.2 3:19.62 node
13220 root 20 0 763m 70m 8464 S 40.9 0.2 3:30.28 node
13206 root 20 0 762m 69m 8460 S 40.6 0.2 3:19.76 node
671 root 20 0 0 0 0 S 0.7 0.0 3:27.26 kworker/5:1
7 root 20 0 0 0 0 S 0.3 0.0 7:15.86 rcu_sched
675 root 20 0 0 0 0 S 0.3 0.0 5:43.74 kworker/9:1
21974 root 20 0 137m 20m 1504 S 0.3 0.1 28:49.15 python
-_- https://github.com/hit9/statsd-proxy
@hit9 Awesome! Thanks!
"all works fine during the day without much load, but as soon as we start passing any real volume of stats to it overnight we start dropping metrics."
I've met a similar situation: all backends are removed when I start passing a real amount of stats.
After adding debug logging, I traced the cause to https://github.com/statsd/statsd/blob/master/proxy.js#L267, but I have no idea what is happening there. I used nc to test from the same host and I'm sure the backend is alive. It seems the statsd proxy just can't connect out when it receives a large volume of metrics.
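That line sits in the proxy's periodic backend health check, which (as far as I can tell) talks to each backend's admin port. It can be exercised by hand with a minimal sketch like the one below, assuming the standard statsd admin TCP interface and its health command; the host, port, and timeout are placeholders:

// Minimal sketch of the kind of check the proxy performs: open a TCP
// connection to the backend's admin port and issue the `health` command.
// A healthy statsd backend answers "health: up". Host/port are placeholders.
const net = require('net');

function checkHealth(host, adminport) {
  const conn = net.connect(adminport, host, () => conn.write('health\r\n'));
  conn.setTimeout(2000); // assumption: 2s is generous for a local network
  conn.on('data', (data) => {
    console.log(`${host}:${adminport} -> ${data.toString().trim()}`);
    conn.end();
  });
  conn.on('timeout', () => {
    console.log(`${host}:${adminport} -> timed out`);
    conn.destroy();
  });
  conn.on('error', (err) => console.log(`${host}:${adminport} -> ${err.message}`));
}

checkHealth('127.0.0.1', 8126); // placeholder backend

If this only fails while the proxy is under heavy load, the health check may be starving on the busy event loop rather than the backend actually being down, which would match backends getting removed only at peak.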