Statsd-proxy missing metrics
Hi,
We're running statsd-proxy on a c4.4xlarge AWS EC2 instance (Xeon E5-2666 v3, Haswell, 16 cores), relaying traffic to 12 t2.medium statsd instances.
At peak we're processing just over 100K metrics every 10 seconds, but we're losing between 20% and 70% of them for a few hours every night.
We've been playing around with the cache size and the number of statsd instances and statsd-proxy forks. This all works fine during the day without much load, but as soon as we start passing any real volume of stats to it overnight we start dropping metrics.
Are we doing anything wrong?
We've tuned UDP on the statsd-proxy host by setting the following values:
sysctl -w net.core.netdev_max_backlog=2000000
sysctl -w net.ipv4.udp_wmem_min=67108864
sysctl -w net.ipv4.udp_rmem_min=67108864
sysctl -w net.core.wmem_max=134217728
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.rmem_default=134217728
sysctl -w net.core.wmem_default=134217728
sysctl -w net.core.somaxconn=32768
sysctl -w net.core.optmem_max=25165824
sysctl -w net.ipv4.udp_mem='1443354 1924472 12582912'
After doing this we're no longer seeing any "packet receive errors":
IcmpMsg:
InType3: 6865
OutType3: 270670
Udp:
11278356669 packets received
12556230 packets to unknown port received.
0 packet receive errors
12847117493 packets sent
SndbufErrors: 254275
UdpLite:
IpExt:
InOctets: 1197631547451
OutOctets: 1256880216870
InNoECTPkts: 11809254448
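Even with the receive errors gone, SndbufErrors is still non-zero, which looks like drops on the send side towards the backends rather than on the receive side. Below is a small sketch we can run next to the proxy to watch the UDP error counters move in real time; it's plain Node and Linux-only (it just parses /proc/net/snmp, so the exact field names depend on the kernel version):

// Sketch: poll /proc/net/snmp and print deltas of the UDP error counters,
// so buffer-related drops show up as they happen. Not part of statsd;
// requires Node 12+ for Object.fromEntries.
const fs = require('fs');

function readUdpCounters() {
  const udpLines = fs.readFileSync('/proc/net/snmp', 'utf8')
    .split('\n')
    .filter(l => l.startsWith('Udp:'));           // header row, then value row
  const keys = udpLines[0].split(/\s+/).slice(1); // field names
  const vals = udpLines[1].split(/\s+/).slice(1).map(Number);
  return Object.fromEntries(keys.map((k, i) => [k, vals[i]]));
}

let prev = readUdpCounters();
setInterval(() => {
  const cur = readUdpCounters();
  ['InErrors', 'RcvbufErrors', 'SndbufErrors'].forEach(k => {
    if (k in cur && cur[k] !== prev[k]) {
      console.log(`${k} +${cur[k] - prev[k]} (total ${cur[k]})`);
    }
  });
  prev = cur;
}, 10000); // every 10s, matching our flush interval

If SndbufErrors only jumps during the nightly peak, that would narrow the loss to the proxy-to-backend leg.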
This is our ProxyConfig.js:
{
    nodes: [
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126},
        {host: 'xxx.xxx.xxx.xxx', port: 8125, adminport: 8126}
    ],
    udp_version: 'udp4',
    host: '0.0.0.0',
    port: 8125,
    forkCount: 12,
    checkInterval: 1000,
    cacheSize: 500000
}
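For reference, here's a minimal sketch of the kind of load generator we can use to reproduce the nightly volume during the day. It's hypothetical (plain Node dgram, not part of statsd); the metric names and rates are made up, and 127.0.0.1 assumes the proxy runs locally:

// Hypothetical load generator: floods the proxy with counter metrics over
// UDP so the drop rate can be measured on demand instead of waiting for
// the nightly peak. All names and rates here are placeholders.
const dgram = require('dgram');

const sock = dgram.createSocket('udp4');
const PROXY_HOST = '127.0.0.1'; // assumption: proxy on the same host
const PROXY_PORT = 8125;

let sent = 0;
const BATCH = 1000;             // metrics per tick
setInterval(() => {
  for (let i = 0; i < BATCH; i++) {
    const msg = Buffer.from(`test.metric.${i % 100}:1|c`);
    sock.send(msg, PROXY_PORT, PROXY_HOST);
    sent++;
  }
}, 100); // ~10K metrics/sec, roughly our peak of 100K per 10s

setInterval(() => console.log(`sent ${sent} metrics`), 10000);

Comparing the sent counter here against what the 12 backends actually flush makes the loss rate directly visible.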
This is an extract of the top output while the server is under load; it doesn't look like the statsd-proxy workers are all doing the same amount of work:
top - 19:07:40 up 7 days, 5:59, 1 user, load average: 4.12, 3.62, 3.49
Tasks: 163 total, 6 running, 157 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.9%us, 6.9%sy, 0.0%ni, 74.7%id, 0.0%wa, 0.0%hi, 10.9%si, 2.6%st
Mem: 30882320k total, 2803960k used, 28078360k free, 153476k buffers
Swap: 0k total, 0k used, 0k free, 1253640k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13222 root 20 0 790m 97m 8464 S 72.2 0.3 5:37.71 node
13209 root 20 0 789m 96m 8460 S 70.2 0.3 5:32.77 node
13211 root 20 0 791m 97m 8460 R 69.8 0.3 5:25.33 node
13205 root 20 0 790m 96m 8460 S 65.8 0.3 5:16.17 node
27 root 20 0 0 0 0 R 64.2 0.0 159:41.29 ksoftirqd/5
13216 root 20 0 787m 94m 8460 S 56.9 0.3 4:41.13 node
43 root 20 0 0 0 0 R 55.9 0.0 572:51.15 ksoftirqd/9
13219 root 20 0 785m 91m 8464 R 54.5 0.3 3:46.58 node
13214 root 20 0 762m 68m 8460 S 53.5 0.2 4:08.05 node
13212 root 20 0 787m 94m 8460 R 51.5 0.3 4:14.51 node
13208 root 20 0 768m 74m 8460 S 45.2 0.2 3:46.18 node
13207 root 20 0 765m 72m 8460 S 40.9 0.2 3:19.62 node
13220 root 20 0 763m 70m 8464 S 40.9 0.2 3:30.28 node
13206 root 20 0 762m 69m 8460 S 40.6 0.2 3:19.76 node
671 root 20 0 0 0 0 S 0.7 0.0 3:27.26 kworker/5:1
7 root 20 0 0 0 0 S 0.3 0.0 7:15.86 rcu_sched
675 root 20 0 0 0 0 S 0.3 0.0 5:43.74 kworker/9:1
21974 root 20 0 137m 20m 1504 S 0.3 0.1 28:49.15 python
-_- https://github.com/hit9/statsd-proxy
@hit9 Awesome! Thanks!
"all works fine during the day without much load, but as soon as we start passing any real volume of stats to it overnight we start dropping metrics."
I've met a similar situation: all backends are removed when I start passing a real amount of stats.
After adding debug logging, I traced the cause to https://github.com/statsd/statsd/blob/master/proxy.js#L267, but I have no idea what is happening there. I used nc to test from the same host and I'm sure the backend is alive. It seems the statsd proxy just can't connect out when it receives a large volume of metrics.
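That line sits in the proxy's periodic backend health check, which (as far as I can tell) talks to each backend's admin port. It can be exercised by hand with a minimal sketch like the one below, assuming the standard statsd admin TCP interface and its health command; the host, port, and timeout are placeholders:

// Minimal sketch of the kind of check the proxy performs: open a TCP
// connection to the backend's admin port and issue the `health` command.
// A healthy statsd backend answers "health: up". Host/port are placeholders.
const net = require('net');

function checkHealth(host, adminport) {
  const conn = net.connect(adminport, host, () => conn.write('health\r\n'));
  conn.setTimeout(2000); // assumption: 2s is generous for a local network
  conn.on('data', (data) => {
    console.log(`${host}:${adminport} -> ${data.toString().trim()}`);
    conn.end();
  });
  conn.on('timeout', () => {
    console.log(`${host}:${adminport} -> timed out`);
    conn.destroy();
  });
  conn.on('error', (err) => console.log(`${host}:${adminport} -> ${err.message}`));
}

checkHealth('127.0.0.1', 8126); // placeholder backend

If this only fails while the proxy is under heavy load, the health check may be starving on the busy event loop rather than the backend actually being down, which would match backends getting removed only at peak.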