
StatsD seems to be dropping metrics, probably because of UDP

Open mstoykov opened this issue 3 years ago • 18 comments

Discussed in a forum thread.

My last comment confirms that:

  1. running between Docker containers is worse
  2. fixing the MTU in Docker helps
  3. lowering the push interval (effectively always pushing smaller packets) helps more and effectively fixes it in my case, although this shouldn't be needed as the library should be doing the same :shrug:
  4. updating datadog-go makes it slightly worse :facepalm:
  5. k6 v0.29.0 (a random older version I tried) is even worse :facepalm:

Possible actions:

  1. lower the push interval - this seems like a good quick fix; given that we send metrics as UDP packets, which have a small size limit, k6 not flushing them more regularly doesn't really help anything IMO (see the sketch after this list)
  2. investigate this further - it might not be UDP-related at all, in which case switching StatsD libraries to something that also supports TCP might not even help
  3. investigate datadog-go + aggregation in datadog-go ... maybe it will help :shrug:
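A hedged sketch of option 1 using the StatsD output's environment variables (the address, interval, and script name are examples, not values from this thread):

```sh
# Flush the StatsD output every 100ms instead of the default 1s, so each
# flush produces a smaller burst of UDP datagrams.
K6_STATSD_ADDR=localhost:8125 \
K6_STATSD_PUSH_INTERVAL=100ms \
k6 run --out statsd script.js
```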

mstoykov avatar May 28 '21 12:05 mstoykov

Is there any update on this?

shekharshukla avatar Jul 08 '22 14:07 shekharshukla

If there is an update, we will update the issue. Do you want to work on it?

mstoykov avatar Jul 11 '22 08:07 mstoykov

I'm having the same problem with that :(

K6 output: (screenshot)

DataDog: (screenshot)

rodrigotheodoropsv avatar Oct 26 '22 18:10 rodrigotheodoropsv

Any updates about this issue? I'm facing the same behavior.

Idosegal avatar May 17 '23 16:05 Idosegal

Same here

DanielCastronovo avatar May 22 '23 17:05 DanielCastronovo

I just changed where I run the datadog-agent, along with some OS settings, and it worked for me. I now run the agent on the same machine as k6, so the metrics go from k6 to the datadog-agent over the loopback interface, and the agent forwards them to DataDog normally.

rodrigotheodoropsv avatar May 23 '23 16:05 rodrigotheodoropsv

@rodrigotheodoropsv - That's great that you got it working. Can you tell me what settings were changed to get it to work?

bgallagher88 avatar May 23 '23 18:05 bgallagher88

We use an Ubuntu VM, but I don't know the version or other details, because a DevOps team takes care of it.

PS: many of these settings are there for other things, not just this issue.

The settings are:

| Setting | Value |
| --- | --- |
| net.core.somaxconn | 65535 |
| net.ipv4.tcp_max_tw_buckets | 1440000 |
| net.ipv4.tcp_max_syn_backlog | 20000 |
| net.ipv4.ip_local_port_range | 1024 65535 |
| net.ipv4.tcp_fin_timeout | 2 |
| net.ipv4.tcp_slow_start_after_idle | 0 |
| vm.swappiness | 1 |
| vm.overcommit_memory | 1 |
| net.ipv4.tcp_wmem | 4096 12582912 16777216 |
| net.ipv4.tcp_rmem | 4096 12582912 16777216 |
| net.ipv4.tcp_retries2 | 5 |
| net.ipv4.tcp_tw_reuse | 1 |
| net.ipv6.conf.all.disable_ipv6 | 1 |
| net.ipv6.conf.default.disable_ipv6 | 1 |
| vm.max_map_count | 999999 |
| fs.nr_open | 100000000 |
| fs.file-max | 100000000 |
| kernel.pid_max | 3999999 |
| vm.dirty_expire_centisecs | 500 |
| vm.dirty_writeback_centisecs | 100 |
| vm.dirty_ratio | 15 |
| vm.dirty_background_ratio | 5 |
| net.ipv4.tcp_keepalive_intvl | 10 |
| net.ipv4.tcp_keepalive_time | 30 |
| net.ipv4.tcp_keepalive_probes | 4 |
| net.core.rmem_default | 33554432 |
| net.core.wmem_default | 33554432 |
| net.core.rmem_max | 33554432 |
| net.core.wmem_max | 33554432 |
| net.core.netdev_max_backlog | 2500 |
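Most of these are unrelated to this issue; the ones that plausibly matter for UDP drops are the receive-buffer and backlog sizes. A sketch of applying just those (values taken from the table above, adjust as needed):

```sh
# Enlarge the kernel's UDP receive buffers and device backlog so bursts of
# StatsD datagrams are less likely to be dropped before the agent reads them.
sudo sysctl -w net.core.rmem_default=33554432
sudo sysctl -w net.core.rmem_max=33554432
sudo sysctl -w net.core.netdev_max_backlog=2500
# To persist, add the same keys to /etc/sysctl.conf and run: sudo sysctl -p
```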

rodrigotheodoropsv avatar May 24 '23 19:05 rodrigotheodoropsv

Has anyone experienced the same in a containerized environment?

Idosegal avatar May 25 '23 18:05 Idosegal

I tried to investigate the issue, and my initial hunch is that the problem is not related to UDP itself. It seems that the statsd exporter cannot handle incoming packets fast enough.

I first used netcat to verify that packets are delivered to the statsd host (neither tcpdump nor a package manager exists in the statsd-exporter container).

I changed the statsd address to port 9127 in the k6 container definition:

K6_STATSD_ADDR: "statsdex:9127"

I also added a sleep to the k6 command so that it waits until nc is started on the statsd-exporter container.

On statsd-exporter, I started nc: nc -l -u -p 9127 > packets.txt

After the end of the test, the request count reported by k6 matches the "k6.http_reqs" count in the captured packets:

$ grep -c k6.http_reqs out2.txt 
13786

k6_1 | http_reqs......................: 13786 229.537003/s
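For reference, a docker-compose sketch of the setup described above (service names, images, and the sleep duration are assumptions, not the exact files from this test):

```yaml
services:
  statsdex:
    image: prom/statsd-exporter
    # nc was started manually inside this container:
    #   nc -l -u -p 9127 > packets.txt
  k6:
    image: grafana/k6
    environment:
      K6_STATSD_ADDR: "statsdex:9127"  # point k6 at the nc listener instead of port 9125
    # override the entrypoint so the test waits until nc is listening
    entrypoint: ["sh", "-c", "sleep 10 && k6 run --out statsd /scripts/test.js"]
    volumes:
      - ./scripts:/scripts
```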

On the other hand, when I point k6 at the statsd exporter's real port, the exporter's own UDP receive counter reports fewer packets than the ones captured by tcpdump on the k6 host. UDP packets in the pcap (udp.dstport == 9125): 21535

$ curl -s http://localhost:9102/metrics | grep udp
# HELP statsd_exporter_udp_packets_total The total number of StatsD packets received over UDP.
# TYPE statsd_exporter_udp_packets_total counter
statsd_exporter_udp_packets_total 21272
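A sketch of how the two counts above can be compared (the interface and file names are assumptions):

```sh
# On the k6 host: capture the StatsD traffic while the test runs.
tcpdump -i eth0 -w k6_statsd.pcap udp and dst port 9125

# Count the captured datagrams with the same display filter as above...
tshark -r k6_statsd.pcap -Y "udp.dstport == 9125" | wc -l

# ...and compare with the exporter's own counter:
curl -s http://localhost:9102/metrics | grep statsd_exporter_udp_packets_total
```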

When I checked the statsd_exporter code (https://github.com/prometheus/statsd_exporter/blob/871e2d8df1da51d4eed27738f4bc079271a09c61/pkg/listener/listener.go#L54-L82), it seems that instead of processing the packet in a separate goroutine/thread, it parses the packet into events and sends them to a channel inside the UDP-handling goroutine itself. Parsing can take a long time depending on the metrics delivered, and this may cause packet drops.
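A simplified Go sketch of the read loop described above (placeholder code, not the actual statsd_exporter source):

```go
package listener

import "net"

// listen mirrors the pattern described above: the same goroutine that reads
// from the UDP socket also parses the packet. parse stands in for the
// exporter's line/metric handling. While parse runs, ReadFromUDP is not
// called, so bursts can overflow the kernel receive buffer and datagrams
// are silently dropped.
func listen(conn *net.UDPConn, parse func([]byte)) {
	buf := make([]byte, 65535)
	for {
		n, _, err := conn.ReadFromUDP(buf)
		if err != nil {
			return
		}
		parse(buf[:n]) // parsing happens inline, before the next read
	}
}
```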

It seems that k6 sends a large UDP packet:

k6.vus:15.000000|g k6.vus_max:50.000000|g k6.http_reqs:1|c|#name:https://test.k6.io/,tls_version:tls1.3,scenario:default,url:https://test.k6.io/,method:GET,status:200,proto:HTTP/1.1,expected_response:true k6.http_req_duration:146.138629|ms|#name:https://test.k6.io/,tls_version:tls1.3,proto:HTTP/1.1,scenario:default,expected_response:true,url:https://test.k6.io/,method:GET,status:200 k6.http_req_blocked:3677.270817|ms|#expected_response:true,scenario:default,name:https://test.k6.io/,url:https://test.k6.io/,method:GET,status:200,proto:HTTP/1.1,tls_version:tls1.3 k6.http_req_connecting:12.663258|ms|#status:200,expected_response:true,url:https://test.k6.io/,name:https://test.k6.io/,scenario:default,proto:HTTP/1.1,method:GET,tls_version:tls1.3 k6.http_req_tls_handshaking:3607.250707|ms|#expected_response:true,scenario:default,name:https://test.k6.io/,tls_version:tls1.3,url:https://test.k6.io/,method:GET,status:200,proto:HTTP/1.1 k6.http_req_sending:0.178607|ms|#url:https://test.k6.io/,proto:HTTP/1.1,expected_response:true,tls_version:tls1.3,method:GET,status:200,scenario:default,name:https://test.k6.io/ k6.http_req_waiting:145.692124|ms|#name:https://test.k6.io/,proto:HTTP/1.1,status:200,url:https://test.k6.io/,method:GET,tls_version:tls1.3,scenario:default,expected_response:true

This is my initial investigation; it may not be 100% correct though :)

kullanici0606 avatar Sep 07 '23 14:09 kullanici0606

Lowering the flush interval helps not because packets are smaller (tcpdump shows similar packet sizes, i.e. around 1400 bytes) but because packets are not sent in bursts when the interval is smaller.

When I compared network captures for the 1 s and 100 ms intervals, they show this (summarized):

1-second interval: 564 packets in 40 ms -> 14.1 packets/ms
100 ms interval: 23 packets in 5 ms -> 4.6 packets/ms

So a higher flush interval means more packets for statsd_exporter to process in less time; if it cannot keep up, it drops packets.

Before going further, could you please tell me which direction to pursue?

The newer statsd (datadog-go) client has a feature called "Client Side Aggregation", which reduces the number of packets sent to statsd by aggregating metrics on the client side. However, k6's dependency policy explicitly states that updating the datadog-go dependency is not preferred, so I don't think going in that direction is feasible.

kullanici0606 avatar Sep 08 '23 06:09 kullanici0606

I missed the referenced pull request (https://github.com/alphagov/di-devplatform-performance/pull/122), which also lowers the flush interval.

I also tried changing the statsd_exporter logic to process UDP packets in a separate goroutine, which helped a lot (no packet drops occurred over several attempts):

https://github.com/kullanici0606/statsd_exporter/tree/enqueu_udp_packets
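Roughly, the change decouples reading from parsing; a self-contained sketch of that pattern (not the exact code in the branch):

```go
package listener

import "net"

// readLoop only reads and enqueues raw datagrams; a slow parse no longer
// delays the next ReadFromUDP call. Using a buffered channel for packets
// absorbs short bursts.
func readLoop(conn *net.UDPConn, packets chan<- []byte) {
	buf := make([]byte, 65535)
	for {
		n, _, err := conn.ReadFromUDP(buf)
		if err != nil {
			close(packets)
			return
		}
		// Copy the datagram so buf can be reused for the next read.
		pkt := make([]byte, n)
		copy(pkt, buf[:n])
		packets <- pkt
	}
}

// parseLoop drains the queue and does the slower parsing work; parse stands
// in for statsd_exporter's line/metric handling.
func parseLoop(packets <-chan []byte, parse func([]byte)) {
	for pkt := range packets {
		parse(pkt)
	}
}
```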

I think lowering the flush interval to 100 ms, increasing the buffer size, and changing the statsd_exporter logic a little bit would be the easiest and best path to a solution.

kullanici0606 avatar Sep 08 '23 14:09 kullanici0606

The k6 version in the Dockerfile on the forum is v0.32.0, and since v0.33.0 some tags are blocked (blacklisted) by default so that UDP packets are smaller:

https://github.com/grafana/k6/issues/2048

https://github.com/grafana/k6/blob/3655065227b09ee368c29fb3dde7650761e22f68/release%20notes/v0.33.0.md#breaking-changes

So upgrading k6 to v0.33.0 or newer will definitely help. To sum up:

  • Upgrade to v0.33.0 or later
  • Lower the push interval to 100ms or less (K6_STATSD_PUSH_INTERVAL: 100ms in docker-compose.yml)

will reduce the probability of packet drops.
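Putting the two points together, a docker-compose sketch (the image tag, script path, and values are examples only):

```yaml
services:
  k6:
    image: grafana/k6:0.33.0            # or any newer release with the tag blocklist
    environment:
      K6_STATSD_ADDR: "statsd:8125"
      K6_STATSD_PUSH_INTERVAL: "100ms"  # flush more often to avoid large bursts
      K6_STATSD_BUFFER_SIZE: "20"       # a count of buffered metrics, not a duration
    command: run --out statsd /scripts/test.js
    volumes:
      - ./scripts:/scripts
```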

@mstoykov Do you think this issue needs further investigation?

I tried implementing part of the client-side aggregation in k6 by decorating/wrapping the statsd client, but it seems that most of the metrics are trends (timing information), which cannot be aggregated easily. The counter is easy, but there is only one counter, so aggregating it helped only a little when I tried it.

https://github.com/grafana/k6/blob/3655065227b09ee368c29fb3dde7650761e22f68/metrics/builtin.go#L87-L95
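For illustration, here is the kind of wrapper I mean, written against a minimal interface rather than the exact datadog-go API (all names are hypothetical; only counters are aggregated, which is exactly the limitation above):

```go
package statsdagg

import (
	"strings"
	"sync"
	"time"
)

// counterClient is the minimal surface needed for the sketch; the real
// datadog-go client exposes a compatible Count method.
type counterClient interface {
	Count(name string, value int64, tags []string, rate float64) error
}

type counterKey struct {
	name string
	tags string // joined tag list, used only as a map key
}

// aggregatingClient buffers counter increments per (name, tags) pair and
// flushes them periodically, turning many Count calls into a few packets.
type aggregatingClient struct {
	mu     sync.Mutex
	counts map[counterKey]int64
	tagSet map[counterKey][]string
	next   counterClient
}

func newAggregatingClient(next counterClient, flushEvery time.Duration) *aggregatingClient {
	a := &aggregatingClient{
		counts: map[counterKey]int64{},
		tagSet: map[counterKey][]string{},
		next:   next,
	}
	go func() {
		for range time.Tick(flushEvery) {
			a.flush()
		}
	}()
	return a
}

func (a *aggregatingClient) Count(name string, value int64, tags []string, rate float64) error {
	k := counterKey{name: name, tags: strings.Join(tags, ",")}
	a.mu.Lock()
	a.counts[k] += value
	a.tagSet[k] = tags
	a.mu.Unlock()
	return nil
}

func (a *aggregatingClient) flush() {
	a.mu.Lock()
	counts, tagSet := a.counts, a.tagSet
	a.counts, a.tagSet = map[counterKey]int64{}, map[counterKey][]string{}
	a.mu.Unlock()
	for k, v := range counts {
		// Trend/timing metrics cannot be collapsed like this, so they would
		// still go through the underlying client unaggregated.
		_ = a.next.Count(k.name, v, tagSet[k], 1)
	}
}
```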

kullanici0606 avatar Sep 20 '23 15:09 kullanici0606

@kullanici0606 This issue will likely be moved, along with all the other statsd issues, "soon" :tm: to https://github.com/LeonAdato/xk6-output-statsd as part of #2982.

I guess v0.33.0 helped a lot, but from what I remember when I last tested it, this still happens with enough requests.

mstoykov avatar Sep 20 '23 16:09 mstoykov

Yes, even with v0.33.0 the problem occurs sometimes, so I also tried to make some improvements on the statsd_exporter side: https://github.com/prometheus/statsd_exporter/pull/511

@mstoykov should I stop working on this issue then?

kullanici0606 avatar Sep 20 '23 16:09 kullanici0606

should I stop working on this issue then?

I would recommend either waiting for the final move of the issues or opening an issue in the new repo.

The k6 core team will in practice have no connection with this and as part of that I will not tell anyone how and what to do with the extension :)

mstoykov avatar Sep 21 '23 07:09 mstoykov