
inputs.statsd - Increasing number of UDP pending messages dropped

Open gregarndt opened this issue 1 year ago • 16 comments

Please direct all support questions to Slack or the forums. Thank you.

Opening this ticket in response to discussing the issue on Slack. The issue is that despite setting allowed_pending_messages to a very large number like 1.5M, we're still dropping UDP packets and are unsure why. There was a suggestion that perhaps there are not enough parsers available to drain the channel before packets are dropped. It seems that once about 500k messages are received we start dropping packets.

[graph: UDP messages dropped]
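
For context, the drop being counted here behaves like a bounded queue with a non-blocking enqueue. Below is a minimal Go sketch of that pattern, with illustrative names and sizes (an assumption about the design, not Telegraf's actual code):

package main

import "fmt"

// packet stands in for a raw statsd datagram.
type packet []byte

func main() {
	// allowedPendingMessages plays the role of allowed_pending_messages.
	const allowedPendingMessages = 4
	in := make(chan packet, allowedPendingMessages)
	dropped := 0

	// Listener side: enqueue without blocking. If the parsers have not
	// drained the channel and it is full, the packet is dropped instead.
	for i := 0; i < 10; i++ {
		select {
		case in <- packet(fmt.Sprintf("metric:%d|c", i)):
		default:
			dropped++ // surfaces as "UDP pending messages dropped"
		}
	}
	fmt.Println("queued:", len(in), "dropped:", dropped)
}

In this model, raising the buffer size only delays the drops if the parsers cannot keep up with the arrival rate.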

gregarndt avatar Nov 30 '22 21:11 gregarndt

Hello! I recommend posting this question in our Community Slack or Community Page; we have a lot of talented community members there who could help answer your question more quickly. You can also learn more about Telegraf by enrolling at InfluxDB University for free!

Heads up, this issue will be automatically closed after 7 days of inactivity. Thank you!

telegraf-tiger[bot] avatar Nov 30 '22 21:11 telegraf-tiger[bot]

@gregarndt can you please add your (redacted) config for the statsd plugin?

srebhan avatar Nov 30 '22 21:11 srebhan

Here is our statsd config:

# Telegraf Configuration
#
# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.
#
# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.
#
# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.
#
# Environment variables can be used anywhere in this config file, simply prepend
# them with $. For strings the variable must be within quotes (ie, "$STR_VAR"),
# for numbers and booleans they should be plain (ie, $INT_VAR, $BOOL_VAR)


# Global tags can be specified here in key="value" format.
[global_tags]
  #dc = "us-east-1" # will tag all metrics with dc=us-east-1
  #cloud-provider = "aws"
  metric-agent = "statsd"
  ## Environment variables can be used as tags, and throughout the config file
  # user = "$USER"


# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 100000

  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  ## This buffer only fills when writes fail to output plugin(s).
  metric_buffer_limit = 1000000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "5s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "5s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""

  ## Logging configuration:
  ## Run telegraf with debug log messages.
  debug = false
  ## Run telegraf in quiet mode (error log messages only).
  quiet = false
  ## Specify the log file name. The empty string means to log to stderr.
  logfile = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  #omit_hostname = false
  omit_hostname = true


###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  ## The full HTTP or UDP URL for your InfluxDB instance.
  ##
  ## Multiple urls can be specified as part of the same cluster,
  ## this means that only ONE of the urls will be written to each interval.
  # urls = ["udp://127.0.0.1:8089"] # UDP endpoint example
  urls = ["http://172.23.0.9:8086"] # required
  ## The target database for metrics (telegraf will create it if not exists).
  database = "redox" # required

  ## Name of existing retention policy to write to.  Empty string writes to
  ## the default retention policy.
  retention_policy = ""
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"

  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "5s"
  # username = "telegraf"
  # password = "metricsmetricsmetricsmetrics"
  ## Set the user agent for HTTP POSTs (can be useful for log differentiation)
  # user_agent = "telegraf"
  ## Set UDP payload size, defaults to InfluxDB UDP Client default (512 bytes)
  # udp_payload = 512

  ## Optional SSL Config
  # ssl_ca = "/etc/telegraf/ca.pem"
  # ssl_cert = "/etc/telegraf/cert.pem"
  # ssl_key = "/etc/telegraf/key.pem"
  ## Use SSL but skip chain & host verification
  # insecure_skip_verify = false

  ## HTTP Proxy Config
  # http_proxy = "http://corporate.proxy:3128"

  ## Optional HTTP headers
  # http_headers = {"X-Special-Header" = "Special-Value"}

  ## Compress each HTTP request payload using GZIP.
  content_encoding = "gzip"


###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################



###############################################################################
#                            SERVICE INPUT PLUGINS                            #
###############################################################################

# # Statsd UDP/TCP Server
[[inputs.statsd]]
#   ## Protocol, must be "tcp", "udp", "udp4" or "udp6" (default=udp)
  protocol = "udp"
#
#   ## MaxTCPConnection - applicable when protocol is set to tcp (default=250)
#   max_tcp_connections = 250
#
#   ## Address and port to host UDP listener on
  service_address = ":8125"
#
#   ## The following configuration options control when telegraf clears its cache
#   ## of previous values. If set to false, then telegraf will only clear its
#   ## cache when the daemon is restarted.
#   ## Reset gauges every interval (default=true)
#   delete_gauges = true
#   ## Reset counters every interval (default=true)
#   delete_counters = true
#   ## Reset sets every interval (default=true)
#   delete_sets = true
#   ## Reset timings & histograms every interval (default=true)
#   delete_timings = true
#
#   ## Percentiles to calculate for timing & histogram stats
#   percentiles = [90]
#
#   ## separator to use between elements of a statsd metric
  metric_separator = "."
#
#  ## Parses extensions to statsd in the datadog statsd format
#  ## currently supports metrics and datadog tags.
#  ## http://docs.datadoghq.com/guides/dogstatsd/
# datadog_extensions = false
#
#  ## Parses distributions metric as specified in the datadog statsd format
#  ## https://docs.datadoghq.com/developers/metrics/types/?tab=distribution#definition
# datadog_distributions = false
#   ## Statsd data translation templates, more info can be read here:
#   ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md#graphite
#   # templates = [
#   #     "cpu.* measurement*"
#   # ]
  templates = [
    "pubsub.rest-proxy-request.count.* measurement.measurement.measurement.field"
  ]
#
#   ## Number of UDP messages allowed to queue up, once filled,
#   ## the statsd server will start dropping packets
  allowed_pending_messages = 1500000
#
#   ## Number of timing/histogram values to track per-measurement in the
#   ## calculation of percentiles. Raising this limit increases the accuracy
#   ## of percentiles but also increases the memory usage and cpu time.
#   percentile_limit = 1000
  #read_buffer_size = 1048576

# Collect statistics about itself
[[inputs.internal]]
  ## If true, collect telegraf memory stats.
  # collect_memstats = true

[[outputs.health]]
  ## Address and port to listen on.
  ##   ex: service_address = "http://localhost:8080"
  ##       service_address = "unix:///var/run/telegraf-health.sock"
  service_address = "http://:8080"

  ## The maximum duration for reading the entire request.
  # read_timeout = "5s"
  ## The maximum duration for writing the entire response.
  # write_timeout = "5s"

  ## Username and password to accept for HTTP basic authentication.
  # basic_username = "user1"
  # basic_password = "secret"

  ## Allowed CA certificates for client certificates.
  # tls_allowed_cacerts = ["/etc/telegraf/clientca.pem"]

  ## TLS server certificate and private key.
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"

  ## One or more check sub-tables should be defined, it is also recommended to
  ## use metric filtering to limit the metrics that flow into this output.
  ##
  ## When using the default buffer sizes, this example will fail when the
  ## metric buffer is half full.
  ##
  ## namepass = ["internal_write"]
  ## tagpass = { output = ["influxdb"] }
  ##
  ## [[outputs.health.compares]]
  ##   field = "buffer_size"
  ##   lt = 5000.0
  ##
  ## [[outputs.health.contains]]
  ##   field = "buffer_size"

gregarndt avatar Nov 30 '22 21:11 gregarndt

This drop starting at around 1.5M seems to match this config setting exactly:

#   ## Number of UDP messages allowed to queue up, once filled,
#   ## the statsd server will start dropping packets
  allowed_pending_messages = 1500000

Hipska avatar Dec 01 '22 09:12 Hipska

I appreciate you calling that out. We've been experimenting with that value, going from the original 10k default up to 1.5M, and Telegraf is still dropping packets. I'm at a loss as to whether this number needs to go up even more and we need to throw more hardware at it (so far an 8 CPU, 16 GB machine), or whether there is a bottleneck on the other side. It was suggested on Slack that the static pool of 5 parsers may be a bottleneck that keeps the channel from draining quickly enough.

gregarndt avatar Dec 01 '22 16:12 gregarndt

@gregarndt how many statsd metrics are you sending per second? I'm asking because if the rate of incoming lines is (persistently) higher than the rate of parsing, you will overflow sooner or later no matter what limit you set...
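
To make that concrete with hypothetical numbers: if messages arrive at 60,000/s but the parsers only sustain 50,000/s, the backlog grows by 10,000/s, so even a 1,500,000-message buffer fills in 1,500,000 / 10,000 = 150 seconds, after which every additional packet is dropped.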

srebhan avatar Dec 01 '22 16:12 srebhan

Do you also have some interesting metrics from the internal stats coming from the statsd input? That might also give some insights.

Hipska avatar Dec 01 '22 16:12 Hipska

@Hipska I'm about to add the number of pending messages to the (internal) stats.

srebhan avatar Dec 01 '22 16:12 srebhan

@gregarndt can you please test #12318, monitor the new pending_messages and the existing parse_time_ns fields in internal_statsd, and post a graph or some numbers here?

srebhan avatar Dec 01 '22 17:12 srebhan

@srebhan here are some new graphs for you:

[graph: pending_messages and parse_time_ns]

gregarndt avatar Dec 01 '22 20:12 gregarndt

Here's a perhaps more interesting view, from a window with 30 minutes of data instead of the one I posted earlier:

[graph: 30-minute window of the same metrics]

gregarndt avatar Dec 01 '22 20:12 gregarndt

So to me it seems like the parsing rate is much lower than the message rate. It is noteworthy that this single parse-time peak is around 50 seconds! sigh

Some things that would help:

  1. Can you please check the median parsing time, to eliminate the outlier?
  2. Can you please check your average and median message rate, i.e. the number of UDP messages that arrive in a certain time, e.g. per second?
  3. Is the system at its limit, i.e. is there still free CPU time? If there is, we can try to make the number of parsing threads a config option (see the sketch below).
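
For illustration, here is a minimal Go sketch of what a configurable parser pool draining the message channel could look like. The names and pool size are hypothetical; the plugin currently hard-codes the parser count, and this is not Telegraf's actual implementation:

package main

import (
	"fmt"
	"strings"
	"sync"
)

func main() {
	// parserCount would come from a new config option instead of being fixed.
	parserCount := 8
	in := make(chan string, 1024)
	var wg sync.WaitGroup

	// Start parserCount goroutines, each draining the shared channel.
	for i := 0; i < parserCount; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for line := range in {
				// Stand-in for parsing a statsd line like "name:1|c".
				name, _, _ := strings.Cut(line, ":")
				fmt.Printf("parser %d handled %s\n", id, name)
			}
		}(i)
	}

	for i := 0; i < 20; i++ {
		in <- fmt.Sprintf("metric%d:1|c", i)
	}
	close(in)
	wg.Wait()
}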

srebhan avatar Dec 02 '22 10:12 srebhan

I took a new 15-minute sample, posted below, that doesn't have an outlier. I got the metrics you're looking for (I used udp_packets_received, which is what I think you wanted for #2). I noticed that our datapoints come every 10 seconds (that's how often we flush them to InfluxDB).

[graph: 15-minute sample of pending_messages, parse_time_ns, and udp_packets_received]
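
One note on reading the numbers: assuming each datapoint covers a 10-second flush window, dividing the per-window count by 10 gives the per-second rate, e.g. a hypothetical 500,000 packets in one window would be about 50,000 packets/s.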

gregarndt avatar Dec 02 '22 18:12 gregarndt

Could any sort of backpressure from flushing to InfluxDB cause issues with ingestion? I know at one time there was a bug where those blocked each other, but I thought it was fixed.

gregarndt avatar Dec 02 '22 18:12 gregarndt

Anything else I could supply here? Happy to try a build with more parsers if you think that's the bottleneck.

gregarndt avatar Dec 05 '22 18:12 gregarndt

+1

liorfranko avatar Jan 05 '23 06:01 liorfranko

Anything else I could supply here? Happy to try a build with more parsers if you think that's the bottleneck.

After talking about this, it sounds like the current implementation in Telegraf is not fast enough for your use case. Are you aware of any other implementation that is, or one that you were using before?

powersj avatar Feb 15 '23 20:02 powersj