inputs.consul: Connections to consul reach maximum allowed

Open thias opened this issue 2 years ago • 8 comments

Relevant telegraf.conf

[[inputs.consul]]
address = "127.0.0.1:8500"
namedrop = ["consul_health_checks"]
scheme = "http"
tagexclude = ["project_*"]
[inputs.consul.tags]
datacenter = "foo"
influxdb_database = "consul"

Logs from Telegraf

2022-10-31T14:17:38Z W! [inputs.consul] Use of deprecated configuration: 'metric_version = 1'; please update to 'metric_version = 2'

System info

telegraf-1.24.2-1.x86_64 rpm on RHEL 7 & 8

Docker

No response

Steps to reproduce

  1. lsof -nP | grep 127.0.0.1:8500 | cut -d ' ' -f 1 | sort | uniq -c
  2. kill -HUP `/sbin/pidof /usr/bin/telegraf`

Repeat the above over and over.

Expected behavior

The initial values should stay stable, like they do when no SIGHUP is being sent:

     24 consul
     12 telegraf

... re-running the lsof command shows no increase, even after hours or days.

Actual behavior

After each SIGHUP, both counts grow after a few seconds:

     36 consul
     24 telegraf
     48 consul
     39 telegraf
     60 consul
     52 telegraf
     72 consul
     65 telegraf

Additional info

This seems to have already been reported in #7554 but closed as having gone away. We have been seeing this behavior with various versions of consul (1.11, 1.12 and the latest 1.13) as well as with various versions of telegraf (1.22 and now 1.24 to check it was reproducible with the latest version). The current consul default is to allow only 200 connections from the same IP address. Because SIGHUP is being sent to telegraf each day after rotating logs in our environment, we sometimes end up with consul no longer accepting any connections from 127.0.0.1, triggering our monitoring alerts and breaking some scripts. Restarting telegraf fixes the problem.

thias avatar Oct 31 '22 14:10 thias

Because SIGHUP is being sent to telegraf each day after rotating logs in our environment [...] Restarting telegraf fixes the problem.

Out of curiosity, why are you using SIGHUP over say a service restart?

After each SIGHUP, both counts grow after a few seconds:

When Telegraf gets a SIGHUP it will attempt to stop service inputs and running outputs. The consul plugin is not a service input, so there is no attempt at any cleanup. Restarting helps since it is an entirely new process. I also don't see a close method in their API agent.
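To illustrate why a missing cleanup step could produce the growth reported above, here is a minimal Go sketch. It is not Telegraf's or Consul's actual code, and the thread does not confirm this as the cause. It assumes that each reload builds a fresh `http.Transport` with no idle timeout and abandons the old one without closing it. The names `pollOnce` and `simulateReloads` are hypothetical, and a local `httptest` server stands in for the consul agent.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// pollOnce stands in for one plugin instance created by a config reload
// (hypothetical). It builds its own transport, performs a single request,
// and returns without any cleanup.
func pollOnce(url string) (*http.Transport, error) {
	tr := &http.Transport{} // zero IdleConnTimeout: idle connections never expire
	resp, err := (&http.Client{Transport: tr}).Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body) // drain the body so the connection can be kept alive
	return tr, err
}

// simulateReloads performs n simulated reloads against a local test server
// and returns how many connections the server holds open after each one.
func simulateReloads(n int) ([]int64, error) {
	var open int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, _ *http.Request) { fmt.Fprint(w, "ok") }))
	// Count connections as the server accepts and closes them.
	srv.Config.ConnState = func(_ net.Conn, s http.ConnState) {
		switch s {
		case http.StateNew:
			atomic.AddInt64(&open, 1)
		case http.StateClosed:
			atomic.AddInt64(&open, -1)
		}
	}
	srv.Start()
	defer srv.Close()

	var abandoned []*http.Transport // referenced only to keep the sketch deterministic
	counts := make([]int64, 0, n)
	for i := 0; i < n; i++ {
		tr, err := pollOnce(srv.URL)
		if err != nil {
			return nil, err
		}
		abandoned = append(abandoned, tr)
		counts = append(counts, atomic.LoadInt64(&open))
	}
	return counts, nil
}

func main() {
	counts, err := simulateReloads(4)
	if err != nil {
		panic(err)
	}
	for i, c := range counts {
		fmt.Printf("after reload %d: %d open connection(s)\n", i+1, c)
	}
}
```

Under these assumptions the server-side count rises by one per simulated reload and never falls, which resembles the lsof numbers in the report. Whether the real plugin behaves this way depends on how its client and transport are constructed.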

powersj avatar Oct 31 '22 15:10 powersj

Out of curiosity, why are you using SIGHUP over say a service restart?

Mostly out of habit, I guess. Daemons that can log to files often support reopening them after receiving a signal, usually USR1 or HUP, and with HUP telegraf does indeed reopen them (in addition to reloading its config, so we also send HUP when we change its configuration). And although telegraf seems to be able to rotate files on its own, we tend to prefer having a single mechanism system-wide, so that things like filename suffixes and compression methods are consistent. We could do a full service restart, but that's usually overkill: Depending on the service it could mean some interruption and downtime, which a simple file reopening doesn't cause. And we monitor running processes, so a full restart would cause a race condition where we could detect telegraf as not running.

After digging some more, I just saw that the original telegraf rpm has a "copytruncate" based logrotate entry. This always seems sub-optimal, but it could mean that there's a reason for it? Is it just luck that both the agent logfile and all outputs.file get reopened with a SIGHUP?

Maybe I should request for SIGUSR1 to make telegraf reopen its log files? That would sweep this issue under the rug for me :smile:

In any case, I think the original issue should still be considered a bug, as it can also be triggered by running systemctl reload telegraf, so frequent configuration updates could also lead to it. If it's an issue/limitation in some consul client library/binding, just let me know and I can report the issue against it.

thias avatar Oct 31 '22 16:10 thias

In any case, I think the original issue should still be considered a bug, as it can also be triggered by running systemctl reload telegraf, so frequent configuration updates could also lead to it. If it's an issue/limitation in some consul client library/binding, just let me know and I can report the issue against it.

Please do. I would like to see how upstream responds to this and if there is something we should be doing differently. Also please do let us know the upstream issue # here. Thanks!

powersj avatar Oct 31 '22 16:10 powersj

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Page. Thank you!

telegraf-tiger[bot] avatar Nov 14 '22 18:11 telegraf-tiger[bot]

Given how GitHub reports "1k" open issues for consul, I didn't really have high hopes :disappointed: This bug clearly still exists as of current versions of telegraf and consul.

thias avatar Nov 22 '22 14:11 thias

Given how GitHub reports "1k" open issues for consul, I didn't really have high hopes :disappointed: This bug clearly still exists as of current versions of telegraf and consul.

Thanks for filing the issue!

I think one step we might be able to take here is to convert the consul input into a service input, which has specific Start() and Stop() functions. I believe the SIGHUP should then call Stop(), letting us close the connection.

@srebhan what do you think?

powersj avatar Nov 28 '22 18:11 powersj

@powersj I agree, we should add a Start() and Stop() implementation, making this a mix between service and pull plugin.
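A rough Go sketch of what such a Start()/Stop() pair could look like follows. It is an assumption-laden illustration, not the plugin's real implementation: `lifecycleInput` is a simplified, hypothetical interface (not Telegraf's actual one), `httpPoller` is a hypothetical plugin, and it assumes the plugin owns the `*http.Transport` its client uses, so that Stop() can call the standard library's `CloseIdleConnections`.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// lifecycleInput is a hypothetical, simplified lifecycle interface used only
// for this sketch; it is not Telegraf's actual plugin interface.
type lifecycleInput interface {
	Start() error
	Gather() error
	Stop()
}

// httpPoller is a hypothetical pull-style plugin that owns its HTTP transport.
type httpPoller struct {
	URL       string
	transport *http.Transport
	client    *http.Client
}

// Start creates the transport and client once per plugin instance.
func (p *httpPoller) Start() error {
	p.transport = &http.Transport{}
	p.client = &http.Client{Transport: p.transport, Timeout: 5 * time.Second}
	return nil
}

// Gather issues one request and drains the response body.
func (p *httpPoller) Gather() error {
	resp, err := p.client.Get(p.URL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

// Stop releases the keep-alive connections pooled by the transport.
func (p *httpPoller) Stop() {
	if p.transport != nil {
		p.transport.CloseIdleConnections()
	}
}

// runLifecycle reports the server-side open connection count before and after Stop.
func runLifecycle() (beforeStop, afterStop int64, err error) {
	var open int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, _ *http.Request) { fmt.Fprint(w, "ok") }))
	srv.Config.ConnState = func(_ net.Conn, s http.ConnState) {
		switch s {
		case http.StateNew:
			atomic.AddInt64(&open, 1)
		case http.StateClosed:
			atomic.AddInt64(&open, -1)
		}
	}
	srv.Start()
	defer srv.Close()

	var plugin lifecycleInput = &httpPoller{URL: srv.URL}
	if err = plugin.Start(); err != nil {
		return 0, 0, err
	}
	if err = plugin.Gather(); err != nil {
		return 0, 0, err
	}
	beforeStop = atomic.LoadInt64(&open)

	plugin.Stop()
	// The server notices the close asynchronously, so poll briefly.
	deadline := time.Now().Add(2 * time.Second)
	for atomic.LoadInt64(&open) > 0 && time.Now().Before(deadline) {
		time.Sleep(10 * time.Millisecond)
	}
	return beforeStop, atomic.LoadInt64(&open), nil
}

func main() {
	before, after, err := runLifecycle()
	if err != nil {
		panic(err)
	}
	fmt.Printf("open connections before Stop: %d, after Stop: %d\n", before, after)
}
```

The design choice here is that the plugin keeps a reference to its own transport, because an `http.Client` alone gives no handle for releasing pooled connections when a custom transport is hidden behind another library. Whether the consul client library exposes such a handle would need to be checked against its documentation.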

srebhan avatar Nov 29 '22 13:11 srebhan

Hmm, looking a bit more into the issue, I cannot see how we should clean up that connection. Furthermore, this API only issues HTTP requests, which should not keep a connection open IIRC... So IMO this leaves us waiting for upstream to figure out what we should do...

srebhan avatar Dec 16 '22 10:12 srebhan