telegraf
[outputs.postgresql] add setting to retry connection
Relevant telegraf.conf
[[outputs.postgresql]]
connection = "host=timescaledb1.intra,host=timescaledb2.intra user=telegraf password=XXX dbname=telegraf sslmode=disable target_session_attrs=read-write"
schema = "public"
create_templates = [
'''CREATE TABLE {{ .table }} ({{ .columns }})''',
'''SELECT create_hypertable({{ .table|quoteLiteral }}, 'time', chunk_time_interval => INTERVAL '24 hours')''',
]
[[outputs.influxdb]]
urls = ["http://influxdb1.intra"]
database = "telegraf"
username = "telegraf"
password = "secret"
retention_policy = ""
write_consistency = "any"
timeout = "5s"
[outputs.influxdb.tagdrop]
influxdb_database = ["*"]
Logs from Telegraf
Nov 29 07:37:43 collector1 telegraf[963895]: 2023-11-29T06:37:43Z E! [outputs.postgresql] PG connect failed - map[err:failed to connect to `host=timescaledb1.intra user=telegraf database=telegraf`: hostname resolving error (lookup timescaledb1.intra on 10.10.10.53:53: no such host)]
Nov 29 07:37:43 collector1 telegraf[963895]: 2023-11-29T06:37:43Z E! [outputs.postgresql] Couldn't connect to server
Nov 29 07:37:43 collector1 telegraf[963895]: failed to connect to `host=timescaledb1.intra user=telegraf database=telegraf`: hostname resolving error (lookup timescaledb1.intra on 10.10.10.53:53: no such host)
Nov 29 07:37:43 collector1 telegraf[963895]: 2023-11-29T06:37:43Z E! [agent] Failed to connect to [outputs.postgresql], retrying in 15s, error was "failed to connect to `host=timescaledb1.intra user=telegraf database=telegraf`: hostname resolving error (lookup timescaledb1.intra on 10.10.10.53:53: no such host)"
System info
Telegraf 1.28.2, Debian 11 Bullseye
Docker
No response
Steps to reproduce
- Create multiple outputs (influx, timescaledb, ...)
- In one output, configure a server name that cannot be resolved by DNS.
- Restart telegraf.service. From that moment, Telegraf stops sending data to all outputs.
Expected behavior
An error resolving the server name in one output should not affect the functionality of the other outputs.
Actual behavior
For some reason, DNS failed to resolve timescaledb1.intra to an IP address (see [[outputs.postgresql]]), and after restarting telegraf.service, Telegraf stopped sending data to all outputs.
Additional info
No response
In general, if Telegraf cannot connect to an output, it will not start. This is the expected behavior, as it catches cases where a user has supplied a wrong password or has otherwise misconfigured the output connection.
We are happy to see PRs that add per-plugin exceptions, disabled by default, where the plugin keeps trying to reconnect, usually on each write attempt. We can use this issue for the postgresql output.
@idahomst can you please test the binary in PR #15073, available once CI has finished the tests? Using startup_error_behavior = "ignore" or startup_error_behavior = "retry" should do what you requested. Let me know if the PR fixes the issue!
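For anyone testing the same build, here is a minimal sketch of how the setting might be added to the configuration from the report. It assumes the option goes directly inside the [[outputs.postgresql]] table; the comments describing each value are my reading of the option and should be checked against the PR.

```toml
[[outputs.postgresql]]
  connection = "host=timescaledb1.intra,host=timescaledb2.intra user=telegraf password=XXX dbname=telegraf sslmode=disable target_session_attrs=read-write"
  schema = "public"

  ## Assumed semantics (verify against the PR):
  ##   "retry"  - Telegraf starts anyway and tries to connect this output
  ##              again on later write cycles.
  ##   "ignore" - Telegraf starts anyway and leaves this output disabled;
  ##              the other outputs keep running.
  startup_error_behavior = "retry"
```

With either value, a DNS failure for timescaledb1.intra at startup should no longer block the [[outputs.influxdb]] output defined alongside it.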
Hi @srebhan, I tested both "retry" and "ignore", and both work great. Thank you! I look forward to finding Telegraf v1.31.0 in the Debian repositories. ;)