outputs.health: add option to change default state
Use Case
Currently, the default state of outputs.health is healthy. If Telegraf is stuck in a restart loop, outputs.health always returns healthy, because Telegraf restarts again before the first check can be evaluated.
Expected behavior
I'd like there to be an option to change the default state to unhealthy.
Actual behavior
Currently, there is no option to change the default state.
Additional info
telegraf config to cause a restart loop:
[[outputs.health]]
  service_address = "http://:14004"
  namepass = ["mock"]
  [[outputs.health.compares]]
    field = "foo"
    eq = 100.0 # should always fail

[[outputs.file]]
  namepass = ["mock"]
  files = ["/tmp/test.txt"] # file without write permission; causes telegraf to restart every 15s

[[inputs.mock]]
  interval = "10s"
  metric_name = "mock"
  [[inputs.mock.constant]]
    name = "foo"
    value = 42.0
telegraf logs:
2025-03-18T07:21:16Z I! Starting Telegraf 1.32.1 brought to you by InfluxData the makers of InfluxDB
2025-03-18T07:21:16Z I! Available plugins: 235 inputs, 9 aggregators, 32 processors, 26 parsers, 62 outputs, 6 secret-stores
2025-03-18T07:21:16Z I! [outputs.health] Listening on http://[::]:14004
2025-03-18T07:21:16Z E! [agent] Failed to connect to [outputs.file], retrying in 15s, error was "open /tmp/test.txt: permission denied"
2025-03-18T07:21:31Z E! [telegraf] Error running agent: connecting output outputs.file: error connecting to output "outputs.file": open /tmp/test.txt: permission denied
2025-03-18T07:21:32Z I! Starting Telegraf 1.32.1 brought to you by InfluxData the makers of InfluxDB
2025-03-18T07:21:32Z E! [agent] Failed to connect to [outputs.file], retrying in 15s, error was "open /tmp/test.txt: permission denied"
2025-03-18T07:21:47Z E! [telegraf] Error running agent: connecting output outputs.file: error connecting to output "outputs.file": open /tmp/test.txt: permission denied
2025-03-18T07:21:47Z I! Starting Telegraf 1.32.1 brought to you by InfluxData the makers of InfluxDB
2025-03-18T07:21:47Z I! [outputs.health] Listening on http://[::]:14004
2025-03-18T07:21:47Z E! [agent] Failed to connect to [outputs.file], retrying in 15s, error was "open /tmp/test.txt: permission denied"
2025-03-18T07:22:02Z E! [telegraf] Error running agent: connecting output outputs.file: error connecting to output "outputs.file": open /tmp/test.txt: permission denied
2025-03-18T07:22:03Z I! Starting Telegraf 1.32.1 brought to you by InfluxData the makers of InfluxDB
2025-03-18T07:22:03Z I! [outputs.health] Listening on http://[::]:14004
2025-03-18T07:22:03Z E! [agent] Failed to connect to [outputs.file], retrying in 15s, error was "open /tmp/test.txt: permission denied"
How about setting 425 Too Early or 503 Service Unavailable until at least one condition was evaluated? This would allow distinguishing between "not ready yet", "unhealthy" and "healthy"...
Or we could even allow the user to specify an HTTP status code?
> How about setting 425 Too Early or 503 Service Unavailable until at least one condition was evaluated? This would allow distinguishing between "not ready yet", "unhealthy" and "healthy"...
You mean generally without the user configuring anything? I would be fine with it but that's a breaking change. That's why I came up with the idea of making it configurable instead.
Also, distinguishing between "not ready yet", "unhealthy" and "healthy" sounds like a good idea, but from what I read "425 Too Early" wouldn't be appropriate. You would return a 425 if the client sends a request before the TLS handshake has been completed.
I think the only appropriate code for "not ready yet" would be "503 Service Unavailable", so the same as "unhealthy". You wouldn't be able to distinguish between "not ready yet" and "unhealthy" with that, but that's fine with me.
Then we need the option indeed. Someone needs to implement an option like initial_status_code or similar and put up a PR...
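For illustration, here is a rough sketch of how such an option might look in the config from the report. The name initial_status_code is only the placeholder floated above, not an existing outputs.health setting, and the final name and semantics would be up to whoever writes the PR.

```toml
[[outputs.health]]
  service_address = "http://:14004"
  namepass = ["mock"]

  ## HYPOTHETICAL option (name taken from the suggestion in this thread):
  ## HTTP status code to return until the first check has been evaluated.
  ## Assumption: leaving it unset would keep today's behavior (healthy).
  initial_status_code = 503

  [[outputs.health.compares]]
    field = "foo"
    eq = 100.0
```

With a setting along these lines, an instance stuck in the restart loop would report 503 instead of healthy, while existing configs would keep their current behavior.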