outputs.health: add option to change default state

Open kpdo opened this issue 9 months ago • 4 comments

Use Case

Currently, the default state of outputs.health is healthy. If Telegraf is stuck in a restart loop, outputs.health always returns healthy because there isn't enough time for the first check to be evaluated before Telegraf restarts again.

Expected behavior

I'd like there to be an option to change the default state to unhealthy.

Actual behavior

Currently, there is no option to change the default state.

Additional info

telegraf config to cause a restart loop:

[[outputs.health]]
  service_address = "http://:14004"
  namepass = ["mock"] 

  [[outputs.health.compares]]
    field = "foo"
    eq = 100.0 # should always fail

[[outputs.file]]
  namepass = ["mock"]
  files = ["/tmp/test.txt"] # file without write permission; causes telegraf to restart every 15s

[[inputs.mock]]
  interval = "10s"
  metric_name = "mock"
  [[inputs.mock.constant]]
    name = "foo"
    value = 42.0 

telegraf logs:

2025-03-18T07:21:16Z I! Starting Telegraf 1.32.1 brought to you by InfluxData the makers of InfluxDB
2025-03-18T07:21:16Z I! Available plugins: 235 inputs, 9 aggregators, 32 processors, 26 parsers, 62 outputs, 6 secret-stores
2025-03-18T07:21:16Z I! [outputs.health] Listening on http://[::]:14004
2025-03-18T07:21:16Z E! [agent] Failed to connect to [outputs.file], retrying in 15s, error was "open /tmp/test.txt: permission denied"
2025-03-18T07:21:31Z E! [telegraf] Error running agent: connecting output outputs.file: error connecting to output "outputs.file": open /tmp/test.txt: permission denied
2025-03-18T07:21:32Z I! Starting Telegraf 1.32.1 brought to you by InfluxData the makers of InfluxDB
2025-03-18T07:21:32Z E! [agent] Failed to connect to [outputs.file], retrying in 15s, error was "open /tmp/test.txt: permission denied"
2025-03-18T07:21:47Z E! [telegraf] Error running agent: connecting output outputs.file: error connecting to output "outputs.file": open /tmp/test.txt: permission denied
2025-03-18T07:21:47Z I! Starting Telegraf 1.32.1 brought to you by InfluxData the makers of InfluxDB
2025-03-18T07:21:47Z I! [outputs.health] Listening on http://[::]:14004
2025-03-18T07:21:47Z E! [agent] Failed to connect to [outputs.file], retrying in 15s, error was "open /tmp/test.txt: permission denied"
2025-03-18T07:22:02Z E! [telegraf] Error running agent: connecting output outputs.file: error connecting to output "outputs.file": open /tmp/test.txt: permission denied
2025-03-18T07:22:03Z I! Starting Telegraf 1.32.1 brought to you by InfluxData the makers of InfluxDB
2025-03-18T07:22:03Z I! [outputs.health] Listening on http://[::]:14004
2025-03-18T07:22:03Z E! [agent] Failed to connect to [outputs.file], retrying in 15s, error was "open /tmp/test.txt: permission denied"

kpdo avatar Mar 18 '25 07:03 kpdo

How about setting 425 Too Early or 503 Service Unavailable until at least one condition was evaluated? This would allow distinguishing between "not ready yet", "unhealthy" and "healthy"...

srebhan avatar Apr 03 '25 15:04 srebhan

Or we even allow the user to specify an HTTP status code?

srebhan avatar Apr 03 '25 15:04 srebhan

How about setting 425 Too Early or 503 Service Unavailable until at least one condition was evaluated? This would allow distinguishing between "not ready yet", "unhealthy" and "healthy"...

You mean generally, without the user configuring anything? I would be fine with it, but that's a breaking change. That's why I came up with the idea of making it configurable instead.

Also, distinguishing between "not ready yet", "unhealthy" and "healthy" sounds like a good idea, but from what I read "425 Too Early" wouldn't be appropriate. You would return a 425 if the client sends a request before the TLS handshake has been completed.

I think the only appropriate code for "not ready yet" would be "503 Service Unavailable", so the same as "unhealthy". You wouldn't be able to distinguish between "not ready yet" and "unhealthy" with that, but that's fine with me.

kpdo avatar Apr 07 '25 12:04 kpdo

Then we need the option indeed. Someone needs to implement an option like initial_status_code or similar and put up a PR...

srebhan avatar Apr 14 '25 09:04 srebhan