telegraf icon indicating copy to clipboard operation
telegraf copied to clipboard

windows: As a service, telegraf logs do not capture error message on failure

Open cwjh opened this issue 1 year ago • 8 comments

Relevant telegraf.conf

[[inputs.nvidia_smi]]
  bin_path = "C:\\Windows\\System32\\nvidia-smi.exe"

Logs from Telegraf

2023-10-19T06:31:47Z I! Starting Telegraf 1.28.2 brought to you by InfluxData the makers of InfluxDB
2023-10-19T06:31:47Z I! Available plugins: 240 inputs, 9 aggregators, 29 processors, 24 parsers, 59 outputs, 5 secret-stores
2023-10-19T06:31:47Z I! Loaded inputs: mem nvidia_smi system tail win_perf_counters
2023-10-19T06:31:47Z I! Loaded aggregators: 
2023-10-19T06:31:47Z I! Loaded processors: 
2023-10-19T06:31:47Z I! Loaded secretstores: 
2023-10-19T06:31:47Z I! Loaded outputs: influxdb
2023-10-19T06:31:47Z I! Tags enabled: redacted
2023-10-19T06:31:47Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"redacted", Flush Interval:10s
2023-10-19T06:31:47Z D! [agent] Initializing plugins

System info

Telegraf 1.28.2, Windows 11

Docker

No response

Steps to reproduce

  1. Install telegraf as service and configure telegraf.conf with [[inputs.nvidia_smi]] enabled.
  2. On non-Nvidia equipped machines (w/o the nvidia-smi.exe file), start the telegraf service.
  3. Attempt to stop the telegraf service.

Expected behavior

With the inputs.nvidia_smi enabled in the config:

  • For nvidia equipped system it should start and collects metrics normally.

  • For non-nvidia equipped system, the nvidia-smi.exe is not found and the plugin should fail.

  • Telegraf service should write error in the log file, indicating nvidia-smi.exe is not found and the plugin failed to load.

  • Telegraf service should continue with other plugins and collects and sends metrics.

  • Telegraf service should stop normally and logs relevant info when instructed to stop.

Actual behavior

  • For non-nvidia equipped system and the nvidia-smi.exe is not found.
  • Telegraf service does not write error in the log file.
  • With debug turned off, last entry in log is: I! [agent] Config: Interval:10s, Quiet:false, Hostname:"", Flush Interval:10s
  • With debug turned on, last entry in log is: D! [agent] Initializing plugins. No additional log entries are written.
  • No metrics is collected or sent.
  • When attempting to stop Telegraf service, an error is shown "Windows could not stop the Telegraf Data Collector Service service on . Error 1067: The process terminated unexpectedly.
  • No additional stopping log entries are written.

Additional info

We use the same config files and deploy to all workstations, some with nvidia cards, some don't. Previously with telegraf 1.17, this is fine for both type of workstations. Ones without a nvidia card will simply error on the plugin but still collects metrics. But since we've upgraded to 1.26 (and also tested latest 1.28), workstations without a nvidia card will show as if the telegraf has started without error (no error in log), but the actual service is hung and no metrics are collected/sent.

If a nvidia-smi.exe file is dropped into the c:\windows\system32\ location, the 1.2x telegraf service will runs, collects, sends, and stops normally (with error logged), just the same as 1.17 behaved. Even if the nvidia-smi.exe is a dummy 0 byte file. It appears telegraf can handle errors from plugins gracefully, but unable to handle file-not-found situation gracefully.

Also, with 1.2x, running telegraf from command line (with or without --debug): telegraf.exe --config 'c:\program files\telegraf\telegraf.conf', the log will correctly show "E! [telegraf] Error running agent: could not initialize input inputs.nvidia_smi: nvidia-smi not found in "C:\Windows\System32\nvidia-smi.exe" and not in PATH; please make sure nvidia-smi is installed and/or is in PATH". But other plugins and metrics are still not collected/sent.

cwjh avatar Oct 19 '23 07:10 cwjh

@cwjh this is not how Telegraf (currently) works. Telegraf will not silently ignore misconfigured plugins to ensure that you are at a defined state once Telegraf started, namely that all plugins work as expected. There are plans to add defined startup strategies like "retry N times than fail" or "retry but never fail" but this also targets "retryable" errors. If you configure a plugin for stuff that is not there it is an error in Telegraf's view!

This being said, why don't you just remove the plugin or start Telegraf with --input-filter leaving out the NVidia plugin?

srebhan avatar Oct 19 '23 17:10 srebhan

Hi, thanks for the clarification! That makes sense Telegraf should error with misconfigured plugins. Although it'd be nice to have for telegraf to continue with other plugins, I understand the rationale and accept that Expected Behavior bullet no. 4 can be crossed out.

That being said, I believe it's still a bug such that Telegraf (installed as service) does "die" silently when the nvidia-smi.exe is not found at service start-up, with no error written to the log file. And when instructed to stop, it will not gracefully stop but terminates unexpectedly. See Actual behaviors and Expected behaviors (excluding no. 4) listed above.

In contrast, if nvidia-smi.exe is deleted AFTER telegraf service has started successfully, it'll write in log "Error in plugins" for file does not exist. Also if running telegraf from console, it'll also write in log ""Error running agent". These meet the Expected Behavior when encountered error.

As for changing config, it's certainly possible, though the Group Policy used to deploy the config would need to be customized to deploy different configs based on computer, and also double the number of configs based on w/ or w/o nvidia card. It may be easier with flavored start up commands but also introduce differences in service commands installed.

cwjh avatar Oct 24 '23 10:10 cwjh

@cwjh,

It is not clear to me at all what needs to change. The binary path is checked during init of the plugin, if there is an error it is returned and sent through the agent. Errors during init will stop the client. There is no graceful shutdown to even do at this point:

$ ./telegraf --config config.toml
2023-10-24T14:13:46Z I! Loading config: config.toml
2023-10-24T14:13:46Z I! Starting Telegraf 1.29.0-ab24e78c brought to you by InfluxData the makers of InfluxDB
2023-10-24T14:13:46Z I! Available plugins: 241 inputs, 9 aggregators, 29 processors, 24 parsers, 60 outputs, 5 secret-stores
2023-10-24T14:13:46Z I! Loaded inputs: exec nvidia_smi
2023-10-24T14:13:46Z I! Loaded aggregators: 
2023-10-24T14:13:46Z I! Loaded processors: enum
2023-10-24T14:13:46Z I! Loaded secretstores: 
2023-10-24T14:13:46Z I! Loaded outputs: file
2023-10-24T14:13:46Z I! Tags enabled: 
2023-10-24T14:13:46Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"", Flush Interval:10s
2023-10-24T14:13:46Z D! [agent] Initializing plugins
2023-10-24T14:13:46Z E! [telegraf] Error running agent: could not initialize input inputs.nvidia_smi: nvidia-smi not found in "/usr/bin/foobar-nvidia" and not in PATH; please make sure nvidia-smi is installed and/or is in PATH
$

powersj avatar Oct 24 '23 14:10 powersj

It appears to only affect when installed and executed as a Windows service, no error will be logged in the log file.

2023-10-19T06:31:47Z I! Starting Telegraf 1.28.2 brought to you by InfluxData the makers of InfluxDB
2023-10-19T06:31:47Z I! Available plugins: 240 inputs, 9 aggregators, 29 processors, 24 parsers, 59 outputs, 5 secret-stores
2023-10-19T06:31:47Z I! Loaded inputs: mem nvidia_smi system tail win_perf_counters
2023-10-19T06:31:47Z I! Loaded aggregators: 
2023-10-19T06:31:47Z I! Loaded processors: 
2023-10-19T06:31:47Z I! Loaded secretstores: 
2023-10-19T06:31:47Z I! Loaded outputs: influxdb
2023-10-19T06:31:47Z I! Tags enabled: redacted
2023-10-19T06:31:47Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"redacted", Flush Interval:10s
2023-10-19T06:31:47Z D! [agent] Initializing plugins
(^^^ this above would be the last entry in log when Telegraf ran as installed Services on Windows. No entry for error or anything else.)

In contrast, when executing telegraf from the command prompt, it does log the error entry for the missing nvidia-smi.exe.

cwjh avatar Oct 27 '23 00:10 cwjh

The way that Telegraf is executed via the CLI and via a Winodws appears very different. The Windows-specific Run() has two paths:

1 - when not run as a service, the usual stop channel and reload loop are called. This is the same for posix systems. 2 - executes a call to the service's Run function

It does seem that when there is a reported error, that the error is not making its way back into the logs.

Happy to have someone much more knowledgeable about Windows and services put up a PR with a fix.

powersj avatar Oct 27 '23 17:10 powersj

During lunch I've been reading old golang issues:

  • https://github.com/golang/go/issues/40157: final log message is not shown, points to next issue
  • https://github.com/golang/go/issues/40167: this was fixed way back in go1.16

Both descriptions line up with what you are reporting. I went through the steps to reproduce on #40160 and was able to see the logs successfully when stopping. I would suggest someone look at that example and see how it compares to how Telegraf launches the service (e.g. do we actually close the event logger).

powersj avatar Oct 27 '23 18:10 powersj

@cwjh Are you noticing any errors in the Windows event log?

I think I am experiencing some form of this bug with our production deployment of Telegraf. Our issue stems around when Windows updates are applied, sometimes the telegraf agent does not seem to have enough time to start during those reboots and it will nearly silently fail. We have every possible safeguard (all "recovery" options set to maximum retries, delayed start set up, the service itself depending on the network being available, etc), however the Telegraf service still manages to sometimes fail without any error messages to the telegraf log.

I say that Telegraf fails nearly silently because although we get no telegraf logs, we do see a single log in the Windows event log: https://community.influxdata.com/t/telegraf-service-doesnt-start-after-windows-updates-error-7009-a-timeout-was-reached/29379

Doing some research I believe that these issues could have a fix which was recently merged in with the windows_exporter project, it revolves around calling logging and the service portions as early as possible in the initialization process:

https://github.com/prometheus-community/windows_exporter/pull/1047

gr8hound avatar Nov 29 '23 15:11 gr8hound

@cwjh and @gr8hound could you please try the binary in PR #15234, available as soon as CI finished the tests, and let me know if this fixes the issue!

srebhan avatar Apr 25 '24 17:04 srebhan

@cwjh and @gr8hound please check the binary in PR #15346 and let me know if this fixes the issue!

srebhan avatar May 13 '24 20:05 srebhan