Telegraf logs Errors to Eventlog if started as Service on Windows

Open knollet opened this issue 1 year ago • 2 comments

Relevant telegraf.conf

logtarget = "file"
logfile = "/Program Files/Telegraf/telegraf.log"
logfile_rotation_interval = "24h"
logfile_rotation_max_size = "5MB"
logfile_rotation_max_archives = 5
...
statefile = "/Program Files/Telegraf/statefile.json"

and a statefile.json containing only zero bytes

Logs from Telegraf

If started from terminal, telegraf.log contains

2024-10-22T14:41:31Z E! [telegraf] Error running agent: unmarshalling states failed: invalid character '\x00' looking for beginning of value
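
For anyone wanting to reproduce the message without a broken statefile: the text after "unmarshalling states failed:" is what Go's encoding/json reports when the input starts with a NUL byte. A minimal sketch follows; decodeStates is a hypothetical stand-in for the state-loading step, not Telegraf's actual code, and the wrapping prefix is simply copied from the log line above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeStates is a hypothetical helper (not Telegraf's real function).
// It decodes a statefile's content and wraps any JSON error with the
// same prefix seen in the log line.
func decodeStates(data []byte) (map[string]json.RawMessage, error) {
	var states map[string]json.RawMessage
	if err := json.Unmarshal(data, &states); err != nil {
		return nil, fmt.Errorf("unmarshalling states failed: %w", err)
	}
	return states, nil
}

func main() {
	// A statefile that contains only zero bytes, as described in this report.
	_, err := decodeStates(make([]byte, 16))
	fmt.Println(err)
}
```

The first zero byte is rejected by the JSON scanner, so the length of the zero-filled file does not matter.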

If started through services.msc, telegraf.log doesn't contain the error message, as it goes to the Eventlog instead.

System info

Telegraf 1.32.1

Expected behavior

Not logging anything to the Eventlog, as we didn't specify "eventlog" anywhere

Actual behavior

Logs to Eventlog if started as a service

Additional info

No response

knollet, Oct 22 '24 15:10

@knollet this behavior was introduced in v1.32.0, let me explain the background. When Telegraf starts up, there are certain steps required before being able to set up logging (e.g. loading the configuration), and those steps can fail. So we decided to log all "startup errors" to the Eventlog, as this is the only resource known to be available during startup, so as not to lose information if Telegraf is started as a service. This is the same for Linux's systemd startup, where we output all those errors to stderr for the service manager to pick up potential errors. The idea is to provide a single point where you can see what went wrong with the service. So this is the expected behavior and not a bug in my view...

Does that make sense?

srebhan, Oct 24 '24 09:10

No, sorry, it doesn't. (not to me, at least)

You say that when telegraf is starting up, there are some steps required before it can log to the file.

Why is that only the case if it is started as a service and not when started on the terminal? If I start it manually from the terminal (i.e. c:\...telegraf> telegraf) it logs the error into the file.

I can't see why the problem should exist in only one of the two cases.

Also: Why is that only the case if the message to be logged is an error? This problem isn't there for informational messages.

(Some of which are logged to the file, actually, parallel to being logged onto the terminal, e.g. all the "I! Loading config: ..."-lines)

Obviously somewhere there is an if (running_as_service && msg.type == ERR) { log_to_eventlog(msg) } else { log_to_file(msg) }. That IMHO should be

if (running_as_service && msg.type == ERR) { log_to_eventlog(msg) }
// no else!
log_to_file(msg) // logging to file if possible
log_to_stdout_err(msg)

In conclusion, I would say,

  • the principle of least surprise is being violated here as there is a config option to log to the eventlog which we explicitly didn't use and the message went to the eventlog anyway (and only to the eventlog)
  • Error messages are the most important messages, but when Telegraf is started as a service they are the hardest to find. Informational messages ("Loading config ...") OTOH, if started manually on a terminal, are just logged everywhere: terminal and logfile...
  • "telegraf.log may not be available" as a reason for the errors only being logged to the eventlog is flaky at best, and in our case just weird, because there has already been logging to telegraf.log at that point.

knollet, Oct 29 '24 16:10