Agent fails to start if log and trace configurations are omitted.
Describe the bug
The CloudWatch Agent fails to start if log and trace configurations are omitted. The agent assumes that at least one of the two exists.
Details
When config-translator is given an amazon-cloudwatch-agent.json without any tracing configurations, it will first generate an amazon-cloudwatch-agent.yaml that contains null and then delete it.
https://github.com/aws/amazon-cloudwatch-agent/blob/v1.300045.0/cmd/config-translator/translator.go#L130
https://github.com/aws/amazon-cloudwatch-agent/blob/v1.300045.0/translator/cmdutil/translatorutil.go#L237
start-amazon-cloudwatch-agent does not check if amazon-cloudwatch-agent.yaml is deleted before calling amazon-cloudwatch-agent ... -otelconfig {...}/amazon-cloudwatch-agent.yaml.
https://github.com/aws/amazon-cloudwatch-agent/blob/v1.300045.0/cmd/start-amazon-cloudwatch-agent/path.go#L68-L74
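One possible launcher-side guard is sketched below. It is not the code in start-amazon-cloudwatch-agent; `buildAgentArgs`, `fileExists`, and the injected `exists` callback are hypothetical names. The sketch adds `-otelconfig` only when the YAML file is still present after translation. Whether the agent then starts cleanly without the flag is not established by this report.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// fileExists reports whether path can be stat'ed.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// buildAgentArgs assembles the agent's arguments from the runtime directory.
// The -otelconfig flag is appended only when the YAML file exists, because
// config-translator may have deleted it. Hypothetical helper, for illustration.
func buildAgentArgs(runtimeDir string, exists func(string) bool) []string {
	args := []string{
		"-config", filepath.Join(runtimeDir, "amazon-cloudwatch-agent.toml"),
		"-envconfig", filepath.Join(runtimeDir, "env-config.json"),
		"-pidfile", filepath.Join(runtimeDir, "amazon-cloudwatch-agent.pid"),
	}
	otel := filepath.Join(runtimeDir, "amazon-cloudwatch-agent.yaml")
	if exists(otel) {
		args = append(args, "-otelconfig", otel)
	}
	return args
}

func main() {
	fmt.Println(buildAgentArgs("/run/amazon-cloudwatch-agent", fileExists))
}
```

Passing the existence check in as a callback keeps the argument logic testable without touching the real filesystem.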
When the CloudWatch Agent attempts to read the various config files, it assumes amazon-cloudwatch-agent.yaml will always exist if no logging configurations are specified.
https://github.com/aws/amazon-cloudwatch-agent/blob/v1.300045.0/cmd/amazon-cloudwatch-agent/amazon-cloudwatch-agent.go#L309-L332
The path for amazon-cloudwatch-agent.yaml is then passed to an OpenTelemetry configuration provider. When the provider attempts to read the file, it throws a not found error.
2024-08-28T06:06:41Z E! [telegraf] Error running agent: cannot resolve the configuration: cannot retrieve the configuration: unable to read the file file:/run/amazon-cloudwatch-agent/amazon-cloudwatch-agent.yaml: open /run/amazon-cloudwatch-agent/amazon-cloudwatch-agent.yaml: no such file or directory
Steps to reproduce
Start the CloudWatch Agent with log and trace configurations omitted.
What did you expect to see?
The agent starts and keeps running.
What did you see instead?
The agent crashes on startup with the file-not-found error shown above.
What version did you use?
v1.300045.0
What config did you use?
amazon-cloudwatch-agent.json
{
  "agent": {
    "debug": true,
    "logfile": "/var/log/amazon-cloudwatch-agent/amazon-cloudwatch-agent.log",
    "region": "ap-northeast-1"
  }
}
Environment
OS: NixOS
Additional context
https://github.com/NixOS/nixpkgs/pull/337212#discussion_r1734066365
We're currently trying to add amazon-cloudwatch-agent to the Nix package manager and a systemd unit to NixOS.
This currently involves rewriting the systemd configuration provided in this repository, which can't be used on NixOS as-is: it calls start-amazon-cloudwatch-agent, which hardcodes the agent installation directory.
https://github.com/aws/amazon-cloudwatch-agent/blob/v1.300045.0/packaging/dependencies/amazon-cloudwatch-agent.service
https://github.com/aws/amazon-cloudwatch-agent/issues/1319
The resulting systemd configuration looks approximately like this:
[Unit]
Description=Amazon CloudWatch Agent
After=network.target
[Service]
Type=simple
RuntimeDirectory=amazon-cloudwatch-agent
LogsDirectory=amazon-cloudwatch-agent
ExecStartPre={install directory}/bin/config-translator \
    -config {...}/common-config.toml \
    -input {...}/amazon-cloudwatch-agent.json \
    -input-dir {...}/amazon-cloudwatch-agent.d \
    -output ${RUNTIME_DIRECTORY}/amazon-cloudwatch-agent.toml
ExecStart={install directory}/bin/amazon-cloudwatch-agent \
    -config ${RUNTIME_DIRECTORY}/amazon-cloudwatch-agent.toml \
    -envconfig ${RUNTIME_DIRECTORY}/env-config.json \
    -otelconfig ${RUNTIME_DIRECTORY}/amazon-cloudwatch-agent.yaml \
    -pidfile ${RUNTIME_DIRECTORY}/amazon-cloudwatch-agent.pid
KillMode=process
Restart=on-failure
RestartSec=60s
[Install]
WantedBy=multi-user.target
This effectively does the same thing as start-amazon-cloudwatch-agent but without the path hardcoding.
Like start-amazon-cloudwatch-agent, this will always pass the -otelconfig option to amazon-cloudwatch-agent even if config-translator deletes the expected amazon-cloudwatch-agent.yaml file.
This was uncovered when running a NixOS test for this systemd unit which:
- Starts a VM running NixOS with the agent as a systemd service. The agent is in onPremise mode without any log, metric, or trace configurations.
- Waits for the agent service to be active.
- Checks for the configuration files generated by config-translator and the PID file generated by the agent.
We noticed the agent was repeatedly crashing right after systemd started it. Checking the agent logs revealed the file-not-found error shown above.