node_exporter systemd unit file incorrectly formatted when using sysctl.include collector
Bug Summary
Installing node exporter on an EC2 instance configured with the Amazon Linux 2 AMI (systemd 219) fails:
TASK [prometheus.prometheus.node_exporter : Ensure Node Exporter is enabled on boot] ***
fatal: [default]: FAILED! => {"changed": false, "msg": "Error loading unit file 'node_exporter': org.freedesktop.DBus.Error.InvalidArgs \"Invalid argument\""}
Here's the playbook being used:
- hosts: 127.0.0.1
  vars:
    node_exporter_enabled_collectors:
      - sysctl:
          include:
            vm:
              - overcommit_memory
              - overcommit_ratio
              - dirty_background_bytes
              - dirty_background_bytes
              - dirty_background_ratio
              - dirty_bytes
              - dirty_expire_centisecs
              - dirty_ratio
              - swappiness
  roles:
    - prometheus.prometheus.node_exporter
Upon further investigation, it appears the systemd unit file becomes malformed when attempting to wrap the sysctl.include collector in single quotes:
#
# Ansible managed
#
[Unit]
Description=Prometheus Node Exporter
After=network-online.target
[Service]
Type=simple
User=node-exp
Group=node-exp
ExecStart=/usr/local/bin/node_exporter \
'--collector.sysctl.include={'vm': ['overcommit_memory', 'overcommit_ratio', 'dirty_background_bytes', 'dirty_background_bytes', 'dirty_background_ratio', 'dirty_bytes', 'dirty_expire_centisecs', 'dirty_ratio', 'swappiness']}'
SyslogIdentifier=node_exporter
Restart=always
RestartSec=1
StartLimitInterval=0
ProtectHome=yes
NoNewPrivileges=yes
ProtectSystem=full
[Install]
WantedBy=multi-user.target
On line 14, you can see that the single-quote wrapping is terminated prematurely once it reaches 'vm', because the rendered value itself contains single quotes. More details can be found by checking the status of the service or using journalctl:
$ sudo systemctl status node_exporter
● node_exporter.service - Prometheus Node Exporter
Loaded: error (Reason: Invalid argument)
Active: failed (Result: resources) since Wed 2024-04-24 15:37:55 UTC; 20s ago
Main PID: 2443 (code=killed, signal=KILL)
Apr 24 15:37:54 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: node_exporter.service: main process exited, code=killed, status=9/KILL
Apr 24 15:37:54 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: Unit node_exporter.service entered failed state.
Apr 24 15:37:54 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: node_exporter.service failed.
Apr 24 15:37:55 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: node_exporter.service holdoff time over, scheduling restart.
Apr 24 15:37:55 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: node_exporter.service failed to schedule restart job: Unit is not loaded properly: Invalid argument.
Apr 24 15:37:55 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: Unit node_exporter.service entered failed state.
Apr 24 15:37:55 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: node_exporter.service failed.
Apr 24 15:38:12 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: [/etc/systemd/system/node_exporter.service:13] Trailing garbage, ignoring.
Apr 24 15:38:12 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: node_exporter.service lacks both ExecStart= and ExecStop= setting. Refusing.
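The premature quote termination can be illustrated outside systemd. The following is a minimal Python sketch, not systemd's actual unit-file parser. It assumes the template interpolates the include dict using its default string form, which matches the generated ExecStart argument shown earlier. The helper name leading_quoted_span is hypothetical, and the dict is shortened to two entries for readability.

```python
def leading_quoted_span(arg):
    """Return (span, rest): the text inside the leading quoted span and
    whatever follows its closing quote.

    Illustration only; this is not systemd's real unit-file parser.
    """
    quote = arg[0]
    if quote not in ("'", '"'):
        raise ValueError("argument does not start with a quote")
    end = arg.find(quote, 1)  # first quote that matches the opening one
    if end == -1:
        raise ValueError("unterminated quote")
    return arg[1:end], arg[end + 1:]


# The include dict from the playbook, shortened for readability.
include = {"vm": ["overcommit_memory", "swappiness"]}

# str() of a dict puts single quotes around string keys and values.
value = "--collector.sysctl.include=" + str(include)

# Wrap the argument in single quotes, as the generated unit file does.
single = "'" + value + "'"
span, rest = leading_quoted_span(single)
print(repr(span))  # the quoted span ends right before vm
print(repr(rest))  # the remainder is left outside the quotes
```

Under these assumptions the quoted span ends at the opening brace, and everything from vm onward falls outside it, which is consistent with the "Trailing garbage" message in the journal.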
Proposed Solution
This can be fixed by using double quotes to wrap each collector argument passed to node_exporter in the node_exporter.service.j2 template:
#
# Ansible managed
#
[Unit]
Description=Prometheus Node Exporter
After=network-online.target
[Service]
Type=simple
User=node-exp
Group=node-exp
ExecStart=/usr/local/bin/node_exporter \
"--collector.sysctl.include={'vm': ['overcommit_memory', 'overcommit_ratio', 'dirty_background_bytes', 'dirty_background_bytes', 'dirty_background_ratio', 'dirty_bytes', 'dirty_expire_centisecs', 'dirty_ratio', 'swappiness']}"
SyslogIdentifier=node_exporter
Restart=always
RestartSec=1
StartLimitInterval=0
ProtectHome=yes
NoNewPrivileges=yes
ProtectSystem=full
[Install]
WantedBy=multi-user.target
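As a rough illustration of the double-quote approach, here is a minimal Python sketch that renders an ExecStart= entry with one double-quoted argument per line. It is not the role's Jinja2 template: render_exec_start is a hypothetical helper, and the guard against embedded double quotes is an added assumption about what safe wrapping would need. The dict is again shortened to two entries.

```python
def render_exec_start(binary, args):
    """Render an ExecStart= entry with one double-quoted argument per line.

    Hypothetical helper for illustration; not the role's Jinja2 template.
    """
    for arg in args:
        if '"' in arg:
            # Double-quote wrapping only stays intact if the value has no
            # double quote of its own.
            raise ValueError(f"argument contains a double quote: {arg!r}")
    lines = [f"ExecStart={binary}"] + [f'    "{arg}"' for arg in args]
    # Join with a continuation backslash, so every line except the last
    # ends in one.
    return " \\\n".join(lines)


# The include dict from the playbook, shortened for readability.
include = {"vm": ["overcommit_memory", "swappiness"]}
args = ["--collector.sysctl.include=" + str(include)]

unit_fragment = render_exec_start("/usr/local/bin/node_exporter", args)
print(unit_fragment)
```

In this sketch the single quotes produced by the dict's string form sit inside the double quotes, so the argument stays one quoted span. The last argument line carries no continuation backslash, so the directive that follows is not folded into ExecStart=.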