node_exporter
netclass collector fails while trying to get metrics from ignored devices
Host operating system: output of uname -a
Linux hostname 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
node_exporter version: output of node_exporter --version
node_exporter, version 1.0.1 (branch: HEAD, revision: 3715be6ae899f2a9b9dbfd9c39f3e09a7bd4559f)
build user: root@1f76dbbcfa55
build date: 20200616-12:44:12
go version: go1.14.4
node_exporter command line flags
--collector.textfile.directory=/var/lib/node_exporter/textfile_collector
--collector.ntp
--collector.processes
--collector.netclass.ignored-devices='^veth.*'
--collector.netdev.device-blacklist='^veth.*'
Are you running node_exporter in Docker?
no
What did you do that produced an error?
Docker removed a veth interface, and that triggered a netclass scrape collector failure.
What did you expect to see?
I would expect ignored devices (veth.* in this case) not to trigger a netclass scrape collector failure.
In fact, node_exporter shouldn't even attempt to get metrics for these devices, as that is useless.
What did you see instead?
# cat /var/log/syslog | grep "veth67506fc"
Dec 17 05:15:57 hostname systemd-udevd[44530]: Could not generate persistent MAC address for veth67506fc: No such file or directory
Dec 17 05:15:58 hostname kernel: [9799969.809182] eth1: renamed from veth67506fc
Dec 17 05:15:58 hostname node_exporter[6459]: level=error ts=2020-12-17T05:15:58.486Z caller=collector.go:161 msg="collector failed" name=netclass duration_seconds=0.044157776 err="could not get net class info: error obtaining net class info: open /sys/class/net/veth67506fc: no such file or directory"
Dec 17 05:15:59 hostname kernel: [9799970.764950] veth67506fc: renamed from eth1
Metrics for these devices are not exposed by node_exporter, as expected:
$ ifconfig | grep veth
veth06454cb: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
veth09549bc: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
veth139b970: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
veth18767aa: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
veth1ce105e: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
veth2a93f53: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
veth2f43cdf: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
veth3ea67ac: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
veth52036bb: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
veth68de7c9: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
veth78836e8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
veth791ccd1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
veth79cf6f7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
vethfb80213: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
$ curl -s localhost:9100/metrics | grep veth
$
I'm working on this issue.
While both linked PRs fix this for ignored interfaces, I think the underlying issue persists. Since we get a directory listing and then try to open files based on it without any locking, we cannot rely on the files still being present.
I believe this can also happen for devices we don't ignore, when a device (or rather its directory on sysfs) is removed between listing /sys/class/net and attempting to open the respective device files.
Maybe simply logging the failure and skipping over the device here could make this more resilient?
Agreed, this will still be a problem for short-lived interfaces that are not ignored. But I'm not convinced that ignoring every error is the best solution. We could check for ErrNotExist and ignore only that, though. @SuperQ wdyt?
Are there any updates? I'm running into the same issue.