
netclass collector fails while trying to get metrics from ignored devices

karlism opened this issue 3 years ago • 4 comments

Host operating system: output of uname -a

Linux hostname 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.0.1 (branch: HEAD, revision: 3715be6ae899f2a9b9dbfd9c39f3e09a7bd4559f)
  build user:       root@1f76dbbcfa55
  build date:       20200616-12:44:12
  go version:       go1.14.4

node_exporter command line flags

--collector.textfile.directory=/var/lib/node_exporter/textfile_collector
--collector.ntp
--collector.processes
--collector.netclass.ignored-devices='^veth.*'
--collector.netdev.device-blacklist='^veth.*'

Are you running node_exporter in Docker?

no

What did you do that produced an error?

Docker removed a veth interface, which triggered a netclass collector scrape failure.

What did you expect to see?

I would expect ignored devices (veth.* in this case) not to trigger a netclass collector scrape failure. In fact, node_exporter shouldn't even attempt to gather metrics for these devices, since the results are discarded anyway.

What did you see instead?

# cat /var/log/syslog | grep "veth67506fc"
Dec 17 05:15:57 hostname systemd-udevd[44530]: Could not generate persistent MAC address for veth67506fc: No such file or directory
Dec 17 05:15:58 hostname kernel: [9799969.809182] eth1: renamed from veth67506fc
Dec 17 05:15:58 hostname node_exporter[6459]: level=error ts=2020-12-17T05:15:58.486Z caller=collector.go:161 msg="collector failed" name=netclass duration_seconds=0.044157776 err="could not get net class info: error obtaining net class info: open /sys/class/net/veth67506fc: no such file or directory"
Dec 17 05:15:59 hostname kernel: [9799970.764950] veth67506fc: renamed from eth1

Metrics for these devices are not exposed by node_exporter, as expected:

$ ifconfig | grep veth
veth06454cb: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth09549bc: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth139b970: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth18767aa: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth1ce105e: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth2a93f53: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth2f43cdf: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth3ea67ac: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth52036bb: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth68de7c9: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth78836e8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth791ccd1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
veth79cf6f7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
vethfb80213: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
$ curl -s localhost:9100/metrics | grep veth
$

karlism avatar Dec 17 '20 06:12 karlism

I'm working on this issue.

corny avatar Apr 11 '21 09:04 corny

While both linked PRs fix this for ignored interfaces, I think the issue itself persists. Since we get a directory listing and then, based on that, try to open files without any locking, we cannot rely on the files still being present. I believe this can also happen for devices we don't ignore, when a device (or rather its directory in sysfs) is removed between listing /sys/class/net and attempting to open the respective device files. Maybe simply logging the failure and skipping over the device here could make this more resilient?

jan--f avatar Apr 19 '21 14:04 jan--f

Agreed, this will still be a problem for short-lived interfaces that are not ignored. I'm not convinced that ignoring all errors is the best solution, but we could check for ErrNotExist and ignore just that. @SuperQ wdyt?

discordianfish avatar Apr 26 '21 09:04 discordianfish

Are there any updates? I'm hitting the same issue.

paradox-lab avatar Mar 23 '23 02:03 paradox-lab