infiniband_exporter
panic when collector ibswinfo is enabled (docker container)
Hi, I'm running the latest docker container. When I enable collector=ibswinfo, I get a panic when I request metrics, and no metrics are returned. Without that collector enabled the container works well and I'm able to scrape the basic set of metrics.
created by net/http.(*Server).Serve in goroutine 1
/usr/local/go/src/net/http/server.go:3285 +0x4b4
2024/08/15 14:40:34 http: panic serving x.x.x.x:36488: descriptor Desc{fqName: "infiniband_switch_collect_timeout", help: "Indicates if collect timeout", constLabels: {}, variableLabels: {guid,collector}} already exists with the same fully-qualified name and const label values
goroutine 15 [running]:
net/http.(*conn).serve.func1()
/usr/local/go/src/net/http/server.go:1898 +0xbe
panic({0x8ae680?, 0xc000716bf0?})
/usr/local/go/src/runtime/panic.go:770 +0x132
What is the full command used to launch the exporter? If you have HCA or switch collection enabled, try passing --no-collector.hca or --no-collector.switch and report which metrics are returned for infiniband_switch_collect_timeout. It's not clear how there could be duplicate metrics for the timeout, since each collector has a unique collector label.
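For readers puzzled by the same point: the panic message says the descriptor already exists "with the same fully-qualified name and const label values", and in the log above `collector` is a variable label while constLabels is empty. The sketch below is a simplified, standard-library-only model of that kind of duplicate check, not the Prometheus client library's code and not necessarily what was fixed in #32. The `desc` and `registry` types and all function names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// desc is a simplified stand-in for a metric descriptor
// (hypothetical type, not the client library's own).
type desc struct {
	fqName         string
	constLabels    map[string]string
	variableLabels []string
}

// id builds the identity used for duplicate detection in this model:
// the fully-qualified name plus the const label values.
// Variable labels are deliberately not part of it.
func (d desc) id() string {
	keys := make([]string, 0, len(d.constLabels))
	for k := range d.constLabels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := []string{d.fqName}
	for _, k := range keys {
		parts = append(parts, k+"="+d.constLabels[k])
	}
	return strings.Join(parts, "\xff")
}

// registry is a toy registry that rejects a collector declaring a
// descriptor identity that another collector has already declared.
type registry struct {
	seen map[string]string // descriptor id -> collector that declared it
}

func newRegistry() *registry {
	return &registry{seen: map[string]string{}}
}

func (r *registry) register(collector string, descs ...desc) error {
	for _, d := range descs {
		if owner, ok := r.seen[d.id()]; ok {
			return fmt.Errorf("descriptor %q from %s already exists (declared by %s)",
				d.fqName, collector, owner)
		}
	}
	for _, d := range descs {
		r.seen[d.id()] = collector
	}
	return nil
}

// timeoutDesc mirrors the descriptor in the panic message:
// no const labels, two variable labels.
func timeoutDesc() desc {
	return desc{
		fqName:         "infiniband_switch_collect_timeout",
		variableLabels: []string{"guid", "collector"},
	}
}

func main() {
	r := newRegistry()
	// The first collector declares the descriptor without trouble.
	fmt.Println(r.register("switch", timeoutDesc()))
	// The second collector declares the same name with the same (empty)
	// const labels, so the toy registry reports a conflict.
	fmt.Println(r.register("ibswinfo", timeoutDesc()))
}
```

In this model, distinct values of the `collector` label at collection time do not help, because the conflict is detected when the descriptor itself is declared.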
Here is the command we use, pretty much from your example except added the collector flags:
docker run -d -p 9315:9315 \
--name infiniband_exporter \
--cap-add=IPC_LOCK \
--device=/dev/infiniband/umad0 \
treydock/infiniband_exporter --collector.switch --collector.ibswinfo
When I change to --no-collector.switch, the daemon no longer panics when serving metrics. Now I see an error message in the log about ibswinfo being missing, repeated for a bunch of switches. Is ibswinfo not in the container?
ts=2024-08-16T20:50:34.917Z caller=ibswinfo.go:210 level=error collector=ibswinfo msg="Error collecting ibswinfo data" err="exec: \"ibswinfo\": executable file not found in $PATH:" guid=0xf4521403000c5110 lid=303
What metrics are present related to infiniband_switch_collect_timeout? You should be able to do something like this when launched with --no-collector.switch:
curl http://localhost:9315/metrics | grep infiniband_switch_collect_timeout
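If grep is not handy, the same filtering can be sketched in Go over the text returned by the endpoint. This is only an illustration: `filterMetric` is a hypothetical helper, and the sample text in `main` is made up for the example (only the metric name and the guid are taken from this thread), not real exporter output.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// filterMetric returns the sample lines in Prometheus text exposition
// output that belong to the named metric. Comment lines (# HELP, # TYPE)
// are skipped, and metrics that merely share a prefix are not matched.
func filterMetric(exposition, name string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "#") {
			continue
		}
		if strings.HasPrefix(line, name+"{") || strings.HasPrefix(line, name+" ") {
			out = append(out, line)
		}
	}
	return out
}

// sample is invented input for demonstration purposes only.
const sample = `# HELP infiniband_switch_collect_timeout Indicates if collect timeout
# TYPE infiniband_switch_collect_timeout gauge
infiniband_switch_collect_timeout{collector="ibswinfo",guid="0xf4521403000c5110"} 0
infiniband_exporter_example_other_metric 1
`

func main() {
	for _, line := range filterMetric(sample, "infiniband_switch_collect_timeout") {
		fmt.Println(line)
	}
}
```

Unlike a plain grep, this drops the HELP and TYPE lines, which makes it easier to spot two samples with the same name and label set.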
I'm curious why there are metric conflicts when errors occur.
ibswinfo was not added to the container, and I'm not sure it will work there. I'll attempt to test, but I have limited access to hosts with Docker on our IB fabric.
I found the problem and am testing fixes in #32
Thanks for looking into it! What I ended up doing in the short term is running the docker container with just the switch collector, and then running another instance (outside docker) with ibswinfo but no switch collector. It works but I'll give the fixed package a try when I can.
This is part of https://github.com/treydock/infiniband_exporter/releases/tag/v0.10.0-rc.1. The new docker image is also pushed with the v0.10.0-rc.1 tag. I need to do more testing and also merge another PR that needs some extra testing before there will be a non-RC release.