node_exporter

Limit which "pfs" when querying mellanox cards

Open martbhell opened this issue 3 months ago • 2 comments

https://docs.nvidia.com/dgx/dgx-os-7-user-guide/known_issues.html#id16

With this note:

Accessing PF0 and PF1 is restricted. Currently, there is no temporary solution available.

[11176.517416] mlx5_core 0000:05:00.0: mlx5_cmd_out_err:835:(pid 18360): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
[11176.534892] mlx5_core 0000:05:00.0: mlx5_cmd_out_err:835:(pid 18360): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
[11176.589308] mlx5_core 0000:05:00.1: mlx5_cmd_out_err:835:(pid 10354): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
[11176.607052] mlx5_core 0000:05:00.1: mlx5_cmd_out_err:835:(pid 10354): ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), syndrome (0x9a6171), err(-22)
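To see which PCI functions are producing these messages, the log lines can be tallied per address. The following is a minimal sketch, assuming the kernel keeps the line format shown above; the regular expression, its field names, and `failures_per_function` are illustrative only and are not part of node_exporter or the mlx5 driver.

```python
import re
from collections import Counter

# Matches the mlx5_core ACCESS_REG failure lines quoted in this issue.
# The group names (ts, pci, reg, err) are our own labels, not kernel terminology.
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+mlx5_core\s+"
    r"(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):"
    r".*?ACCESS_REG\((?P<reg>0x[0-9a-f]+)\)"
    r".*?err\((?P<err>-?\d+)\)"
)


def failures_per_function(lines):
    """Count ACCESS_REG failures per PCI address (domain:bus:device.function)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.group("pci")] += 1
    return counts


# Two of the lines from the report above, one per physical function.
SAMPLE = [
    "[11176.517416] mlx5_core 0000:05:00.0: mlx5_cmd_out_err:835:(pid 18360): "
    "ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), "
    "syndrome (0x9a6171), err(-22)",
    "[11176.589308] mlx5_core 0000:05:00.1: mlx5_cmd_out_err:835:(pid 10354): "
    "ACCESS_REG(0x805) op_mod(0x1) failed, status bad operation(0x2), "
    "syndrome (0x9a6171), err(-22)",
]

if __name__ == "__main__":
    for pci, count in sorted(failures_per_function(SAMPLE).items()):
        print(pci, count)
```

In practice the input could be the lines of `dmesg` output. In the excerpt above, both failing addresses, 0000:05:00.0 and 0000:05:00.1, are functions 0 and 1 of the same device.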

I couldn't find an existing GitHub issue about this here. Could there be a way to limit which PFs (PCI physical functions) node_exporter tries to query?
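One way such a limit could work is an exclude pattern matched against the PCI address behind each device. The sketch below is purely hypothetical: it is not an existing node_exporter flag or code path, and `discover`, `select_devices`, and the device names in the usage note are made up for illustration. It assumes that `/sys/class/infiniband/<device>/device` is a symlink to the PCI device directory.

```python
import os
import re


def discover(sysfs_root="/sys/class/infiniband"):
    """Map each InfiniBand device name to the PCI address it sits on.

    Assumes <sysfs_root>/<name>/device is a symlink into the PCI device tree.
    Returns an empty mapping when the sysfs directory does not exist.
    """
    found = {}
    if not os.path.isdir(sysfs_root):
        return found
    for name in os.listdir(sysfs_root):
        link = os.path.join(sysfs_root, name, "device")
        if os.path.exists(link):
            found[name] = os.path.basename(os.path.realpath(link))
    return found


def select_devices(devices, exclude_pattern):
    """Return the device names whose PCI address does NOT match the pattern.

    devices: mapping of device name -> PCI address, e.g. from discover().
    exclude_pattern: regular expression matched against the whole address.
    """
    exclude = re.compile(exclude_pattern)
    return sorted(
        name for name, addr in devices.items() if not exclude.fullmatch(addr)
    )


if __name__ == "__main__":
    # Skip the two restricted functions seen in the kernel log of this issue.
    print(select_devices(discover(), r"0000:05:00\.[01]"))
```

With a hypothetical mapping of `mlx5_0` at 0000:05:00.0, `mlx5_1` at 0000:05:00.1, and `mlx5_2` at 0000:41:00.0, the pattern `0000:05:00\.[01]` would leave only `mlx5_2` to be queried.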

martbhell, Oct 02 '25 05:10

I have disabled ethtool metric collection for now, and the log messages are no longer printed. I hope a better solution becomes available.

zheng199512, Nov 18 '25 10:11

To get rid of the kernel messages, I needed to disable the infiniband collector with --no-collector.infiniband. We aren't using collector.ethtool, and disabling all the other net* collectors did not get rid of the messages.

martbhell, Dec 02 '25 08:12