node-exporter-textfile-collector-scripts icon indicating copy to clipboard operation
node-exporter-textfile-collector-scripts copied to clipboard

mellanox_hca_temp: suppress errors when virtual functions present

Open ik5pvx opened this issue 1 year ago • 2 comments

VFs appear as additional infiniband devices, but obviously don't report temperatures. The logs are then flooded with:

Dec 17 00:00:53 penny sh[1151952]: mopen: Operation not supported
Dec 17 00:00:53 penny sh[1151947]: mellanox_hca_temp: Failed to get temperature from InfiniBand HCA 'mlx5_2'!
Dec 17 00:00:53 penny sh[1151953]: mopen: Operation not supported
Dec 17 00:00:53 penny sh[1151947]: mellanox_hca_temp: Failed to get temperature from InfiniBand HCA 'mlx5_3'!

and so on.

The only clue I could find to recognise them as virtual is that node_guid is 0000:0000:0000:0000. I'm not sure if this is supposed to change when setting the mac address on the interfaces.

So far, with the virtual function interfaces unconfigured, the following patch suppresses the errors for me:

--- mellanox_hca_temp.orig      2021-06-27 08:55:33.406292246 +0200
+++ mellanox_hca_temp   2023-12-22 15:18:46.072149247 +0100
@@ -41,6 +41,10 @@
     if test ! -d "$dev"; then
         continue
     fi
+    # node_guid is all zeros for Virtual Functions, which report no temp.
+    if [ "$(cat $dev/node_guid)" = "0000:0000:0000:0000" ]; then
+       continue
+    fi
     device="${dev##*/}"
 
     # get temperature

ik5pvx avatar Dec 22 '23 14:12 ik5pvx