Missing 'GPU' entries in metrics

Open roscisz opened this issue 5 years ago • 8 comments

Hmm, I cannot see the GPUs: even when I click on the "+" sign, nothing happens. I guess the HTTP request used is "/api/0.3.1/nodes/metrics"? If so, here are the contents of the response:

{
  "gpu1.***.com": {
    "CPU": {
      "CPU_gpu1.***.com": {
        "index": 0,
        "metrics": {
          "mem_free": {
            "unit": "MiB",
            "value": 1295
          },
          "mem_total": {
            "unit": "MiB",
            "value": 192925
          },
          "mem_used": {
            "unit": "MiB",
            "value": 152775
          },
          "utilization": {
            "unit": "%",
            "value": 6.15449
          }
        }
      }
    }
  },
  "gpu2.***.com": {
    "CPU": null
  },
  "gpu3.***.com": {
    "CPU": {
      "CPU_gpu3.***.com": {
        "index": 0,
        "metrics": {
          "mem_free": {
            "unit": "MiB",
            "value": 1039
          },
          "mem_total": {
            "unit": "MiB",
            "value": 192925
          },
          "mem_used": {
            "unit": "MiB",
            "value": 40694
          },
          "utilization": {
            "unit": "%",
            "value": 1.45833
          }
        }
      }
    }
  }
}

I replaced all hostnames with fake ones.
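To spot affected nodes quickly, a response like the one above can be scanned for missing or null resource entries. This is only a minimal sketch: the helper name `hosts_missing_resource` is made up for illustration, and the sample is a shortened copy of the response (metric values dropped), not a new measurement.

```python
import json


def hosts_missing_resource(metrics, resource="GPU"):
    """Return hostnames whose entry for `resource` is absent, null or empty."""
    return sorted(
        host
        for host, resources in metrics.items()
        if not (resources or {}).get(resource)
    )


# Shortened copy of the response above; only the structure matters here.
sample = json.loads("""
{
  "gpu1.***.com": {"CPU": {"CPU_gpu1.***.com": {"index": 0}}},
  "gpu2.***.com": {"CPU": null},
  "gpu3.***.com": {"CPU": {"CPU_gpu3.***.com": {"index": 0}}}
}
""")

print(hosts_missing_resource(sample, "GPU"))  # every host lacks a "GPU" key
print(hosts_missing_resource(sample, "CPU"))  # only gpu2 reports null
```

For this response, no host has a "GPU" entry at all, and gpu2 additionally has a null "CPU" entry.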

Here are the free -m results on all my machines (the OS is CentOS 7 on all of them):

[root@gpu1 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:         192925      166473        7244        5040       19207       19968
Swap:         15931       14994         937
[root@gpu2 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:         192925      158857        1487        6334       32579       26228
Swap:         15931        7748        8183
[root@gpu3 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:         192925       40692        1036       28797      151196      122702
Swap:         15931           0       15931

Also nvidia-smi:

[root@gpu1 ~]# nvidia-smi
Tue Sep  1 14:23:26 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:14:00.0 Off |                    0 |
| N/A   28C    P0    49W / 250W |  21956MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 00000000:15:00.0 Off |                    0 |
| N/A   36C    P0    49W / 250W |  14135MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 00000000:39:00.0 Off |                    0 |
| N/A   54C    P0   163W / 250W |  21849MiB / 22919MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 00000000:3A:00.0 Off |                    0 |
| N/A   52C    P0   177W / 250W |  21886MiB / 22919MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P40           Off  | 00000000:88:00.0 Off |                    0 |
| N/A   23C    P8     9W / 250W |     10MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P40           Off  | 00000000:89:00.0 Off |                    0 |
| N/A   30C    P0    50W / 250W |  18663MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P40           Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   25C    P8    10W / 250W |     10MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P40           Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   22C    P8     9W / 250W |     10MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4301      C   ...er_conda/miniconda/envs/CCR/bin/python3 18787MiB |
|    0     44728      C   ...se180025/.conda/envs/ner_env/bin/python  3157MiB |
|    1     16768      C   ...er_conda/miniconda/envs/CCR/bin/python3 14125MiB |
|    2     39814      C   ...se170020/.conda/envs/GPUtest/bin/python 21837MiB |
|    3      4301      C   ...er_conda/miniconda/envs/CCR/bin/python3   619MiB |
|    3     23651      C   ...se170020/.conda/envs/GPUtest/bin/python 21255MiB |
|    5      4301      C   ...er_conda/miniconda/envs/CCR/bin/python3 18653MiB |
+-----------------------------------------------------------------------------+

[root@gpu2 ~]# nvidia-smi
Tue Sep  1 14:23:53 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:14:00.0 Off |                    0 |
| N/A   28C    P0    48W / 250W |  21926MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 00000000:15:00.0 Off |                    0 |
| N/A   33C    P0    49W / 250W |    306MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 00000000:39:00.0 Off |                    0 |
| N/A   29C    P0    49W / 250W |    306MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 00000000:3A:00.0 Off |                    0 |
| N/A   28C    P0    49W / 250W |    306MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P40           Off  | 00000000:88:00.0 Off |                    0 |
| N/A   33C    P0    48W / 250W |    306MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P40           Off  | 00000000:89:00.0 Off |                    0 |
| N/A   25C    P0    48W / 250W |    306MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P40           Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   52C    P0   133W / 250W |  21867MiB / 22919MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P40           Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   47C    P0   207W / 250W |  21867MiB / 22919MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    0      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3 21767MiB |
|    1      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    1      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    2      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    2      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    3      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    3      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    4      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    4      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    5      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    5      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    6      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    6      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    6     27039      C   ...se170020/.conda/envs/GPUtest/bin/python 21559MiB |
|    7      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    7      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    7     23491      C   ...se170020/.conda/envs/GPUtest/bin/python 21559MiB |
+-----------------------------------------------------------------------------+

[root@gpu3 ~]# nvidia-smi
Tue Sep  1 14:24:13 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:14:00.0 Off |                    0 |
| N/A   31C    P0    49W / 250W |  21950MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 00000000:15:00.0 Off |                    0 |
| N/A   40C    P0    48W / 250W |  21923MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 00000000:39:00.0 Off |                    0 |
| N/A   34C    P0    49W / 250W |  21923MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 00000000:3A:00.0 Off |                    0 |
| N/A   31C    P0    49W / 250W |   1699MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P40           Off  | 00000000:88:00.0 Off |                    0 |
| N/A   24C    P8     9W / 250W |     10MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P40           Off  | 00000000:89:00.0 Off |                    0 |
| N/A   26C    P8     9W / 250W |     10MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P40           Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   28C    P8     9W / 250W |     10MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P40           Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   25C    P8     9W / 250W |     10MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      6714      C   /opt/gpudb/core/bin/gpudb_cluster_cuda       501MiB |
|    0      6716    C+G   /opt/gpudb/core/bin/gpudb_cluster_cuda     21430MiB |
|    1      6718    C+G   /opt/gpudb/core/bin/gpudb_cluster_cuda     21904MiB |
|    2      6720    C+G   /opt/gpudb/core/bin/gpudb_cluster_cuda     21904MiB |
|    3      4062      C   /opt/conda/envs/rapids/bin/python            319MiB |
|    3     28474      C   /opt/conda/envs/rapids/bin/python            377MiB |
|    3     31448      C   /opt/conda/envs/rapids/bin/python            271MiB |
|    3     33893      C   /opt/conda/envs/rapids/bin/python            229MiB |
|    3     47777      C   /opt/conda/envs/rapids/bin/python            399MiB |
+-----------------------------------------------------------------------------+

Thanks for your help :)

Originally posted by @Dubrzr in https://github.com/roscisz/TensorHive/issues/286#issuecomment-684814071

roscisz avatar Sep 01 '20 12:09 roscisz

@Dubrzr could you please provide the output of the following command on your gpu2 server:

awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 100 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat);free -m | awk 'NR==2'
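For readers who find the one-liner hard to follow, here is a rough Python equivalent of its CPU part. It takes two snapshots of the aggregate "cpu " line from /proc/stat and reports the share of user + system time in user + system + idle time between them, which is what the awk expression computes from fields $2, $4 and $5. The function names are made up for this sketch, and it is not necessarily how TensorHive itself gathers the value.

```python
import time


def busy_and_total(cpu_line):
    """Extract (busy, total) jiffies from an aggregate 'cpu ' line of /proc/stat."""
    fields = cpu_line.split()
    # fields[0] is the label "cpu"; awk's $2, $4, $5 are user, system, idle.
    user, system, idle = int(fields[1]), int(fields[3]), int(fields[4])
    busy = user + system
    return busy, busy + idle


def cpu_utilization(first_line, second_line):
    """Percentage of busy time between two snapshots of the 'cpu ' line."""
    busy1, total1 = busy_and_total(first_line)
    busy2, total2 = busy_and_total(second_line)
    return (busy2 - busy1) * 100 / (total2 - total1)


def read_cpu_line(path="/proc/stat"):
    """Return the aggregate 'cpu ' line (Linux only)."""
    with open(path) as stat:
        for line in stat:
            if line.startswith("cpu "):
                return line
    raise RuntimeError("no aggregate 'cpu ' line found in " + path)


def measure(interval=1.0):
    """Sample /proc/stat twice, `interval` seconds apart, like the awk command."""
    first = read_cpu_line()
    time.sleep(interval)
    return cpu_utilization(first, read_cpu_line())
```

Calling `measure()` on a Linux host should give a number comparable to the first line printed by the command above; the `free -m | awk 'NR==2'` part simply prints the "Mem:" row.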

roscisz avatar Sep 01 '20 12:09 roscisz

It looks like it works fine:

[root@gpu1 ~]# awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 100 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat);free -m | awk 'NR==2'
3.08784
Mem:         192925      166538        7237        5040       19150       19906
[root@gpu2 ~]# awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 100 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat);free -m | awk 'NR==2'
6.61519
Mem:         192925      159153        1300        6254       32471       26015
[root@gpu3 ~]# awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 100 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat);free -m | awk 'NR==2'
1.24818
Mem:         192925       40694        1031       28797      151198      122699

Dubrzr avatar Sep 01 '20 12:09 Dubrzr

Thanks... wrong intuition then...

This is indeed the right endpoint. My sample output:

{
  "ai": {
    "CPU": {
      "CPU_ai": {
        "index": 0,
        "metrics": {
          "mem_free": {
            "unit": "MiB",
            "value": 30806
          },
          "mem_total": {
            "unit": "MiB",
            "value": 257868
          },
          "mem_used": {
            "unit": "MiB",
            "value": 44770
          },
          "utilization": {
            "unit": "%",
            "value": 5.62359
          }
        }
      }
    },
    "GPU": {
      "GPU-5db488ff-6728-fb07-93be-ee423d4ab086": {
        "index": 3,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 1824
          },
          "mem_total": {
            "unit": "MiB",
            "value": 16128
          },
          "mem_used": {
            "unit": "MiB",
            "value": 14304
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "62.97"
          },
          "temp": 46,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla V100-DGXS-16GB",
        "processes": [
          {
            "command": "python3",
            "owner": "macsakow",
            "pid": 22294
          }
        ]
      },
      "GPU-6602a81f-14bf-4ee5-2992-45a7e519802b": {
        "index": 1,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 16117
          },
          "mem_total": {
            "unit": "MiB",
            "value": 16128
          },
          "mem_used": {
            "unit": "MiB",
            "value": 11
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "39.85"
          },
          "temp": 44,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla V100-DGXS-16GB",
        "processes": [
          {
            "command": "-",
            "owner": null,
            "pid": "-"
          }
        ]
      },
      "GPU-cc813906-1a7c-e941-ccf8-1a40cd99ccfb": {
        "index": 2,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 1824
          },
          "mem_total": {
            "unit": "MiB",
            "value": 16128
          },
          "mem_used": {
            "unit": "MiB",
            "value": 14304
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "52.27"
          },
          "temp": 45,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla V100-DGXS-16GB",
        "processes": [
          {
            "command": "python3",
            "owner": "macsakow",
            "pid": 22233
          }
        ]
      },
      "GPU-d1ae8368-a34f-3afc-14aa-01afbc0fa787": {
        "index": 0,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 16113
          },
          "mem_total": {
            "unit": "MiB",
            "value": 16125
          },
          "mem_used": {
            "unit": "MiB",
            "value": 12
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "39.27"
          },
          "temp": 44,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla V100-DGXS-16GB",
        "processes": [
          {
            "command": "-",
            "owner": null,
            "pid": "-"
          }
        ]
      }
    }
  }
}

@micmarty: any ideas? While nvidia-smi output looks fine and there is no error message from GPUMonitor, there are no 'GPU' entries in metrics.

roscisz avatar Sep 01 '20 13:09 roscisz

@Dubrzr: and how about this command:

nvidia-smi --query-gpu=name,fan.speed,utilization.gpu --format=csv,nounits

I see that you have a newer version of the NVIDIA driver (the newest version that we've tested is 418.116); maybe there have also been some changes to nvidia-smi...

roscisz avatar Sep 01 '20 13:09 roscisz

Here it is: :)

[root@gpu1 ~]# nvidia-smi --query-gpu=name,fan.speed,utilization.gpu --format=csv,nounits
name, fan.speed [%], utilization.gpu [%]
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 100
Tesla P40, [Not Supported], 99
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0

[root@gpu2 ~]# nvidia-smi --query-gpu=name,fan.speed,utilization.gpu --format=csv,nounits
name, fan.speed [%], utilization.gpu [%]
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 99
Tesla P40, [Not Supported], 99

[root@gpu3 ~]# nvidia-smi --query-gpu=name,fan.speed,utilization.gpu --format=csv,nounits
name, fan.speed [%], utilization.gpu [%]
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0

Dubrzr avatar Sep 01 '20 14:09 Dubrzr

Everything looks fine here...

Could you try modifying line 73 in tensorhive/core/managers/TensorHiveManager.py and set:

monitors = []

and see if it helps?

roscisz avatar Sep 01 '20 14:09 roscisz

Yep! It indeed works better :tada: But gpu2 still doesn't:

{
  "gpu1": {
    "GPU": {
      "GPU-373e1376-924e-2bfc-1f62-064eaaccd10d": {
        "index": 3,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 1033
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 21886
          },
          "mem_util": {
            "unit": "%",
            "value": 75
          },
          "power": {
            "unit": "W",
            "value": "225.63"
          },
          "temp": 52,
          "utilization": {
            "unit": "%",
            "value": 100
          }
        },
        "name": "Tesla P40"
      },
      "GPU-3d85c94f-202d-12a7-4bbc-9649c35ad28c": {
        "index": 6,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 22909
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 10
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "10.02"
          },
          "temp": 25,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-6110d200-8918-9021-c962-74c486c34b0c": {
        "index": 1,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 8784
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 14135
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "49.36"
          },
          "temp": 36,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-71a449d4-02dc-ba5a-7006-59039b9895f9": {
        "index": 0,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 963
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 21956
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "49.52"
          },
          "temp": 28,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-77749c57-33ca-d5fd-705a-f037e459a3ba": {
        "index": 4,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 22909
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 10
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "9.63"
          },
          "temp": 23,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-c111d493-b7b5-89cf-19bb-5f25aad9b12f": {
        "index": 7,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 22909
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 10
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "9.54"
          },
          "temp": 22,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-e0f615fb-f81e-b1c8-6d10-60c160bc30f8": {
        "index": 5,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 4256
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 18663
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "50.32"
          },
          "temp": 30,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-ef774f0c-ed68-01bb-5b42-2b54ed482ba1": {
        "index": 2,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 1070
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 21849
          },
          "mem_util": {
            "unit": "%",
            "value": 74
          },
          "power": {
            "unit": "W",
            "value": "130.64"
          },
          "temp": 55,
          "utilization": {
            "unit": "%",
            "value": 99
          }
        },
        "name": "Tesla P40"
      }
    }
  },
  "gpu2": {
    "GPU": null
  },
  "gpu3": {
    "GPU": {
      "GPU-13f10865-a510-351c-65d5-1630c9d0a941": {
        "index": 5,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 22909
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 10
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "9.84"
          },
          "temp": 26,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-1d351c42-ad31-d8fd-79ad-3db6a5d375a2": {
        "index": 3,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 21220
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 1699
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "49.16"
          },
          "temp": 31,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-29d32502-07b6-21a4-b655-e06d2137d248": {
        "index": 6,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 22909
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 10
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "10.33"
          },
          "temp": 28,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-798ca8f5-3a20-5e7a-098a-79b749df27d6": {
        "index": 2,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 996
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 21923
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "50.06"
          },
          "temp": 35,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-7a4f2061-9efa-63ef-446f-15e385a2818d": {
        "index": 1,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 996
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 21923
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "48.88"
          },
          "temp": 40,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-a1469d32-daf1-8187-de12-c3bea645073a": {
        "index": 0,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 969
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 21950
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "49.87"
          },
          "temp": 31,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-dfb43b29-8262-e134-3c80-1ff955b2e0de": {
        "index": 7,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 22909
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 10
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "9.75"
          },
          "temp": 25,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-f9811f7e-662d-ba70-fc06-f8c713358fc6": {
        "index": 4,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 22909
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 10
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "9.85"
          },
          "temp": 25,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      }
    }
  }
}

Dubrzr avatar Sep 01 '20 14:09 Dubrzr

@Dubrzr do you have any new observations or hints?

If the data were missing for gpu3, we could at least suspect that the differing fan speed notation "[N/A]" is not parsed properly. But since nvidia-smi produces proper output on gpu2, we currently have no idea how to help...
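To illustrate the notation difference: gpu1 and gpu2 report "[Not Supported]" for the fan speed while gpu3 reports "[N/A]", and in the metrics above the former shows up as null while the latter is passed through as a string. A tolerant parser could map both placeholders to null. The sketch below is only an illustration of that idea, with made-up names (`PLACEHOLDERS`, `parse_query_gpu`); it is not TensorHive's actual parsing code.

```python
import csv
import io

# Both placeholder spellings seen in this thread are treated as "no value".
PLACEHOLDERS = {"[Not Supported]", "[N/A]"}


def parse_query_gpu(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,nounits` output into dicts."""
    reader = csv.reader(io.StringIO(csv_text.strip()), skipinitialspace=True)
    header = next(reader)
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        row = {}
        for column, raw in zip(header, record):
            value = raw.strip()
            if value in PLACEHOLDERS:
                row[column] = None
            elif value.isdigit():
                row[column] = int(value)
            else:
                row[column] = value
        rows.append(row)
    return rows


# One line from gpu1 and one from gpu3, copied from the outputs above.
example = """name, fan.speed [%], utilization.gpu [%]
Tesla P40, [Not Supported], 100
Tesla P40, [N/A], 0
"""

for gpu in parse_query_gpu(example):
    print(gpu)
```

With this normalization both hosts would yield a null fan speed, so the notation alone would not explain a missing 'GPU' entry.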

Which OS user account does TensorHive use? nvidia-smi works properly for the root user, but does it also work for the user account used by TH on gpu2?
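One way to check this from the machine running TensorHive is to run nvidia-smi non-interactively as that account. The sketch below assumes the node is reached over SSH with key-based login; the user name "tensorhive" and the helper names are placeholders, so substitute whatever account is configured for gpu2.

```python
import subprocess


def nvidia_smi_check_command(user, host):
    """Build an ssh command that lists GPUs on `host` as `user`, without prompts."""
    # BatchMode=yes makes ssh fail instead of asking for a password.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", "nvidia-smi", "-L"]


def nvidia_smi_works(user, host, timeout=30):
    """Return (ok, output) for running `nvidia-smi -L` as `user` on `host`."""
    result = subprocess.run(
        nvidia_smi_check_command(user, host),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0, (result.stdout or result.stderr).strip()


# Hypothetical usage; "tensorhive" is a placeholder account name:
# ok, output = nvidia_smi_works("tensorhive", "gpu2")
# print(ok, output)
```

If this fails for the TH account but works for root, a permissions or PATH difference in the non-interactive session would be worth looking into.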

roscisz avatar Sep 11 '20 11:09 roscisz