TensorHive
Missing 'GPU' entries in metrics
Hmmm, I cannot see any GPUs; even when I click on the "+" sign, nothing happens. I guess the HTTP request used is "/api/0.3.1/nodes/metrics"? If so, here are the contents of the response:
{
"gpu1.***.com": {
"CPU": {
"CPU_gpu1.***.com": {
"index": 0,
"metrics": {
"mem_free": {
"unit": "MiB",
"value": 1295
},
"mem_total": {
"unit": "MiB",
"value": 192925
},
"mem_used": {
"unit": "MiB",
"value": 152775
},
"utilization": {
"unit": "%",
"value": 6.15449
}
}
}
}
},
"gpu2.***.com": {
"CPU": null
},
"gpu3.***.com": {
"CPU": {
"CPU_gpu3.***.com": {
"index": 0,
"metrics": {
"mem_free": {
"unit": "MiB",
"value": 1039
},
"mem_total": {
"unit": "MiB",
"value": 192925
},
"mem_used": {
"unit": "MiB",
"value": 40694
},
"utilization": {
"unit": "%",
"value": 1.45833
}
}
}
}
}
}
I replaced all hostnames with fake ones.
Here are the free -m results on all my machines (the OS is CentOS 7 on each of them):
[root@gpu1 ~]# free -m
total used free shared buff/cache available
Mem: 192925 166473 7244 5040 19207 19968
Swap: 15931 14994 937
[root@gpu2 ~]# free -m
total used free shared buff/cache available
Mem: 192925 158857 1487 6334 32579 26228
Swap: 15931 7748 8183
[root@gpu3 ~]# free -m
total used free shared buff/cache available
Mem: 192925 40692 1036 28797 151196 122702
Swap: 15931 0 15931
Also nvidia-smi:
[root@gpu1 ~]# nvidia-smi
Tue Sep 1 14:23:26 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:14:00.0 Off | 0 |
| N/A 28C P0 49W / 250W | 21956MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P40 Off | 00000000:15:00.0 Off | 0 |
| N/A 36C P0 49W / 250W | 14135MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P40 Off | 00000000:39:00.0 Off | 0 |
| N/A 54C P0 163W / 250W | 21849MiB / 22919MiB | 90% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P40 Off | 00000000:3A:00.0 Off | 0 |
| N/A 52C P0 177W / 250W | 21886MiB / 22919MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla P40 Off | 00000000:88:00.0 Off | 0 |
| N/A 23C P8 9W / 250W | 10MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla P40 Off | 00000000:89:00.0 Off | 0 |
| N/A 30C P0 50W / 250W | 18663MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla P40 Off | 00000000:B1:00.0 Off | 0 |
| N/A 25C P8 10W / 250W | 10MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla P40 Off | 00000000:B2:00.0 Off | 0 |
| N/A 22C P8 9W / 250W | 10MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 4301 C ...er_conda/miniconda/envs/CCR/bin/python3 18787MiB |
| 0 44728 C ...se180025/.conda/envs/ner_env/bin/python 3157MiB |
| 1 16768 C ...er_conda/miniconda/envs/CCR/bin/python3 14125MiB |
| 2 39814 C ...se170020/.conda/envs/GPUtest/bin/python 21837MiB |
| 3 4301 C ...er_conda/miniconda/envs/CCR/bin/python3 619MiB |
| 3 23651 C ...se170020/.conda/envs/GPUtest/bin/python 21255MiB |
| 5 4301 C ...er_conda/miniconda/envs/CCR/bin/python3 18653MiB |
+-----------------------------------------------------------------------------+
[root@gpu2 ~]# nvidia-smi
Tue Sep 1 14:23:53 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:14:00.0 Off | 0 |
| N/A 28C P0 48W / 250W | 21926MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P40 Off | 00000000:15:00.0 Off | 0 |
| N/A 33C P0 49W / 250W | 306MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P40 Off | 00000000:39:00.0 Off | 0 |
| N/A 29C P0 49W / 250W | 306MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P40 Off | 00000000:3A:00.0 Off | 0 |
| N/A 28C P0 49W / 250W | 306MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla P40 Off | 00000000:88:00.0 Off | 0 |
| N/A 33C P0 48W / 250W | 306MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla P40 Off | 00000000:89:00.0 Off | 0 |
| N/A 25C P0 48W / 250W | 306MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla P40 Off | 00000000:B1:00.0 Off | 0 |
| N/A 52C P0 133W / 250W | 21867MiB / 22919MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla P40 Off | 00000000:B2:00.0 Off | 0 |
| N/A 47C P0 207W / 250W | 21867MiB / 22919MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 8757 C ...i/.conda/envs/pyenv_dotmani/bin/python3 147MiB |
| 0 8763 C ...i/.conda/envs/pyenv_dotmani/bin/python3 21767MiB |
| 1 8757 C ...i/.conda/envs/pyenv_dotmani/bin/python3 147MiB |
| 1 8763 C ...i/.conda/envs/pyenv_dotmani/bin/python3 147MiB |
| 2 8757 C ...i/.conda/envs/pyenv_dotmani/bin/python3 147MiB |
| 2 8763 C ...i/.conda/envs/pyenv_dotmani/bin/python3 147MiB |
| 3 8757 C ...i/.conda/envs/pyenv_dotmani/bin/python3 147MiB |
| 3 8763 C ...i/.conda/envs/pyenv_dotmani/bin/python3 147MiB |
| 4 8757 C ...i/.conda/envs/pyenv_dotmani/bin/python3 147MiB |
| 4 8763 C ...i/.conda/envs/pyenv_dotmani/bin/python3 147MiB |
| 5 8757 C ...i/.conda/envs/pyenv_dotmani/bin/python3 147MiB |
| 5 8763 C ...i/.conda/envs/pyenv_dotmani/bin/python3 147MiB |
| 6 8757 C ...i/.conda/envs/pyenv_dotmani/bin/python3 147MiB |
| 6 8763 C ...i/.conda/envs/pyenv_dotmani/bin/python3 147MiB |
| 6 27039 C ...se170020/.conda/envs/GPUtest/bin/python 21559MiB |
| 7 8757 C ...i/.conda/envs/pyenv_dotmani/bin/python3 147MiB |
| 7 8763 C ...i/.conda/envs/pyenv_dotmani/bin/python3 147MiB |
| 7 23491 C ...se170020/.conda/envs/GPUtest/bin/python 21559MiB |
+-----------------------------------------------------------------------------+
[root@gpu3 ~]# nvidia-smi
Tue Sep 1 14:24:13 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:14:00.0 Off | 0 |
| N/A 31C P0 49W / 250W | 21950MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P40 Off | 00000000:15:00.0 Off | 0 |
| N/A 40C P0 48W / 250W | 21923MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P40 Off | 00000000:39:00.0 Off | 0 |
| N/A 34C P0 49W / 250W | 21923MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P40 Off | 00000000:3A:00.0 Off | 0 |
| N/A 31C P0 49W / 250W | 1699MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla P40 Off | 00000000:88:00.0 Off | 0 |
| N/A 24C P8 9W / 250W | 10MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla P40 Off | 00000000:89:00.0 Off | 0 |
| N/A 26C P8 9W / 250W | 10MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla P40 Off | 00000000:B1:00.0 Off | 0 |
| N/A 28C P8 9W / 250W | 10MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla P40 Off | 00000000:B2:00.0 Off | 0 |
| N/A 25C P8 9W / 250W | 10MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 6714 C /opt/gpudb/core/bin/gpudb_cluster_cuda 501MiB |
| 0 6716 C+G /opt/gpudb/core/bin/gpudb_cluster_cuda 21430MiB |
| 1 6718 C+G /opt/gpudb/core/bin/gpudb_cluster_cuda 21904MiB |
| 2 6720 C+G /opt/gpudb/core/bin/gpudb_cluster_cuda 21904MiB |
| 3 4062 C /opt/conda/envs/rapids/bin/python 319MiB |
| 3 28474 C /opt/conda/envs/rapids/bin/python 377MiB |
| 3 31448 C /opt/conda/envs/rapids/bin/python 271MiB |
| 3 33893 C /opt/conda/envs/rapids/bin/python 229MiB |
| 3 47777 C /opt/conda/envs/rapids/bin/python 399MiB |
+-----------------------------------------------------------------------------+
Thanks for your help :)
Originally posted by @Dubrzr in https://github.com/roscisz/TensorHive/issues/286#issuecomment-684814071
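(Side note for readers reproducing this: a minimal sketch of querying the same endpoint and listing which resource types each node reports. The base URL below, including port 1111, and the absence of an Authorization header are assumptions; adjust both to your deployment.)

```python
import json
import urllib.request

# Assumption: the TensorHive API is served locally at this base URL and this
# endpoint is reachable without an auth header in your deployment.
BASE_URL = "http://localhost:1111/api/0.3.1"


def fetch_node_metrics():
    """Fetch /nodes/metrics and print which resource types each node reports."""
    with urllib.request.urlopen(BASE_URL + "/nodes/metrics") as resp:
        data = json.load(resp)
    for hostname, resources in data.items():
        present = [name for name, value in (resources or {}).items() if value]
        print("{}: {}".format(hostname, ", ".join(present) or "no resource data"))
    return data


if __name__ == "__main__":
    fetch_node_metrics()
```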
@Dubrzr could you please provide the output of the following command on your gpu2 server:
awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 100 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat);free -m | awk 'NR==2'
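(For context, a rough Python equivalent of what that one-liner measures; an illustration only, not TensorHive's monitoring code. It samples the aggregate 'cpu' line of /proc/stat twice, one second apart, and reports the non-idle share of the elapsed time, just like the awk expression; the trailing `free -m | awk 'NR==2'` part simply prints the Mem: row shown below.)

```python
import time


def read_cpu_times():
    """Return (busy, total) jiffies from the aggregate 'cpu ' line of /proc/stat.

    Mirrors the awk one-liner: busy = user + system, total = user + system + idle.
    """
    with open("/proc/stat") as f:
        fields = f.readline().split()
    user, system, idle = int(fields[1]), int(fields[3]), int(fields[4])
    busy = user + system
    return busy, busy + idle


def cpu_utilization(interval=1.0):
    """Percentage of non-idle CPU time between two samples, like the awk command."""
    busy1, total1 = read_cpu_times()
    time.sleep(interval)
    busy2, total2 = read_cpu_times()
    return (busy2 - busy1) * 100.0 / (total2 - total1)


if __name__ == "__main__":
    print(cpu_utilization())
```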
It looks like it works fine:
[root@gpu1 ~]# awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 100 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat);free -m | awk 'NR==2'
3.08784
Mem: 192925 166538 7237 5040 19150 19906
[root@gpu2 ~]# awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 100 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat);free -m | awk 'NR==2'
6.61519
Mem: 192925 159153 1300 6254 32471 26015
[root@gpu3 ~]# awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 100 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat);free -m | awk 'NR==2'
1.24818
Mem: 192925 40694 1031 28797 151198 122699
Thanks... wrong intuition then...
This is indeed the right endpoint. My sample output:
{
"ai": {
"CPU": {
"CPU_ai": {
"index": 0,
"metrics": {
"mem_free": {
"unit": "MiB",
"value": 30806
},
"mem_total": {
"unit": "MiB",
"value": 257868
},
"mem_used": {
"unit": "MiB",
"value": 44770
},
"utilization": {
"unit": "%",
"value": 5.62359
}
}
}
},
"GPU": {
"GPU-5db488ff-6728-fb07-93be-ee423d4ab086": {
"index": 3,
"metrics": {
"fan_speed": {
"unit": "%",
"value": null
},
"mem_free": {
"unit": "MiB",
"value": 1824
},
"mem_total": {
"unit": "MiB",
"value": 16128
},
"mem_used": {
"unit": "MiB",
"value": 14304
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "62.97"
},
"temp": 46,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla V100-DGXS-16GB",
"processes": [
{
"command": "python3",
"owner": "macsakow",
"pid": 22294
}
]
},
"GPU-6602a81f-14bf-4ee5-2992-45a7e519802b": {
"index": 1,
"metrics": {
"fan_speed": {
"unit": "%",
"value": null
},
"mem_free": {
"unit": "MiB",
"value": 16117
},
"mem_total": {
"unit": "MiB",
"value": 16128
},
"mem_used": {
"unit": "MiB",
"value": 11
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "39.85"
},
"temp": 44,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla V100-DGXS-16GB",
"processes": [
{
"command": "-",
"owner": null,
"pid": "-"
}
]
},
"GPU-cc813906-1a7c-e941-ccf8-1a40cd99ccfb": {
"index": 2,
"metrics": {
"fan_speed": {
"unit": "%",
"value": null
},
"mem_free": {
"unit": "MiB",
"value": 1824
},
"mem_total": {
"unit": "MiB",
"value": 16128
},
"mem_used": {
"unit": "MiB",
"value": 14304
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "52.27"
},
"temp": 45,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla V100-DGXS-16GB",
"processes": [
{
"command": "python3",
"owner": "macsakow",
"pid": 22233
}
]
},
"GPU-d1ae8368-a34f-3afc-14aa-01afbc0fa787": {
"index": 0,
"metrics": {
"fan_speed": {
"unit": "%",
"value": null
},
"mem_free": {
"unit": "MiB",
"value": 16113
},
"mem_total": {
"unit": "MiB",
"value": 16125
},
"mem_used": {
"unit": "MiB",
"value": 12
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "39.27"
},
"temp": 44,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla V100-DGXS-16GB",
"processes": [
{
"command": "-",
"owner": null,
"pid": "-"
}
]
}
}
}
}
@micmarty: any ideas? While the nvidia-smi output looks fine and there is no error message from GPUMonitor, there are no 'GPU' entries in the metrics.
@Dubrzr: and how about this command:
nvidia-smi --query-gpu=name,fan.speed,utilization.gpu --format=csv,nounits
I see that you have a newer NVIDIA driver version (the newest we have tested is 418.116); maybe there have also been some changes to nvidia-smi output...
Here it is: :)
[root@gpu1 ~]# nvidia-smi --query-gpu=name,fan.speed,utilization.gpu --format=csv,nounits
name, fan.speed [%], utilization.gpu [%]
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 100
Tesla P40, [Not Supported], 99
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
[root@gpu2 ~]# nvidia-smi --query-gpu=name,fan.speed,utilization.gpu --format=csv,nounits
name, fan.speed [%], utilization.gpu [%]
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 99
Tesla P40, [Not Supported], 99
[root@gpu3 ~]# nvidia-smi --query-gpu=name,fan.speed,utilization.gpu --format=csv,nounits
name, fan.speed [%], utilization.gpu [%]
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Everything looks fine here...
Could you try modifying line 73 in tensorhive/core/managers/TensorHiveManager.py so that it sets:
monitors = []
and see if it helps?
Yep! It indeed works better :tada: But gpu2 doesn't:
{
"gpu1": {
"GPU": {
"GPU-373e1376-924e-2bfc-1f62-064eaaccd10d": {
"index": 3,
"metrics": {
"fan_speed": {
"unit": "%",
"value": null
},
"mem_free": {
"unit": "MiB",
"value": 1033
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 21886
},
"mem_util": {
"unit": "%",
"value": 75
},
"power": {
"unit": "W",
"value": "225.63"
},
"temp": 52,
"utilization": {
"unit": "%",
"value": 100
}
},
"name": "Tesla P40"
},
"GPU-3d85c94f-202d-12a7-4bbc-9649c35ad28c": {
"index": 6,
"metrics": {
"fan_speed": {
"unit": "%",
"value": null
},
"mem_free": {
"unit": "MiB",
"value": 22909
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 10
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "10.02"
},
"temp": 25,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla P40"
},
"GPU-6110d200-8918-9021-c962-74c486c34b0c": {
"index": 1,
"metrics": {
"fan_speed": {
"unit": "%",
"value": null
},
"mem_free": {
"unit": "MiB",
"value": 8784
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 14135
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "49.36"
},
"temp": 36,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla P40"
},
"GPU-71a449d4-02dc-ba5a-7006-59039b9895f9": {
"index": 0,
"metrics": {
"fan_speed": {
"unit": "%",
"value": null
},
"mem_free": {
"unit": "MiB",
"value": 963
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 21956
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "49.52"
},
"temp": 28,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla P40"
},
"GPU-77749c57-33ca-d5fd-705a-f037e459a3ba": {
"index": 4,
"metrics": {
"fan_speed": {
"unit": "%",
"value": null
},
"mem_free": {
"unit": "MiB",
"value": 22909
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 10
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "9.63"
},
"temp": 23,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla P40"
},
"GPU-c111d493-b7b5-89cf-19bb-5f25aad9b12f": {
"index": 7,
"metrics": {
"fan_speed": {
"unit": "%",
"value": null
},
"mem_free": {
"unit": "MiB",
"value": 22909
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 10
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "9.54"
},
"temp": 22,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla P40"
},
"GPU-e0f615fb-f81e-b1c8-6d10-60c160bc30f8": {
"index": 5,
"metrics": {
"fan_speed": {
"unit": "%",
"value": null
},
"mem_free": {
"unit": "MiB",
"value": 4256
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 18663
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "50.32"
},
"temp": 30,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla P40"
},
"GPU-ef774f0c-ed68-01bb-5b42-2b54ed482ba1": {
"index": 2,
"metrics": {
"fan_speed": {
"unit": "%",
"value": null
},
"mem_free": {
"unit": "MiB",
"value": 1070
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 21849
},
"mem_util": {
"unit": "%",
"value": 74
},
"power": {
"unit": "W",
"value": "130.64"
},
"temp": 55,
"utilization": {
"unit": "%",
"value": 99
}
},
"name": "Tesla P40"
}
}
},
"gpu2": {
"GPU": null
},
"gpu3": {
"GPU": {
"GPU-13f10865-a510-351c-65d5-1630c9d0a941": {
"index": 5,
"metrics": {
"fan_speed": {
"unit": "%",
"value": "[N/A]"
},
"mem_free": {
"unit": "MiB",
"value": 22909
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 10
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "9.84"
},
"temp": 26,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla P40"
},
"GPU-1d351c42-ad31-d8fd-79ad-3db6a5d375a2": {
"index": 3,
"metrics": {
"fan_speed": {
"unit": "%",
"value": "[N/A]"
},
"mem_free": {
"unit": "MiB",
"value": 21220
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 1699
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "49.16"
},
"temp": 31,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla P40"
},
"GPU-29d32502-07b6-21a4-b655-e06d2137d248": {
"index": 6,
"metrics": {
"fan_speed": {
"unit": "%",
"value": "[N/A]"
},
"mem_free": {
"unit": "MiB",
"value": 22909
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 10
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "10.33"
},
"temp": 28,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla P40"
},
"GPU-798ca8f5-3a20-5e7a-098a-79b749df27d6": {
"index": 2,
"metrics": {
"fan_speed": {
"unit": "%",
"value": "[N/A]"
},
"mem_free": {
"unit": "MiB",
"value": 996
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 21923
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "50.06"
},
"temp": 35,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla P40"
},
"GPU-7a4f2061-9efa-63ef-446f-15e385a2818d": {
"index": 1,
"metrics": {
"fan_speed": {
"unit": "%",
"value": "[N/A]"
},
"mem_free": {
"unit": "MiB",
"value": 996
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 21923
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "48.88"
},
"temp": 40,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla P40"
},
"GPU-a1469d32-daf1-8187-de12-c3bea645073a": {
"index": 0,
"metrics": {
"fan_speed": {
"unit": "%",
"value": "[N/A]"
},
"mem_free": {
"unit": "MiB",
"value": 969
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 21950
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "49.87"
},
"temp": 31,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla P40"
},
"GPU-dfb43b29-8262-e134-3c80-1ff955b2e0de": {
"index": 7,
"metrics": {
"fan_speed": {
"unit": "%",
"value": "[N/A]"
},
"mem_free": {
"unit": "MiB",
"value": 22909
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 10
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "9.75"
},
"temp": 25,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla P40"
},
"GPU-f9811f7e-662d-ba70-fc06-f8c713358fc6": {
"index": 4,
"metrics": {
"fan_speed": {
"unit": "%",
"value": "[N/A]"
},
"mem_free": {
"unit": "MiB",
"value": 22909
},
"mem_total": {
"unit": "MiB",
"value": 22919
},
"mem_used": {
"unit": "MiB",
"value": 10
},
"mem_util": {
"unit": "%",
"value": 0
},
"power": {
"unit": "W",
"value": "9.85"
},
"temp": 25,
"utilization": {
"unit": "%",
"value": 0
}
},
"name": "Tesla P40"
}
}
}
}
@Dubrzr do you have any new observations or hints?
If the data were missing for gpu3, we would at least have a lead: the differing fan speed "[N/A]" notation might not be parsed properly. But with proper nvidia-smi output on gpu2, we currently have no idea how to help...
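(To make that hypothesis concrete: a minimal sketch, not TensorHive's actual parser, of reading the nvidia-smi CSV query used earlier while treating both the "[Not Supported]" and "[N/A]" notations as missing values. If a parser only recognized one of the two notations, the affected node could end up without usable GPU metrics.)

```python
import csv
import io
import subprocess

# Both notations observed in this thread are mapped to None.
MISSING = {"[Not Supported]", "[N/A]", "N/A"}


def query_gpus():
    """Run the nvidia-smi query from above and return one dict per GPU."""
    cmd = ["nvidia-smi", "--query-gpu=name,fan.speed,utilization.gpu",
           "--format=csv,nounits"]
    out = subprocess.check_output(cmd, text=True)
    reader = csv.reader(io.StringIO(out))
    header = [column.strip() for column in next(reader)]
    rows = []
    for line in reader:
        values = [None if v.strip() in MISSING else v.strip() for v in line]
        rows.append(dict(zip(header, values)))
    return rows


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```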
What is the OS user account used by TensorHive? nvidia-smi works properly for the root user, but does it also work for the account used by TH on gpu2?
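(A quick way to check that, run on gpu2 while logged in as the account TensorHive uses there; how that account is determined is deployment-specific, so treat this purely as a sketch.)

```python
import getpass
import subprocess

# Same query as above; the point is only whether it succeeds for the current user.
cmd = ["nvidia-smi", "--query-gpu=name,fan.speed,utilization.gpu",
       "--format=csv,nounits"]
result = subprocess.run(cmd, capture_output=True, text=True)
print("user:", getpass.getuser(), "| return code:", result.returncode)
print(result.stdout if result.returncode == 0 else result.stderr)
```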