
The GPU capacity reported by the host and by `kubectl describe node` does not match

Yancey1989 opened this issue 7 years ago • 5 comments

  • kubectl describe node
Capacity:
 alpha.kubernetes.io/nvidia-gpu:	8
 cpu:					24
 memory:				264042760Ki
 pods:					110
Allocatable:
 alpha.kubernetes.io/nvidia-gpu:	8
 cpu:					24
 memory:				263940360Ki
 pods:					110
System Info:
  • Host info
$ nvidia-smi
Wed Aug 23 23:26:17 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 0000:04:00.0     Off |                    0 |
| N/A   34C    P0    53W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 0000:05:00.0     Off |                    0 |
| N/A   29C    P0    50W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 0000:06:00.0     Off |                    0 |
| N/A   33C    P0    51W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 0000:07:00.0     Off |                    0 |
| N/A   35C    P0    51W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P40           Off  | 0000:0B:00.0     Off |                    0 |
| N/A   33C    P0    51W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P40           Off  | 0000:0C:00.0     Off |                    0 |
| N/A   32C    P0    50W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P40           Off  | 0000:0E:00.0     Off |                    0 |
| N/A   29C    P0    49W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
  • Device info
$ ls /dev/nvidia*
/dev/nvidia0  /dev/nvidia1  /dev/nvidia2  /dev/nvidia3  /dev/nvidia4  /dev/nvidia5  /dev/nvidia6  /dev/nvidia7  /dev/nvidiactl  /dev/nvidia-uv
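
Note the mismatch: `kubectl describe node` reports a capacity of 8 GPUs and 8 device files exist (`/dev/nvidia0` through `/dev/nvidia7`), but `nvidia-smi` lists only 7 GPUs (indices 0-6).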

Yancey1989 · Aug 23 '17

I also created an issue on the NVIDIA developer forum: https://devtalk.nvidia.com/default/topic/1023105/general/nvidia-smi-and-dev-nvidia-does-not-match/

Yancey1989 · Aug 24 '17

One of the cards dropped out at runtime; dmesg should show some relevant messages.


pineking · Aug 24 '17

Right now kubelet's GPU detection is quite naive: it simply counts the devices matching /dev/nvidia[0-9]+. Detecting the GPU count based on runtime state will have to wait for v1.8 or v1.9.
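
As a quick sanity check, the two counts can be compared directly on the node. A minimal sketch using standard nvidia-smi query options; these exact commands are not from this thread:

$ ls /dev/nvidia[0-9]* | wc -l                                 # what kubelet matches: 8 on this node
$ nvidia-smi --query-gpu=index --format=csv,noheader | wc -l   # GPUs that actually initialized: 7 here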


pineking · Aug 24 '17

Thanks @pineking, dmesg does indeed contain initialization-failure logs:

[1903878.128627] NVRM: RmInitAdapter failed! (0x26:0xffff:1096)
[1903878.128678] NVRM: rm_init_adapter failed for device bearing minor number 6
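
(For reference, entries like these can be pulled out of the kernel log with something along the lines of `dmesg | grep NVRM`; the exact filter is a suggestion, not taken from this thread.)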

Have you run into a similar problem?

Yancey1989 · Aug 24 '17

We hit this before and a reboot fixed it. We haven't seen a card drop recently, so we never tracked down the root cause...

pineking · Aug 24 '17