PaddleCloud
The GPU Capacity of hosts and `kubectl describe node` does not match
- `kubectl describe node`
Capacity:
alpha.kubernetes.io/nvidia-gpu: 8
cpu: 24
memory: 264042760Ki
pods: 110
Allocatable:
alpha.kubernetes.io/nvidia-gpu: 8
cpu: 24
memory: 263940360Ki
pods: 110
System Info:
- Host info
$ nvidia-smi
Wed Aug 23 23:26:17 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 0000:04:00.0 Off | 0 |
| N/A 34C P0 53W / 250W | 0MiB / 22912MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P40 Off | 0000:05:00.0 Off | 0 |
| N/A 29C P0 50W / 250W | 0MiB / 22912MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P40 Off | 0000:06:00.0 Off | 0 |
| N/A 33C P0 51W / 250W | 0MiB / 22912MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P40 Off | 0000:07:00.0 Off | 0 |
| N/A 35C P0 51W / 250W | 0MiB / 22912MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla P40 Off | 0000:0B:00.0 Off | 0 |
| N/A 33C P0 51W / 250W | 0MiB / 22912MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla P40 Off | 0000:0C:00.0 Off | 0 |
| N/A 32C P0 50W / 250W | 0MiB / 22912MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla P40 Off | 0000:0E:00.0 Off | 0 |
| N/A 29C P0 49W / 250W | 0MiB / 22912MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
- Device info
$ ls /dev/nvidia*
/dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3 /dev/nvidia4 /dev/nvidia5 /dev/nvidia6 /dev/nvidia7 /dev/nvidiactl /dev/nvidia-uv
I also created an issue on the NVIDIA developer forum: https://devtalk.nvidia.com/default/topic/1023105/general/nvidia-smi-and-dev-nvidia-does-not-match/
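A quick check that helps narrow this down (assuming `lspci` is available on the host; these commands are only a diagnostic sketch) is to compare the GPUs the driver enumerates against the NVIDIA devices visible on the PCI bus. A card that shows up in `lspci` but not in `nvidia-smi` failed driver initialization rather than disappearing from the bus:

```sh
# GPUs the NVIDIA driver successfully brought up (7 on this host)
nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader

# NVIDIA devices physically present on the PCI bus
lspci | grep -i nvidia
```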
It looks like one of the cards was lost at runtime; dmesg should have some information about it.
Right now kubelet's GPU detection is quite crude: it simply counts the devices matching /dev/nvidia[0-9]+. Detecting the GPU count from runtime state will have to wait for v1.8 or v1.9.
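As a minimal illustration of that difference (plain shell, not the actual kubelet code): the name-based count still reports 8 on this host, because the device node of the failed adapter remains in /dev, while the driver only enumerates the 7 GPUs it initialized:

```sh
# Count by device-node name, i.e. match /dev/nvidia[0-9]+ -- reports 8
ls /dev | grep -cE '^nvidia[0-9]+$'

# Count GPUs the driver actually initialized -- reports 7
nvidia-smi --query-gpu=index --format=csv,noheader | wc -l
```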
Thanks @pineking, dmesg does indeed contain initialization-failure logs:
[1903878.128627] NVRM: RmInitAdapter failed! (0x26:0xffff:1096)
[1903878.128678] NVRM: rm_init_adapter failed for device bearing minor number 6
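For anyone checking their own hosts, a simple filter pulls these lines out of a long dmesg (NVRM messages, plus Xid errors, which typically accompany a GPU fault at runtime):

```sh
# NVIDIA kernel-module init failures and runtime GPU faults
dmesg | grep -Ei 'NVRM|Xid'
```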
Have you run into similar problems?
We hit this before and a reboot fixed it. The card drop hasn't shown up again recently, so we never tracked down the root cause...