Does the Tesla T4 support per-process GPU memory monitoring?

Open kelonsen opened this issue 3 years ago • 2 comments

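Product: we currently use NVIDIA Tesla T4 GPUs. Problem: with nvidia-smi we can see the total GPU memory and the GPU memory used by each process: https://files.51wyq.cn/tmp/image001.png

With the GPU exporter we can also monitor how the GPU memory usage of each card changes over time: https://files.51wyq.cn/tmp/image002.png

Because several processes share a single GPU, we cannot track how the GPU memory usage of each individual process changes over time. Is there a tool that directly reports per-process GPU memory usage? We would like to plot the GPU memory usage of a given process as a graph. Is there an off-the-shelf tool for this (we currently run on Kubernetes)? Thanks! https://files.51wyq.cn/tmp/image003.png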

kelonsen avatar Aug 24 '22 11:08 kelonsen

@kelonsen did you look into the dcgm-exporter metrics collected for memory utilization? https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/etc/dcgm-exporter/dcp-metrics-included.csv Also, metrics are mapped to pod-level resources to track usage per pod (pod-name, namespace, device-id). This blog is a bit dated but still relevant and documentation here.

shivamerla avatar Aug 24 '22 16:08 shivamerla
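
To illustrate the pod-level mapping mentioned above, here is a minimal sketch that sums a framebuffer metric per pod from Prometheus-format text. It assumes the exporter exposes `DCGM_FI_DEV_FB_USED` with `namespace` and `pod` labels; the sample text, the label set, and the helper name `fb_used_per_pod` are hypothetical and only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical sample of exporter output; values and labels are made up.
SAMPLE = """\
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",namespace="default",pod="infer-a"} 2048
DCGM_FI_DEV_FB_USED{gpu="1",namespace="default",pod="infer-a"} 1024
DCGM_FI_DEV_FB_USED{gpu="0",namespace="batch",pod="train-b"} 4096
"""

# One metric sample per line: name{label="value",...} number
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def fb_used_per_pod(text, metric="DCGM_FI_DEV_FB_USED"):
    """Sum the given metric per (namespace, pod) from Prometheus text."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = LINE_RE.match(line)
        if m is None or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels")))
        pod = labels.get("pod")
        if not pod:
            continue  # sample is not attributed to any pod
        totals[(labels.get("namespace", ""), pod)] += float(m.group("value"))
    return dict(totals)


if __name__ == "__main__":
    for (namespace, pod), mib in sorted(fb_used_per_pod(SAMPLE).items()):
        print(f"{namespace}/{pod}: {mib:.0f} MiB")
```

Note that this attributes device-level memory to a pod. When several pods or processes share one GPU, a device-level metric alone would not split the usage between them.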

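Hello, and thank you for your answer! I just looked at https://github.com/NVIDIA/dcgm-exporter; it only covers the GPU devices themselves and does not collect monitoring data for the workload processes or pods running on them. What I actually want is to monitor how much GPU memory a specific process uses. Thanks!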

kelonsen avatar Aug 26 '22 03:08 kelonsen
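
Since nvidia-smi already shows per-process GPU memory, one possible workaround is to poll it and record the values as a time series. The sketch below is not an existing tool: it assumes `nvidia-smi` is on the PATH and uses its `--query-compute-apps` CSV output. The helper names `parse_compute_apps` and `sample_once` are hypothetical. The process list may be empty when run inside a container because of PID namespace isolation, so this would likely need to run on the host.

```python
import csv
import io
import subprocess

# CSV query for per-process memory; with "nounits" the value is in MiB.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(text):
    """Parse 'pid, process_name, used_memory' CSV rows into dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != 3:
            continue  # blank or malformed line
        pid, name, mem = (field.strip() for field in rec)
        if not pid.isdigit() or not mem.isdigit():
            continue  # e.g. memory reported as "[N/A]"
        rows.append({"pid": int(pid), "name": name, "used_mib": int(mem)})
    return rows


def sample_once():
    """Run nvidia-smi once and return the per-process memory rows."""
    out = subprocess.run(
        QUERY, capture_output=True, text=True, check=True
    ).stdout
    return parse_compute_apps(out)


if __name__ == "__main__":
    for row in sample_once():
        print(f"{row['pid']} {row['name']}: {row['used_mib']} MiB")
```

Calling `sample_once()` on a fixed interval and pushing the rows to whatever metrics backend is already in use (for example as a gauge labeled by PID) would give a per-process graph. Mapping a PID back to its pod is a separate step that this sketch does not cover.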