SwanLab
Discussion on 【Hardware Monitoring】
🤩 Feature description [Please make it understandable to everyone]
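The monitoring metrics researchers care about most mainly include:
- GPU utilization
- GPU memory usage
- Memory usage
- Disk utilization
- Disk I/O
- CPU memory
- CPU utilization
- GPU temperature
- ...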
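As a rough illustration of how a few of the metrics above could be sampled, here is a minimal sketch, not SwanLab's implementation. It assumes an NVIDIA GPU with the `nvidia-smi` CLI on the PATH and falls back to an empty list otherwise. The function names (`parse_gpu_csv`, `sample_gpus`, `sample_disk`) and the dictionary keys are hypothetical. CPU utilization, memory usage and disk I/O are left out because they would probably need a third-party library such as psutil.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
GPU_FIELDS = "utilization.gpu,memory.used,memory.total,temperature.gpu"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 4:
            continue  # skip lines that do not match the expected layout
        try:
            util, used, total, temp = (float(p) for p in parts)
        except ValueError:
            continue  # e.g. "[N/A]" on devices that do not report a field
        gpus.append({
            "gpu_util_percent": util,
            "gpu_mem_used_mib": used,
            "gpu_mem_total_mib": total,
            "gpu_temp_celsius": temp,
        })
    return gpus


def sample_gpus():
    """Return one dict per GPU, or an empty list if nvidia-smi is unavailable."""
    cmd = ["nvidia-smi", f"--query-gpu={GPU_FIELDS}", "--format=csv,noheader,nounits"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=5)
    except (OSError, subprocess.SubprocessError):
        return []
    return parse_gpu_csv(out.stdout)


def sample_disk(path="."):
    """Return disk utilization for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    percent = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {"disk_used_percent": percent}


if __name__ == "__main__":
    print(sample_gpus())
    print(sample_disk())
```

Calling these functions on a timer from a background thread would give a time series per metric. The sampling interval is a design choice this sketch leaves open.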
Granularity:
- Hardware usage of the whole process
- Hardware usage of each network module in the program (usually GPU memory related)
Tools I have used before:
- gpustat: fine-grained monitoring of each user's GPU usage
- PyTorch hook: https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.register_forward_hook
- Profiler: https://pytorch.org/tutorials/beginner/profiler.html
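To make the per-module idea concrete, here is a minimal sketch of how `register_forward_hook` could record GPU memory after each module's forward pass. It is an illustration only, not a proposed SwanLab API. The helper name `attach_memory_hooks`, the toy model and the `(name, bytes)` record format are assumptions. On a machine without CUDA the sketch records 0 for every module.

```python
import torch
from torch import nn


def attach_memory_hooks(model, records):
    """Register a forward hook on every leaf module that logs allocated CUDA memory."""
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # only hook leaf modules, to keep the log short

        def hook(mod, inputs, output, name=name):
            # memory_allocated() reports bytes currently held by tensors on the GPU;
            # without CUDA we record 0 instead.
            allocated = torch.cuda.memory_allocated() if torch.cuda.is_available() else 0
            records.append((name, allocated))

        handles.append(module.register_forward_hook(hook))
    return handles


# Toy model used only for demonstration.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

records = []
handles = attach_memory_hooks(model, records)
with torch.no_grad():
    model(torch.randn(8, 16, device=device))

# Remove the hooks once measurement is done so they do not slow down training.
for handle in handles:
    handle.remove()

for name, allocated in records:
    print(f"module {name}: {allocated} bytes allocated")
```

The difference between consecutive entries gives a rough estimate of the memory each module's output adds. It will not match the Profiler's numbers exactly, because the caching allocator and freed temporaries also affect the counter.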
P.S.:
- Package for computing FLOPs: https://github.com/facebookresearch/fvcore/blob/main/docs/flop_count.md
- Article on FLOPs calculation: https://zhuanlan.zhihu.com/p/663566912
I think disk utilization, disk I/O, CPU memory, and CPU utilization are also necessary.
🍺 Got it, added to the top post.
I think hardware temperature is also needed.
https://github.com/grafana/grafana is a good reference example. I was drawn to this software's features the first time I used it, but the technology it uses is not compatible with Python.