
Discussion on 【Hardware Monitoring】

Open Zeyi-Lin opened this issue 1 year ago • 4 comments

🤩 Features description [Please make it understandable to everyone]

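The monitoring metrics researchers care about most include:

  • GPU utilization
  • GPU memory usage
  • Memory usage
  • Disk utilization
  • Disk I/O
  • CPU memory
  • CPU utilization
  • GPU temperature
  • ...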
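
As a rough illustration of what one sample of the host-side metrics above might look like, here is a minimal sketch that uses only the Python standard library. The function name `sample_host_metrics` and the dictionary keys are hypothetical, not part of SwanLab. It covers disk utilization and a CPU load figure only; disk I/O, memory, and GPU metrics would need a dedicated library or vendor tool.

```python
import os
import shutil
import time


def sample_host_metrics(path=os.sep):
    """Return one snapshot of a few host-side metrics (hypothetical helper)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    snapshot = {
        "timestamp": time.time(),
        "disk_total_gb": usage.total / 1e9,
        "disk_used_gb": usage.used / 1e9,
        # Guard against a zero-sized filesystem to avoid division by zero.
        "disk_utilization_pct": 100.0 * usage.used / usage.total if usage.total else 0.0,
        "cpu_count": os.cpu_count(),
    }
    # os.getloadavg() exists only on Unix-like systems and may raise OSError.
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_avg_1min"] = os.getloadavg()[0]
        except OSError:
            pass
    return snapshot


if __name__ == "__main__":
    for key, value in sample_host_metrics().items():
        print(f"{key}: {value}")
```

Calling this on a timer and logging each snapshot would give a simple time series per metric.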

Granularity:

  • Hardware usage of the whole process
  • Hardware usage of each network module in the program (generally GPU memory related)
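
To make the per-module granularity concrete, here is a stdlib-only sketch of the same idea applied to host memory: it records how much traced Python heap memory each named section of a program retains. This is a stand-in technique using `tracemalloc`, not a GPU measurement; for GPU memory one would read the framework's own counters at the same points instead. The class name `SectionMemoryTracker` and the section names are hypothetical.

```python
import tracemalloc
from contextlib import contextmanager


class SectionMemoryTracker:
    """Record the change in traced Python heap memory for each named section."""

    def __init__(self):
        # Section name -> bytes allocated inside the section and still held at exit.
        self.deltas = {}

    @contextmanager
    def section(self, name):
        before, _peak = tracemalloc.get_traced_memory()
        try:
            yield
        finally:
            after, _peak = tracemalloc.get_traced_memory()
            self.deltas[name] = after - before


def demo():
    """Run two hypothetical stages and return their retained-memory deltas."""
    tracemalloc.start()
    tracker = SectionMemoryTracker()
    kept = []  # keep references so the allocations stay alive
    with tracker.section("encoder"):
        kept.append(bytearray(2_000_000))
    with tracker.section("decoder"):
        kept.append(bytearray(500_000))
    tracemalloc.stop()
    return tracker.deltas


if __name__ == "__main__":
    for name, delta in demo().items():
        print(f"{name}: {delta / 1e6:.2f} MB retained")
```

Measuring a before/after delta per section keeps the bookkeeping simple, at the cost of missing temporary allocations that are freed before the section ends.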

Tools I have used before:

  • gpustat: fine-grained monitoring of each user's GPU usage
  • PyTorch hook: https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.register_forward_hook
  • Profiler: https://pytorch.org/tutorials/beginner/profiler.html
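
For the GPU-side metrics, one possible approach is to poll `nvidia-smi` and parse its CSV output. The sketch below assumes an NVIDIA driver is installed and returns an empty list otherwise. It reports per-device figures only, not the per-user breakdown that gpustat gives. The helper names `parse_gpu_csv` and `query_gpus` are hypothetical.

```python
import shutil
import subprocess

# Fields passed to nvidia-smi's --query-gpu option.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total", "temperature.gpu"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts.

    Values are kept as strings because a field may be reported as unavailable.
    """
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip lines that do not match the expected column count
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus():
    """Return per-GPU metrics, or an empty list when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(QUERY_FIELDS)}",
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

Keeping the parser separate from the subprocess call makes it possible to test the parsing on a machine without a GPU.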

PS:

  • Package for computing FLOPs: https://github.com/facebookresearch/fvcore/blob/main/docs/flop_count.md
  • Article on FLOPs calculation: https://zhuanlan.zhihu.com/p/663566912

Zeyi-Lin, Jan 03 '24 13:01

I think disk utilization, disk I/O, CPU memory, and CPU utilization are also necessary.

ZhikangNiu, Jan 04 '24 02:01

> I think disk utilization, disk I/O, CPU memory, and CPU utilization are also necessary.

🍺 Got it, added to the top post.

Zeyi-Lin, Jan 04 '24 06:01

I think hardware temperature is also needed.

KashiwaByte, Jan 14 '24 08:01

https://github.com/grafana/grafana is a good reference example. Its features attracted me the first time I used it, but the technology it uses is not compatible with Python.

Puiching-Memory, Aug 11 '24 04:08