watchme
Monitor nvidia-smi output to see GPU resource consumption
Is your feature request related to a problem? Please describe.
I need to see how much VRAM and GPU compute a process in a container is using, and to keep a historical record in a SQL table, so I can keep narrowing the gap between resources allocated and resources consumed.
Describe the solution you'd like
I would like to be able to wrap the output of nvidia-smi and have it come out in the same dictionary as the rest of the watchme metrics, or in a sidecar-type structure alongside them.
Describe alternatives you've considered
Use https://github.com/petronny/nvsmi and dump its output into a dictionary at the same time as the watchme decorator runs.
Additional context
Matching computation closely to the resources allocated is a problem with commercial value. Anyone who uses GPUs should be interested in how heavily these resources are occupied, because buying and renting them is not cheap.
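For illustration, here is a rough sketch of the kind of wrapper described above: it queries nvidia-smi, parses the CSV output into dictionaries, and appends each sample to a SQL table. This is not watchme's API. The helper names (`parse_gpu_csv`, `query_gpus`, `store_samples`), the `gpu_usage` table, and the choice of SQLite are all assumptions made for the example. It reports per-GPU totals rather than per-process figures.

```python
import csv
import io
import sqlite3
import subprocess
import time

# Fields requested from nvidia-smi via --query-gpu.
QUERY_FIELDS = ["index", "name", "memory.used", "memory.total", "utilization.gpu"]
NUMERIC_FIELDS = ("index", "memory.used", "memory.total", "utilization.gpu")


def _to_int(value):
    """Convert a field to int; return None for values such as '[N/A]'."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for values in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not values:
            continue  # skip blank lines
        row = dict(zip(QUERY_FIELDS, (v.strip() for v in values)))
        for key in NUMERIC_FIELDS:
            row[key] = _to_int(row[key])
        rows.append(row)
    return rows


def query_gpus():
    """Run nvidia-smi and return the parsed samples (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


def store_samples(conn, samples, timestamp=None):
    """Append samples to a hypothetical `gpu_usage` table (memory in MiB)."""
    timestamp = time.time() if timestamp is None else timestamp
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gpu_usage ("
        "ts REAL, gpu_index INTEGER, name TEXT, "
        "memory_used_mib INTEGER, memory_total_mib INTEGER, "
        "utilization_pct INTEGER)"
    )
    conn.executemany(
        "INSERT INTO gpu_usage VALUES (?, ?, ?, ?, ?, ?)",
        [
            (
                timestamp,
                s["index"],
                s["name"],
                s["memory.used"],
                s["memory.total"],
                s["utilization.gpu"],
            )
            for s in samples
        ],
    )
    conn.commit()
```

On a machine with a GPU, something like `store_samples(sqlite3.connect("gpu.db"), query_gpus())` called alongside the decorated function would build up the historical record. Keeping the parsing separate from the subprocess call means it can be tested without a GPU present.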
Sorry, I found the correct documentation:
https://github.com/vsoch/watchme/blob/f209d3d4bf99a25cd2dcaeaa2431cf3ecfe68585/docs/_docs/watcher-tasks/gpu.md#use-as-a-decorator
hey @samhodge-aiml ! This seems like a cool idea (and simple to implement) but I'm not sure I'll have time to work on it soon - too many cool things going on <3