
Monitor nvidia-smi output to see GPU resource consumption

Open samhodge-aiml opened this issue 3 months ago • 2 comments

Is your feature request related to a problem? Please describe.

I need to see how much VRAM and GPU compute a process in a container is using, and to keep a historical record in a SQL table, so that I can keep narrowing the gap between resources allocated and resources consumed.
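To make the "historical record in a SQL table" part concrete, here is a minimal sketch using Python's built-in `sqlite3`. The table name `gpu_samples`, its columns, and the helper functions are hypothetical and are not part of watchme; the allocated figure is assumed to come from whatever the container was granted.

```python
import sqlite3
import time


def open_db(path=":memory:"):
    """Open (or create) a SQLite database with a hypothetical gpu_samples table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS gpu_samples (
               ts REAL NOT NULL,
               gpu_index INTEGER NOT NULL,
               util_percent REAL,
               mem_used_mib REAL,
               mem_allocated_mib REAL
           )"""
    )
    return conn


def record_sample(conn, gpu_index, util_percent, mem_used_mib,
                  mem_allocated_mib, ts=None):
    """Append one timestamped GPU measurement to the table."""
    conn.execute(
        "INSERT INTO gpu_samples VALUES (?, ?, ?, ?, ?)",
        (time.time() if ts is None else ts, gpu_index,
         util_percent, mem_used_mib, mem_allocated_mib),
    )
    conn.commit()


def peak_usage_ratio(conn, gpu_index):
    """Return the highest used/allocated memory ratio seen for one GPU."""
    row = conn.execute(
        "SELECT MAX(1.0 * mem_used_mib / mem_allocated_mib) "
        "FROM gpu_samples WHERE gpu_index = ?",
        (gpu_index,),
    ).fetchone()
    return row[0]
```

A peak ratio well below 1.0 over a long run would suggest the allocation could be reduced.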

Describe the solution you'd like

I would like to be able to wrap the output of nvidia-smi and have it reported in the same dictionary as the rest of the watchme metrics, or alongside it as a sidecar.
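As a rough illustration of wrapping nvidia-smi into a dictionary, the sketch below calls `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` and parses the CSV output. The helper names `parse_gpu_csv` and `query_gpus` and the choice of fields are assumptions for illustration, not watchme's API, and the figures are per GPU rather than per process.

```python
import csv
import io
import subprocess

# Query fields passed to `nvidia-smi --query-gpu`; this selection is an assumption.
FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text, fields=FIELDS):
    """Turn nvidia-smi CSV text (noheader, nounits) into one dict per GPU."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    # Blank lines come back as empty rows and are skipped.
    return [dict(zip(fields, (v.strip() for v in row))) for row in rows if row]


def query_gpus(fields=FIELDS):
    """Run nvidia-smi and return its per-GPU readings as a list of dicts."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out, fields)
```

The values are left as strings so they can be merged into another metrics dictionary and converted by the caller as needed; `query_gpus()` requires nvidia-smi on the `PATH`.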

Describe alternatives you've considered

Use https://github.com/petronny/nvsmi and dump its output into a dictionary at the same time as the watchme decorator runs.

Additional context

Getting computation to closely match the resources allocated is a problem with commercial value. Anyone who uses GPUs should care how heavily these resources are occupied, because buying and renting them is not cheap.

samhodge-aiml, Mar 13 '24 06:03

Sorry, I found the correct documentation:

https://github.com/vsoch/watchme/blob/f209d3d4bf99a25cd2dcaeaa2431cf3ecfe68585/docs/_docs/watcher-tasks/gpu.md#use-as-a-decorator

samhodge-aiml, Mar 14 '24 06:03

Hey @samhodge-aiml! This seems like a cool idea (and simple to implement), but I'm not sure I'll have time to work on it soon. Too many cool things going on <3

vsoch, Mar 15 '24 05:03