
Support for AMD GPUs

Open · jaywonchung opened this issue 2 years ago · 1 comment

AMD GPUs are seeing increased adoption. ROCm has nice compatibility layers with PyTorch, too. Plus, ROCm-SMI (apparently) has all the energy-related management APIs we need -- measuring power and energy, and setting the power limit and GPU frequency.

First, we should evaluate whether the ROCm-SMI APIs behave like their NVML counterparts, and whether AMD GPUs show time/energy behavior similar to that of NVIDIA GPUs.
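One way to run that comparison is to poll instantaneous power, integrate it over time, and check the result against the delta of the cumulative energy counter over the same window. The sketch below only covers the arithmetic; the helper names `integrate_power` and `relative_gap` are hypothetical, and the samples are assumed to have already been collected through whichever SMI binding is in use.

```python
from typing import Sequence


def integrate_power(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Trapezoidal integration of power samples (watts) over time (seconds).

    Returns the estimated energy in joules.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    energy_j = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of the trapezoid between two consecutive samples.
        energy_j += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy_j


def relative_gap(counter_j: float, integrated_j: float) -> float:
    """Relative difference between the energy counter delta and the integrated power."""
    if integrated_j == 0:
        raise ValueError("integrated energy is zero; the window is too short or idle")
    return abs(counter_j - integrated_j) / integrated_j
```

If the gap is consistently small on both vendors for the same workload, the two energy APIs can probably be treated as interchangeable; a large or drifting gap would suggest that the counter or the power reading has different semantics (for example, a different averaging window).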

Then, implementation-wise, ROCm-SMI lacks a Python package (something like nvidia-ml-py, which NVIDIA officially provides). We should probably package that ourselves (much like pynvml, which is a community-supported Python binding for NVML) until AMD provides an official binding.

jaywonchung · Oct 08 '23 19:10

Progress

  • [x] Abstraction layer over GPUs (#46)
  • [x] AMDSMI + ROCm 6.0 implementations of management APIs (#57)
  • [x] Track down power/energy API issue (https://github.com/ROCm/amdsmi/issues/22)
  • [ ] The cumulative energy counter works on MI200, MI210, MI250, and MI300X but not on MI100 (https://github.com/ROCm/amdsmi/issues/38) -- Should we issue a warning after checking the GPU board info?
  • [ ] Test measurement and optimization on AMD GPUs
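As a rough illustration of the abstraction layer and the proposed MI100 warning, here is a minimal sketch. Everything in it is hypothetical: the `GPU` base class, `FakeAMDGPU`, and `check_energy_counter` are illustrative names rather than the project's actual interfaces, and the list of affected boards is taken only from the item above, not from an authoritative support matrix. A real backend would obtain the board name from the vendor's management library.

```python
import re
import warnings
from abc import ABC, abstractmethod

# Assumption: boards without a working cumulative energy counter,
# based solely on the progress list above.
_NO_ENERGY_COUNTER = re.compile(r"\bMI100\b", re.IGNORECASE)


class GPU(ABC):
    """Vendor-neutral view of a single GPU (hypothetical interface)."""

    @abstractmethod
    def board_name(self) -> str:
        """Human-readable board name, e.g. 'AMD Instinct MI250'."""

    @abstractmethod
    def power_w(self) -> float:
        """Instantaneous power draw in watts."""


class FakeAMDGPU(GPU):
    """Stand-in backend so the sketch runs without hardware."""

    def __init__(self, name: str, power_w: float) -> None:
        self._name = name
        self._power_w = power_w

    def board_name(self) -> str:
        return self._name

    def power_w(self) -> float:
        return self._power_w


def check_energy_counter(gpu: GPU) -> bool:
    """Return True if the energy counter is assumed usable; warn otherwise."""
    name = gpu.board_name()
    if _NO_ENERGY_COUNTER.search(name):
        warnings.warn(
            f"{name}: cumulative energy counter may not work; "
            "consider falling back to polling power.",
            RuntimeWarning,
        )
        return False
    return True
```

A caller could run `check_energy_counter` once at initialization and switch to power polling when it returns `False`, which keeps the vendor-specific quirk out of the measurement code.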

jaywonchung · May 02 '24 15:05