Support for AMD GPUs
AMD GPUs are seeing increased adoption, and ROCm has good compatibility layers with PyTorch. ROCm-SMI also appears to have all the energy-related management APIs we need: measuring power and energy, and setting the power limit and GPU frequency.
First, we should evaluate whether the ROCm-SMI APIs behave like their NVML counterparts and whether AMD GPUs show time/energy behavior similar to that of NVIDIA GPUs.
Then, implementation-wise, ROCm-SMI lacks a Python package (something like nvidia-ml-py, which NVIDIA officially provides). We should probably package one ourselves (much like pynvml, a community-supported Python binding for NVML) until AMD provides an official binding.
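For the evaluation step, one possible sanity check is to compare the cumulative energy counter against energy obtained by integrating sampled power over the same window. The sketch below only shows the arithmetic; `integrate_power` and `relative_gap` are hypothetical helper names, and the samples are made-up numbers rather than real measurements. How the samples are collected (ROCm-SMI, NVML, or something else) is deliberately left out.

```python
from typing import Sequence, Tuple


def integrate_power(samples: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal integration of (timestamp_s, power_w) samples into joules."""
    energy = 0.0
    # Pair each sample with its successor and add the area of the trapezoid.
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy


def relative_gap(counter_joules: float, integrated_joules: float) -> float:
    """Relative difference between the counter reading and the integrated value."""
    return abs(counter_joules - integrated_joules) / max(abs(counter_joules), 1e-12)


if __name__ == "__main__":
    # Made-up samples: (seconds, watts). Not real GPU data.
    samples = [(0.0, 100.0), (1.0, 100.0), (2.0, 200.0)]
    integrated = integrate_power(samples)
    print(f"integrated energy: {integrated} J")
    # Hypothetical counter delta over the same window.
    print(f"relative gap: {relative_gap(260.0, integrated):.3f}")
```

If the gap is consistently small on both vendors, that would suggest the power and energy APIs are at least mutually consistent. It would not by itself show that the two vendors' counters measure the same thing.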
Progress
- [x] Abstraction layer over GPUs (#46)
- [x] AMDSMI + ROCm 6.0 implementations of management APIs (#57)
- [x] Track down power/energy API issue (https://github.com/ROCm/amdsmi/issues/22)
- [ ] The cumulative energy counter works on MI200, MI210, MI250, and MI300X but not on MI100 (https://github.com/ROCm/amdsmi/issues/38) -- Should we issue a warning after checking the GPU board info?
- [ ] Test measurement and optimization on AMD GPUs
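As a rough illustration of the open MI100 item, the sketch below shows one way a vendor-neutral GPU interface could warn when the board name indicates a model without a working cumulative energy counter. Everything here is an assumption for discussion: `GPU`, `FakeGPU`, `check_energy_support`, and `MODELS_WITHOUT_ENERGY_COUNTER` are hypothetical names, not the abstraction from #46 or any AMDSMI API, and the board-name strings are illustrative.

```python
import abc
import warnings

# Assumption: models reported in the issue as lacking a working energy counter.
MODELS_WITHOUT_ENERGY_COUNTER = ("MI100",)


class GPU(abc.ABC):
    """Hypothetical vendor-neutral interface; real backends would wrap NVML or AMDSMI."""

    @abc.abstractmethod
    def name(self) -> str:
        """Return the board/product name reported by the driver."""


class FakeGPU(GPU):
    """Stand-in backend so the sketch runs without any GPU library installed."""

    def __init__(self, board_name: str) -> None:
        self._board_name = board_name

    def name(self) -> str:
        return self._board_name


def has_energy_counter(board_name: str) -> bool:
    """Return False if the board name matches a model on the unsupported list."""
    upper = board_name.upper()
    return not any(model in upper for model in MODELS_WITHOUT_ENERGY_COUNTER)


def check_energy_support(gpu: GPU) -> bool:
    """Warn (once per call) when the cumulative energy counter is likely unusable."""
    if has_energy_counter(gpu.name()):
        return True
    warnings.warn(
        f"{gpu.name()}: cumulative energy counter is not expected to work; "
        "energy may need to be derived from power samples instead.",
        RuntimeWarning,
    )
    return False


if __name__ == "__main__":
    for board in ("AMD Instinct MI100", "AMD Instinct MI250"):
        print(board, "->", check_energy_support(FakeGPU(board)))
```

Substring matching on the board name is the simplest option but is brittle. If the driver exposes a more stable identifier, matching on that would probably be safer.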