Can't track GPU metrics from a MIG GPU instance
🐛 Bug
When running training on an NVIDIA A100 80GB in a 1/7 Multi-Instance GPU (MIG) configuration, Aim does not capture GPU metrics.
To reproduce
Run any simple training (I used the GAN example) on a MIG GPU and observe that no metrics show up in the associated experiment in Aim.
Expected behavior
Aim should automatically capture GPU temperature, utilization, and memory, as it does when using a full GPU. Confirmed working with a non-MIG A100.
Environment
- Aim Version 3.13
- Python version 3.8.12
- pip version 22.1.2
- OS (e.g., Linux): client on macOS, server containerized
Additional context
Discussed on the public Slack channel with Mihran:
Hey @Peter S. I’ve done some digging and unfortunately that’s the case: there’s currently no support for MIG in py3nvml. We’ll try to find another way to retrieve the GPU stats for MIG in the future, as it will take some time to find another suitable library or implement our own. Would you mind opening an issue on GitHub for this?
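For anyone who needs a stopgap, here is a minimal sketch of reading per-instance memory through NVML's MIG device handles. It assumes the `pynvml` module from the `nvidia-ml-py` package is installed, which is not the library mentioned above and not necessarily what Aim ended up using. The helper name `collect_mig_memory` is made up for illustration. Only memory is queried, because other counters may not be reported for MIG device handles.

```python
def collect_mig_memory():
    """Return per-MIG-instance memory readings, or [] if NVML is unavailable.

    Hypothetical helper for illustration; not part of Aim.
    """
    try:
        import pynvml  # assumption: provided by the nvidia-ml-py package
    except ImportError:
        return []

    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []  # no driver or NVML library on this machine

    readings = []
    try:
        for gpu_index in range(pynvml.nvmlDeviceGetCount()):
            gpu = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
            try:
                current_mode, _pending_mode = pynvml.nvmlDeviceGetMigMode(gpu)
            except pynvml.NVMLError:
                continue  # this GPU does not support MIG
            if current_mode != pynvml.NVML_DEVICE_MIG_ENABLE:
                continue  # MIG is supported but switched off

            for mig_index in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
                try:
                    mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, mig_index)
                except pynvml.NVMLError:
                    continue  # no instance configured in this slot
                memory = pynvml.nvmlDeviceGetMemoryInfo(mig)
                readings.append(
                    {
                        "gpu": gpu_index,
                        "mig": mig_index,
                        "used_bytes": memory.used,
                        "total_bytes": memory.total,
                    }
                )
    finally:
        pynvml.nvmlShutdown()
    return readings


if __name__ == "__main__":
    for reading in collect_mig_memory():
        print(reading)
```

On a machine without MIG enabled (or without NVML at all) the function simply returns an empty list.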
Hi @PeterSulcs thanks for opening this issue. How much of a blocker is this?
Hey @PeterSulcs! Sorry for such a long delay. Could you please try out this version of Aim and see if it successfully collects data for a MIG GPU instance?
pip install aim==3.15.0.dev3
Good morning @mihran113, sorry for the delay. Would you be open to providing a 3.17.4+ version of this branch with the MIG GPU changes for us to test? The grpc additions that were made after 3.17.4 are needed now in our environment.
Hey @jennifer12121! We'll ship a patch in the upcoming couple of days with the change.
Hey @jennifer12121! Sorry for the delay.
The fix for this was shipped with v3.17.6.
Let me know if it works as expected, so I can close the issue.
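Before re-testing, it may help to confirm that the environment actually has a release at or above v3.17.6. The sketch below compares only the leading numeric parts of the version string; the helper names `release_tuple` and `has_mig_fix` are hypothetical and not part of Aim.

```python
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string):
    """Turn '3.17.6' or '3.15.0.dev3' into a tuple of its leading integers."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev3'
        parts.append(int(piece))
    return tuple(parts)


def has_mig_fix(installed, minimum="3.17.6"):
    """True if the installed version is at or above the release with the fix.

    Note: a pre-release of the minimum itself (e.g. '3.17.6.dev1') also passes,
    because suffixes are ignored in this simplified comparison.
    """
    return release_tuple(installed) >= release_tuple(minimum)


if __name__ == "__main__":
    try:
        installed = version("aim")
    except PackageNotFoundError:
        print("aim is not installed in this environment")
    else:
        print(f"aim {installed}: MIG fix present = {has_mig_fix(installed)}")
```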
Closing this issue, as there has been no clear feedback but the work has been done.