Can't track GPU metrics from a MIG GPU instance

Open PeterSulcs opened this issue 3 years ago • 1 comments

🐛 Bug

When running training on an NVIDIA A100 80GB in a 1/7 Multi-Instance GPU (MIG) configuration, Aim does not capture GPU metrics.

To reproduce

Run any simple training (I used the GAN example) on a MIG GPU and observe that no metrics show up in the associated experiment in Aim.
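
Before reproducing, it can help to confirm that the process actually sees a MIG slice rather than a full GPU. Below is a minimal sketch, assuming `nvidia-smi` is on the PATH and that `nvidia-smi -L` lists MIG devices on indented lines beginning with `MIG`. The helper names `find_mig_devices` and `list_gpus` are illustrative and not part of Aim.

```python
import shutil
import subprocess


def find_mig_devices(listing: str) -> list:
    """Return the lines of `nvidia-smi -L` style output that describe MIG devices."""
    return [line.strip() for line in listing.splitlines()
            if line.strip().startswith("MIG ")]


def list_gpus() -> str:
    """Return `nvidia-smi -L` output, or an empty string if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return ""
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else ""


if __name__ == "__main__":
    mig = find_mig_devices(list_gpus())
    print(f"MIG devices visible: {len(mig)}")
    for line in mig:
        print(line)
```

If this reports zero MIG devices, the missing metrics are probably unrelated to MIG.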

Expected behavior

Aim should automatically capture GPU temperature, utilization, and memory, as it does when using a full GPU. Confirmed working with a non-MIG A100.

Environment

  • Aim version: 3.13
  • Python version: 3.8.12
  • pip version: 22.1.2
  • OS: client on macOS, server containerized

Additional context

Discussed on public slack channel with Mihran:

Hey @Peter S. I’ve done some digging and unfortunately that’s the case: there’s currently no support for MIG in py3nvml. We’ll try to find another way to retrieve the GPU stats for MIG in the future, as it will take some time to find another suitable library or implement our own. Would you mind opening an issue on GitHub for this?

PeterSulcs • Aug 31 '22 16:08

Hi @PeterSulcs, thanks for opening this issue. How much of a blocker is this?

SGevorg • Sep 02 '22 05:09

Hey @PeterSulcs! Sorry for such a long delay. Could you please try out this version of Aim and see if it successfully collects data for a MIG GPU instance?

pip install aim==3.15.0.dev3

mihran113 • Nov 18 '22 20:11

Good morning @mihran113, sorry for the delay. Would you be open to providing a 3.17.4+ version of this branch with the MIG GPU changes for us to test? The grpc additions that were made after 3.17.4 are needed now in our environment.

jennifer12121 • Jul 10 '23 16:07

Hey @jennifer12121! We'll ship a patch with the change in the next couple of days.

mihran113 • Jul 11 '23 11:07

Hey @jennifer12121! Sorry for the delay. The fix for this was shipped with v3.17.6. Let me know if it works as expected, so I can close the issue.
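
To check whether an environment already has that release, the installed version can be compared against 3.17.6. Below is a minimal sketch using only the standard library. `release_tuple` and `has_mig_fix` are hypothetical helper names, and the parser keeps only the leading numeric components, so a pre-release such as `3.15.0.dev3` is treated as `3.15.0`. The check therefore cannot tell whether a development build carries the change.

```python
import re
from importlib import metadata

MIN_VERSION = (3, 17, 6)  # release named in this thread as containing the fix


def release_tuple(version: str) -> tuple:
    """Parse the leading numeric part of a version string, e.g. '3.17.6' -> (3, 17, 6)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def has_mig_fix(version: str) -> bool:
    """Return True if the version is at or above the release named in this thread."""
    return release_tuple(version) >= MIN_VERSION


if __name__ == "__main__":
    try:
        installed = metadata.version("aim")
    except metadata.PackageNotFoundError:
        print("aim is not installed")
    else:
        status = "includes the fix" if has_mig_fix(installed) else "predates the fix"
        print(installed, status)
```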

mihran113 • Nov 14 '23 22:11

Closing this issue: there has been no clear feedback, but the work has been done.

SGevorg • Feb 17 '24 20:02