experiment-impact-tracker
Getting error "Problem with output in nvidia-smi pmon -c 10"
Hi, we're getting this error in the log file:
experiment_impact_tracker.compute_tracker.ImpactTracker - ERROR - Encountered exception within power monitor thread!
experiment_impact_tracker.compute_tracker.ImpactTracker - ERROR -   File "/usr/local/lib/python3.7/dist-packages/experiment_impact_tracker/compute_tracker.py", line 105, in launch_power_monitor
    _sample_and_log_power(log_dir, initial_info, logger=logger)
  File "/usr/local/lib/python3.7/dist-packages/experiment_impact_tracker/compute_tracker.py", line 69, in _sample_and_log_power
    results = header["routing"]["function"](process_ids, logger=logger, region=initial_info['region']['id'], log_dir=log_dir)
  File "/usr/local/lib/python3.7/dist-packages/experiment_impact_tracker/gpu/nvidia.py", line 117, in get_nvidia_gpu_power
    raise ValueError('Problem with output in nvidia-smi pmon -c 10')
Is it an issue with our NVIDIA GPU? We are using a Tesla T4.
Could you let us know what output you get if you run this from the command line on the machine you're using? This will help narrow down the source of the error.
$ nvidia-smi pmon -c 10
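If it's easier to capture from a notebook cell, a minimal sketch like the following (standard library only; not necessarily how the tracker itself invokes nvidia-smi) should show the same output:

import subprocess

# Run the same command the tracker's error message refers to;
# "-c 10" asks pmon for ten one-second samples.
result = subprocess.run(
    ["nvidia-smi", "pmon", "-c", "10"],
    capture_output=True,
    text=True,
)
print(result.stdout)
print(result.stderr)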
I am using Google Colab, so it's not always the same GPU.
I ran subprocess.getoutput('nvidia-smi pmon -c 10'), but it only returned placeholder rows with no process samples:
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0          -     -     -     -     -     -   -
    0          -     -     -     -     -     -   -
    0          -     -     -     -     -     -   -
    0          -     -     -     -     -     -   -
    0          -     -     -     -     -     -   -
    0          -     -     -     -     -     -   -
    0          -     -     -     -     -     -   -
    0          -     -     -     -     -     -   -
    0          -     -     -     -     -     -   -
    0          -     -     -     -     -     -   -
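For reference, here is the quick check I used (my own sketch, not the tracker's parsing code) to confirm that pmon reports only placeholder rows and no actual per-process samples:

import subprocess

output = subprocess.getoutput("nvidia-smi pmon -c 10")
# Keep only non-comment lines; pmon prints "-" in the pid column when it
# has no process to sample on that GPU.
rows = [line.split() for line in output.splitlines() if not line.startswith("#")]
process_rows = [row for row in rows if len(row) > 1 and row[1] != "-"]
print(f"{len(process_rows)} per-process samples found")  # prints 0 here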
With subprocess.getoutput('nvidia-smi')
I obtained this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Hi, unfortunately Colab isn't fully supported right now because those instances don't always expose the hardware endpoints required to calculate energy use. We are working on solutions and will follow up if we have something that works.
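In the meantime, a rough pre-flight check you could run (just a sketch, not part of the tracker's API) is to probe whether the instance reports power draw and per-process utilization before enabling GPU tracking. Note that pmon only lists processes that are actively using the GPU, so run this while your workload is running:

import subprocess

def gpu_metrics_available():
    # Heuristic sketch: require that nvidia-smi reports a power reading and
    # that pmon returns at least one per-process sample row.
    power = subprocess.getoutput(
        "nvidia-smi --query-gpu=power.draw --format=csv,noheader"
    )
    if "N/A" in power or "not supported" in power.lower():
        return False
    pmon = subprocess.getoutput("nvidia-smi pmon -c 1")
    rows = [line.split() for line in pmon.splitlines() if not line.startswith("#")]
    return any(len(row) > 1 and row[1] != "-" for row in rows)

print(gpu_metrics_available())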