Integrate NVIDIA libraries for accelerators
Library (DCGM): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html
Branch for this issue: https://github.com/TACC/tacc_stats/tree/dcgm_support
Using this document to see what metrics Cazes wants: https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv
(From Cazes:) From the PCI section, I'd like to keep track of bytes moved over the PCI bus. We probably need to talk to NVIDIA on this one because I don't know how retries factor in. If they don't, then just keep track of bytes transmitted/received. It's also not clear which direction is transmit/receive.
PCIe counters (from the CSV above):
- DCGM_FI_DEV_PCIE_TX_THROUGHPUT (counter): Total number of bytes transmitted through PCIe TX (in KB) via NVML.
- DCGM_FI_DEV_PCIE_RX_THROUGHPUT (counter): Total number of bytes received through PCIe RX (in KB) via NVML.
- DCGM_FI_DEV_PCIE_REPLAY_COUNTER (counter): Total number of PCIe retries.
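A minimal sketch of how those counters could be pulled on a node, assuming DCGM's bundled Python bindings (the dcgm_fields constants and the DcgmReader helper) are importable; this is just to record the field IDs we'd need, not a finished collector:

```python
# Sketch only: assumes DCGM's Python bindings (e.g. under
# /usr/local/dcgm/bindings/python3) are on PYTHONPATH on the node.
import dcgm_fields
from DcgmReader import DcgmReader

# PCIe counter fields, descriptions per the dcp-metrics CSV above.
PCIE_COUNTER_FIELDS = [
    dcgm_fields.DCGM_FI_DEV_PCIE_TX_THROUGHPUT,   # bytes transmitted through PCIe TX (in KB), via NVML
    dcgm_fields.DCGM_FI_DEV_PCIE_RX_THROUGHPUT,   # bytes received through PCIe RX (in KB), via NVML
    dcgm_fields.DCGM_FI_DEV_PCIE_REPLAY_COUNTER,  # total number of PCIe retries
]

# updateFrequency is in microseconds; 1 s here is a placeholder sampling interval.
reader = DcgmReader(fieldIds=PCIE_COUNTER_FIELDS, updateFrequency=1000000)
latest = reader.GetLatestGpuValuesAsFieldIdDict()  # {gpuId: {fieldId: value}}
for gpu_id, fields in latest.items():
    print(gpu_id, fields)
```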
In a similar vein, I'd like to see what bandwidth we're getting across the PCIe bus:
- DCGM_FI_PROF_PCIE_TX_BYTES (gauge): The rate of data transmitted over the PCIe bus, including both protocol headers and data payloads, in bytes per second.
- DCGM_FI_PROF_PCIE_RX_BYTES (gauge): The rate of data received over the PCIe bus, including both protocol headers and data payloads, in bytes per second.
And finally, are the tensor cores being used:
- DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (gauge): Ratio of cycles the tensor (HMMA) pipe is active (in %).
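The DCGM_FI_PROF_* fields come from DCGM's profiling (DCP) module, so they may not be available on every GPU/driver combination we run; a sketch of sampling them the same way, with the same caveats as above:

```python
# Sketch only: DCGM_FI_PROF_* fields require DCGM's profiling (DCP) module.
import dcgm_fields
from DcgmReader import DcgmReader

PROF_FIELDS = [
    dcgm_fields.DCGM_FI_PROF_PCIE_TX_BYTES,       # PCIe TX rate, bytes/s (headers + payload)
    dcgm_fields.DCGM_FI_PROF_PCIE_RX_BYTES,       # PCIe RX rate, bytes/s (headers + payload)
    dcgm_fields.DCGM_FI_PROF_PIPE_TENSOR_ACTIVE,  # ratio of cycles the tensor (HMMA) pipe is active
]

reader = DcgmReader(fieldIds=PROF_FIELDS, updateFrequency=1000000)
for gpu_id, fields in reader.GetLatestGpuValuesAsFieldIdDict().items():
    print(gpu_id, fields)
```

For a quick interactive spot check, something like `dcgmi dmon -e 1009,1010,1004 -d 1000` (the numeric IDs these fields appear to have in dcgm_fields.h) should watch the same values.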
I don’t see a metric to measure memory bandwidth from the HBM.
These values should tell us how well the GPU is being utilized and whether or not the tensor cores are being used. I don't expect to see the tensor cores active unless it's a PyTorch or TensorFlow job.
We should also be able to tell if the GPU is spending more time moving data than calculating.
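As a rough illustration of that last point (a heuristic made up for this sketch, not anything DCGM defines): a sample where PCIe traffic is high while the tensor pipe is mostly idle would suggest the job is moving data rather than calculating.

```python
# Illustrative heuristic only; the thresholds are placeholders, not values from
# DCGM or from Cazes' request. Expects a {fieldId: value} dict for one GPU, as
# returned per GPU by the DcgmReader sketches above.
import dcgm_fields

def looks_data_movement_bound(fields, tensor_min=0.10, pcie_min_bytes_per_s=1e9):
    pcie_rate = (fields[dcgm_fields.DCGM_FI_PROF_PCIE_TX_BYTES]
                 + fields[dcgm_fields.DCGM_FI_PROF_PCIE_RX_BYTES])
    # Assumes the tensor-active ratio comes back on a 0-1 scale; adjust if DCGM
    # reports it as a percentage.
    tensor_active = fields[dcgm_fields.DCGM_FI_PROF_PIPE_TENSOR_ACTIVE]
    return pcie_rate >= pcie_min_bytes_per_s and tensor_active < tensor_min
```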