Add missing nvidia-smi calls to plugin (nvidia_smi.go)
Feature Request
Add the following metrics, which are missing from the nvidia-smi plugin: pci_bandwidth, clocks.current.graphics, clocks.current.sm, clocks.current.memory, clocks.current.video, encoder_utilization, driver_version, vbios_version, ecc.mode.current, count, serial, pcie.link.gen.current, display_mode, display_active, encoder.stats.sessionCount, encoder.stats.averageFps, encoder.stats.averageLatency
Proposal: Add the missing nvidia-smi API calls to the existing plugin.
Current behavior: Several values that nvidia-smi can report are not collected by the plugin.
Desired behavior: Add the missing calls so more information can be gathered.
Use case: To make the GPU monitoring more complete.
Related #5119 #5532
I can help debug these if you need a tester.
I don't have a way to spin up builds to test these calls, but they were pulled directly from the nvidia-smi application.
I wonder if we ought to move to the XML output. Could you run nvidia-smi -q -u -x --dtd and attach the output?
I guess we don't want the -u flag. Could you run it again without it: nvidia-smi -q -x --dtd
nvidia.smi.take2.txt Yep we have data now.
Attached is the output from a Turing 1660 GPU, in case the newest GPUs have more API calls. nvidia.smi.turing.txt
@danielnelson how are we doing? Anything else you need from us?
Edit:
I believe some of this was encompassed by https://github.com/influxdata/telegraf/pull/5885/commits/2325c9734d9e51047b98f91745b09618925b0de2 but not everything.
It looks like encoder_util and decoder_util under utilization are still missing, as well as a few others. After looking at what can be queried with nvidia-smi --help-query-gpu, it looks like getting all of the information would require a switch to XML parsing, as the --query-gpu output is lacking.
@aaronjwood thank you for what's been added so far.
I think we have collected the important information to where someone could work on this, but I don't personally have this on my schedule right now. Will probably depend on a community contribution to be completed.
@danielnelson I am working with a Tesla M60. I could try adding the missing metrics and testing the code. I'm not a skilled developer though, so I'd need some mentoring.
@DamianRemotr Some of this data could probably be added without switching to the XML format. It might be worth doing this first, since it only requires modifying a few strings in the plugin and updating the tests. Switching over to the XML format would be a larger overhaul of the plugin: not a massive amount of work, but a bit harder. I will try to help out in either case as much as I can.
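To illustrate the "modify a few strings" route, here is a minimal sketch in Go. It is not the plugin's actual code: queryArgs, parseLine, and extraFields are hypothetical names, the field names are copied from the list at the top of this issue, and the sample line is made up for illustration rather than taken from a real card. It assumes the installed driver accepts each field in --query-gpu.

```go
package main

import (
	"fmt"
	"strings"
)

// extraFields lists some of the properties requested in this issue.
// Assumption: the installed nvidia-smi accepts each name in --query-gpu.
var extraFields = []string{
	"clocks.current.graphics",
	"clocks.current.sm",
	"clocks.current.memory",
	"clocks.current.video",
	"driver_version",
	"vbios_version",
	"pcie.link.gen.current",
}

// queryArgs builds the nvidia-smi arguments for a CSV query of the given fields.
func queryArgs(fields []string) []string {
	return []string{
		"--query-gpu=" + strings.Join(fields, ","),
		"--format=csv,noheader,nounits",
	}
}

// parseLine maps one CSV output line onto the field names, in query order.
func parseLine(fields []string, line string) (map[string]string, error) {
	values := strings.Split(line, ",")
	if len(values) != len(fields) {
		return nil, fmt.Errorf("expected %d values, got %d", len(fields), len(values))
	}
	out := make(map[string]string, len(fields))
	for i, name := range fields {
		out[name] = strings.TrimSpace(values[i])
	}
	return out, nil
}

func main() {
	fmt.Println(queryArgs(extraFields))

	// Made-up sample line, for illustration only.
	line := "1380, 1380, 5005, 1240, 430.26, 86.04.21.00.01, 3"
	values, err := parseLine(extraFields, line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(values["clocks.current.graphics"], values["driver_version"])
}
```

Adding a metric in this style means appending one name to the list and one value to the test fixture. The drawback raised in this thread still applies: only the properties that --query-gpu exposes can be collected this way.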
@danielnelson The CSV output is severely limited. There are significantly more metrics available via XML.
@danielnelson I'll look at adding some code this weekend and return with eventual questions on Monday. Can I catch you on IRC?
Can I catch you on IRC?
No, but you can open a draft PR and we can discuss there. I might be a bit less available next week due to InfluxDays and related activities.
The plugin has been updated to use the XML output in #6639. This should make these new fields easier to support.
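As a rough sketch of why the XML output helps, the requested values can be read by declaring struct fields with encoding/xml tags. This is not the code from #6639: the type names and the parseSMI helper are hypothetical, the element names are assumed to match what nvidia-smi -q -x emits on your driver, and the embedded sample is hand-written with placeholder values.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// smiLog mirrors the parts of the XML document used here.
// Assumption: element names match the output of nvidia-smi -q -x.
type smiLog struct {
	DriverVersion string   `xml:"driver_version"`
	GPUs          []smiGPU `xml:"gpu"`
}

type smiGPU struct {
	VBIOSVersion string `xml:"vbios_version"`
	Utilization  struct {
		EncoderUtil string `xml:"encoder_util"`
		DecoderUtil string `xml:"decoder_util"`
	} `xml:"utilization"`
	Clocks struct {
		Graphics string `xml:"graphics_clock"`
		SM       string `xml:"sm_clock"`
		Memory   string `xml:"mem_clock"`
		Video    string `xml:"video_clock"`
	} `xml:"clocks"`
	EncoderStats struct {
		SessionCount   int `xml:"session_count"`
		AverageFPS     int `xml:"average_fps"`
		AverageLatency int `xml:"average_latency"`
	} `xml:"encoder_stats"`
}

// sample is a hand-written, trimmed document with placeholder values.
const sample = `<nvidia_smi_log>
  <driver_version>430.26</driver_version>
  <gpu id="00000000:01:00.0">
    <vbios_version>86.04.21.00.01</vbios_version>
    <utilization>
      <encoder_util>4 %</encoder_util>
      <decoder_util>0 %</decoder_util>
    </utilization>
    <clocks>
      <graphics_clock>1380 MHz</graphics_clock>
      <sm_clock>1380 MHz</sm_clock>
      <mem_clock>5005 MHz</mem_clock>
      <video_clock>1240 MHz</video_clock>
    </clocks>
    <encoder_stats>
      <session_count>1</session_count>
      <average_fps>30</average_fps>
      <average_latency>5</average_latency>
    </encoder_stats>
  </gpu>
</nvidia_smi_log>`

// parseSMI decodes the XML document into smiLog.
func parseSMI(data []byte) (smiLog, error) {
	var log smiLog
	err := xml.Unmarshal(data, &log)
	return log, err
}

func main() {
	log, err := parseSMI([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for i, g := range log.GPUs {
		fmt.Println(i, log.DriverVersion, g.Clocks.Graphics, g.Utilization.EncoderUtil, g.EncoderStats.AverageFPS)
	}
}
```

Values such as clocks and utilization arrive with unit suffixes ("MHz", "%"), so they are kept as strings here and would need to be stripped and converted before being written as numeric fields.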
It's been a while since this has had attention. @DamianRemotr were you able to put some code together to add the fields? Are you able to put it up in a PR? Thanks!
Hi,
We have several NVIDIA Tesla P4 and Tesla T4 GPUs. Those cards are used by remote workers. The P4s are part of a Citrix Hypervisor setup and are used as vGPUs on Windows virtual machines. One card is, for example, divided into 8 virtual parts. Under Windows you can see this card as "GRID P4-1Q".
It would also be nice if you could add information about the current license status of the GPUs. Yes, graphics cards need to be licensed today. If the license is not available, performance drops to a minimum, so it would be useful to monitor the current license state.
I will upload a tesla_p4.xml and a tesla_t4.xml for you into the testdata folder. The tesla_p4.xml contains the specific part for the license information which looks like this:
<vgpu_software_licensed_product>
<licensed_product_name>NVIDIA RTX Virtual Workstation</licensed_product_name>
<license_status>Licensed (Expiry: 2022-9-22 6:57:30 GMT)</license_status>
</vgpu_software_licensed_product>
Can you please add the necessary parts into the plugin?
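For whoever picks this up, a minimal sketch of reading that fragment in Go might look like the following. The type and function names (licensedProduct, parseLicense, splitStatus) are hypothetical, and the sketch parses the fragment on its own. Where the element sits inside the full nvidia-smi document, and which other status strings can occur, are assumptions that should be checked against the attached files.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// licensedProduct mirrors the fragment quoted above.
type licensedProduct struct {
	Name   string `xml:"licensed_product_name"`
	Status string `xml:"license_status"`
}

// fragment is the XML quoted in this comment.
const fragment = `<vgpu_software_licensed_product>
  <licensed_product_name>NVIDIA RTX Virtual Workstation</licensed_product_name>
  <license_status>Licensed (Expiry: 2022-9-22 6:57:30 GMT)</license_status>
</vgpu_software_licensed_product>`

// parseLicense decodes the license fragment.
func parseLicense(data []byte) (licensedProduct, error) {
	var p licensedProduct
	err := xml.Unmarshal(data, &p)
	return p, err
}

// splitStatus separates the state from the optional expiry text.
// Assumption: an expiry, when present, appears as " (Expiry: ...)".
func splitStatus(status string) (string, string) {
	state, rest, found := strings.Cut(status, " (Expiry: ")
	if !found {
		return strings.TrimSpace(status), ""
	}
	expiry := strings.TrimSuffix(strings.TrimSpace(rest), ")")
	return strings.TrimSpace(state), expiry
}

func main() {
	p, err := parseLicense([]byte(fragment))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	state, expiry := splitStatus(p.Status)
	fmt.Println(p.Name, "|", state, "|", expiry)
}
```

Keeping the state separate from the expiry text would allow the state to be used as a tag or simple string field for alerting, with the expiry reported alongside it.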
Sorry for the pull requests. I have never worked with git before. I will just add the examples as text files here. tesla_p4.xml.txt tesla_t4.xml.txt
The encoder_stats_average_fps metric doesn't seem to be working on Windows 11 anymore. This is my output. output.txt