
Add missing nvidia-smi calls to plugin (nvidia_smi.go)

Open Dazog opened this issue 6 years ago • 18 comments

Feature Request

Add the following metrics to the nvidia-smi plugin that are missing:

pci_bandwidth
clocks.current.graphics
clocks.current.sm
clocks.current.memory
clocks.current.video
encoder_utilization
driver_version
vbios_version
ecc.mode.current
count
serial
pcie.link.gen.current
display_mode
display_active
encoder.stats.sessionCount
encoder.stats.averageFps
encoder.stats.averageLatency

Opening a feature request kicks off a discussion.

Proposal: Add Missing Nvidia-smi api calls to existing plugin.

Current behavior: Missing calls that can be utilized.

Desired behavior: Add missing calls so more information can be gathered.

Use case: [Why is this important (helps with prioritizing requests)]

To make the GPU monitoring more complete.

Dazog avatar Mar 10 '19 17:03 Dazog

Related #5119 #5532

danielnelson avatar Mar 11 '19 20:03 danielnelson

I can help debug these if you need a tester.

I don't have a way to spin up builds to test these calls, but they were pulled directly from the nvidia-smi application.

Dazog avatar Mar 11 '19 21:03 Dazog

I wonder if we ought to move to the XML output, could you run nvidia-smi -q -u -x --dtd and attach the output?

danielnelson avatar Mar 11 '19 22:03 danielnelson

Nvidia.smi.output.txt

As requested

Dazog avatar Mar 11 '19 22:03 Dazog

I guess we don't want the -u flag, could you run it again without it: nvidia-smi -q -x --dtd

danielnelson avatar Mar 11 '19 22:03 danielnelson

nvidia.smi.take2.txt Yep we have data now.

Dazog avatar Mar 11 '19 23:03 Dazog

Attached is the output from a Turing 1660 GPU to add to this, in case the newest GPUs have more API calls. nvidia.smi.turing.txt

Dazog avatar Mar 30 '19 01:03 Dazog

@danielnelson how are we doing? Anything else you need from us?

Edit: I believe some of this was covered by https://github.com/influxdata/telegraf/pull/5885/commits/2325c9734d9e51047b98f91745b09618925b0de2 but not everything. Under utilization, encoder_util and decoder_util are still missing, as are a few others. Judging by the fields listed by nvidia-smi --help-query-gpu, getting all of the information would require a switch to XML parsing, since the query output lacks some of it.

@aaronjwood thank you for what's been added so far

dirtycajunrice avatar Jun 22 '19 14:06 dirtycajunrice

I think we have collected the important information to where someone could work on this, but I don't personally have this on my schedule right now. Will probably depend on a community contribution to be completed.

danielnelson avatar Jun 24 '19 18:06 danielnelson

> I think we have collected the important information to where someone could work on this, but I don't personally have this on my schedule right now. Will probably depend on a community contribution to be completed.

@danielnelson I am working with a Tesla M60. I could try adding the missing metrics and testing the code. I'm not a skilled developer though, so I'd need some mentoring.

DamianRemotr avatar Sep 24 '19 08:09 DamianRemotr

@DamianRemotr Some of this data could probably be added without switching to the XML format. It might be worth doing this first, since it only requires modifying a few strings in the plugin and updating the tests. Switching over to the XML format would be a larger overhaul of the plugin: not a massive amount of work, but a bit harder. I will try to help out in either case as much as I can.
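For anyone picking this up, here is a minimal sketch of what the CSV route amounts to. It is not the plugin's actual code: `queryFields`, `buildArgs` and `parseCSVLine` are hypothetical names, the sample line is made up, and the field names are taken from the list in this issue, so check them against `nvidia-smi --help-query-gpu` on your driver.

```go
package main

import (
	"fmt"
	"strings"
)

// queryFields lists the --query-gpu properties to request. The names come
// from the list in this issue; verify them with `nvidia-smi --help-query-gpu`.
var queryFields = []string{
	"clocks.current.graphics",
	"clocks.current.sm",
	"clocks.current.memory",
	"clocks.current.video",
	"driver_version",
	"vbios_version",
}

// buildArgs returns the nvidia-smi arguments for a CSV query of fields.
func buildArgs(fields []string) []string {
	return []string{
		"--query-gpu=" + strings.Join(fields, ","),
		"--format=csv,noheader,nounits",
	}
}

// parseCSVLine maps one line of CSV output onto the requested field names.
// It assumes values contain no commas, so a plain split is enough.
func parseCSVLine(fields []string, line string) (map[string]string, error) {
	cols := strings.Split(line, ",")
	if len(cols) != len(fields) {
		return nil, fmt.Errorf("expected %d columns, got %d", len(fields), len(cols))
	}
	out := make(map[string]string, len(fields))
	for i, f := range fields {
		out[f] = strings.TrimSpace(cols[i])
	}
	return out, nil
}

func main() {
	fmt.Println(buildArgs(queryFields))

	// Hand-written sample line for illustration, not real nvidia-smi output.
	row, err := parseCSVLine(queryFields, "1350, 1350, 5005, 1215, 418.43, 86.04.3C.00.02")
	if err != nil {
		panic(err)
	}
	fmt.Println(row["clocks.current.graphics"], row["driver_version"])
}
```

Adding a metric in this style means appending one string to the field list and handling one more column, which is why it is the smaller change of the two.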

danielnelson avatar Sep 24 '19 17:09 danielnelson

@danielnelson The CSV output is severely limited. There are significantly more metrics available via XML.

dirtycajunrice avatar Sep 24 '19 20:09 dirtycajunrice

@danielnelson I'll look at adding some code this weekend and come back with any questions on Monday. Can I catch you on IRC?

DamianRemotr avatar Sep 25 '19 07:09 DamianRemotr

> Can I catch you on IRC?

No, but you can open a draft PR and we can discuss there. I might be a bit less available next week due to InfluxDays and related activities.

danielnelson avatar Sep 26 '19 00:09 danielnelson

The plugin has been updated to use the XML output in #6639, which should make these new fields easier to support.
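As a rough illustration of why the XML output helps, a new field becomes one more struct member. This sketch is not the code from #6639: the `smiLog` and `gpu` types and the helper names are hypothetical, the sample document is hand-written, and the element names are assumed from the `nvidia-smi -q -x` output discussed above (`encoder_util` and `decoder_util` under `utilization`).

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strconv"
	"strings"
)

// smiLog mirrors the part of the XML document this sketch cares about.
// Elements that are not listed here are ignored by encoding/xml.
type smiLog struct {
	GPUs []gpu `xml:"gpu"`
}

type gpu struct {
	ProductName string `xml:"product_name"`
	Utilization struct {
		EncoderUtil string `xml:"encoder_util"`
		DecoderUtil string `xml:"decoder_util"`
	} `xml:"utilization"`
}

// sample is a hand-written stand-in for real `nvidia-smi -q -x` output.
const sample = `<nvidia_smi_log>
  <gpu id="00000000:01:00.0">
    <product_name>Example GPU</product_name>
    <utilization>
      <encoder_util>3 %</encoder_util>
      <decoder_util>0 %</decoder_util>
    </utilization>
  </gpu>
</nvidia_smi_log>`

// parseSMI decodes the XML document into smiLog.
func parseSMI(data []byte) (*smiLog, error) {
	var parsed smiLog
	if err := xml.Unmarshal(data, &parsed); err != nil {
		return nil, err
	}
	return &parsed, nil
}

// parsePercent turns a value such as "3 %" into 3. Values without a
// leading number (for example "N/A") return an error.
func parsePercent(s string) (int, error) {
	parts := strings.Fields(s)
	if len(parts) == 0 {
		return 0, fmt.Errorf("empty value")
	}
	return strconv.Atoi(parts[0])
}

func main() {
	parsed, err := parseSMI([]byte(sample))
	if err != nil {
		panic(err)
	}
	for _, g := range parsed.GPUs {
		enc, err := parsePercent(g.Utilization.EncoderUtil)
		if err != nil {
			panic(err)
		}
		fmt.Println(g.ProductName, "encoder_util:", enc)
	}
}
```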

danielnelson avatar Nov 13 '19 00:11 danielnelson

It's been a while since this has had attention. @DamianRemotr were you able to put some code together to add the fields? Are you able to put it up in a PR? Thanks!

reimda avatar Aug 22 '22 18:08 reimda

Hi,

We have several NVIDIA Tesla P4 and Tesla T4 GPUs. Those cards are used by remote workers. The P4s are part of a Citrix Hypervisor setup and are used as vGPUs on Windows virtual machines. One card is, for example, divided into 8 virtual parts. Under Windows you can see this card as "GRID P4-1Q".

It would also be nice if you could add information about the current license status of the GPUs. Yes, graphics cards need to be licensed today. If the license is not available performance drops to a minimum. Therefore it would be nice to monitor the current license state.

I will upload a tesla_p4.xml and a tesla_t4.xml for you into the testdata folder. The tesla_p4.xml contains the specific part for the license information which looks like this:

	<vgpu_software_licensed_product>
		<licensed_product_name>NVIDIA RTX Virtual Workstation</licensed_product_name>
		<license_status>Licensed (Expiry: 2022-9-22 6:57:30 GMT)</license_status>
	</vgpu_software_licensed_product>

Can you please add the necessary parts into the plugin?
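To make the request concrete, here is a minimal sketch of reading that fragment in Go. It is only an illustration, not plugin code: `licensedProduct`, `parseProduct` and `parseLicenseStatus` are hypothetical names, and the sketch assumes the status string always looks like the one above (a state word, optionally followed by "(Expiry: ... GMT)").

```go
package main

import (
	"encoding/xml"
	"fmt"
	"regexp"
	"strings"
	"time"
)

// licensedProduct maps the <vgpu_software_licensed_product> element.
type licensedProduct struct {
	Name   string `xml:"licensed_product_name"`
	Status string `xml:"license_status"`
}

// fragment is the XML snippet quoted in this comment.
const fragment = `<vgpu_software_licensed_product>
	<licensed_product_name>NVIDIA RTX Virtual Workstation</licensed_product_name>
	<license_status>Licensed (Expiry: 2022-9-22 6:57:30 GMT)</license_status>
</vgpu_software_licensed_product>`

// statusRe splits "Licensed (Expiry: <timestamp>)" into state and timestamp.
// The expiry part is optional.
var statusRe = regexp.MustCompile(`^(.*?)(?:\s*\(Expiry: (.+)\))?$`)

// parseProduct decodes the licensed product element.
func parseProduct(data []byte) (licensedProduct, error) {
	var p licensedProduct
	err := xml.Unmarshal(data, &p)
	return p, err
}

// parseLicenseStatus returns the license state and, if present, the expiry.
// Assumption: the timestamp is always given in GMT, so the suffix is
// dropped and the result is interpreted as UTC.
func parseLicenseStatus(status string) (string, time.Time, error) {
	m := statusRe.FindStringSubmatch(strings.TrimSpace(status))
	if m == nil {
		return "", time.Time{}, fmt.Errorf("unrecognized license status %q", status)
	}
	if m[2] == "" {
		return m[1], time.Time{}, nil
	}
	expiry, err := time.Parse("2006-1-2 15:4:5", strings.TrimSuffix(m[2], " GMT"))
	return m[1], expiry, err
}

func main() {
	p, err := parseProduct([]byte(fragment))
	if err != nil {
		panic(err)
	}
	state, expiry, err := parseLicenseStatus(p.Status)
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Name, state, expiry.Format(time.RFC3339))
}
```

Splitting the status this way would let the state be reported as a tag or string field and the expiry as a timestamp, so an alert can fire before the license runs out.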

kuriosity121 avatar Sep 21 '22 09:09 kuriosity121

Sorry for the pull requests. I have never worked with git before. I will just add the examples as text files here. tesla_p4.xml.txt tesla_t4.xml.txt

kuriosity121 avatar Sep 21 '22 10:09 kuriosity121

The encoder_stats_average_fps metric doesn't seem to be working on Windows 11 anymore. This is my output. output.txt

BrentonPoke avatar Mar 15 '23 04:03 BrentonPoke