feat: New input plugin for libvirt
- [x] Updated associated README.md.
- [x] Wrote appropriate unit tests.
- [x] Pull request title or commits are in conventional commit format.
resolves #65 resolves #70 resolves #690
This is a continuation of work done in the following PRs:
- #357 (based on https://github.com/alexzorin/libvirt-go -> https://github.com/rgbkrk/libvirt-go -> https://github.com/libvirt/libvirt-go, which is the official libvirt Go library but uses cgo)
- #2560 (which called the `virsh` binary and parsed its output)
- #3166 (based on https://github.com/libvirt/libvirt-go, which is the official libvirt Go library but uses cgo)
- #3592 (based on https://github.com/digitalocean/go-libvirt, which was cgo-free but immature at that time; it was used to generate XML with metrics, which were parsed by code from that PR)
This PR uses https://github.com/digitalocean/go-libvirt, which has since become a very mature project. It is still cgo-free and provides a pure Go interface for interacting with libvirt.
The plugin exposes all domain statistics that can be gathered from the newest versions of libvirt (>= 7.x.y); with previous versions it will do its best to expose as much as possible.
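For reference, here is a minimal sketch (not code from this PR) of what talking to libvirt through go-libvirt looks like; the socket path and timeout are assumptions for a default local installation:

```go
// Minimal go-libvirt sketch, assuming a default local libvirt installation.
// Not code from this PR; error handling is reduced to the essentials.
package main

import (
	"fmt"
	"log"
	"net"
	"time"

	"github.com/digitalocean/go-libvirt"
)

func main() {
	// go-libvirt speaks the libvirt RPC protocol over a plain net.Conn,
	// so no cgo and no local libvirt client library are needed.
	c, err := net.DialTimeout("unix", "/var/run/libvirt/libvirt-sock", 2*time.Second)
	if err != nil {
		log.Fatalf("failed to dial libvirt: %v", err)
	}

	l := libvirt.New(c)
	if err := l.Connect(); err != nil {
		log.Fatalf("failed to connect: %v", err)
	}
	defer l.Disconnect()

	// List the defined domains; the plugin gathers per-domain statistics
	// (per the statistics groups listed below) over the same connection.
	domains, err := l.Domains()
	if err != nil {
		log.Fatalf("failed to list domains: %v", err)
	}
	for _, d := range domains {
		fmt.Println(d.Name)
	}
}
```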
List of exposed metrics:
| Statistics group | Metric name | Exposed Telegraf field | Description |
|---|---|---|---|
| state | state.state | state | state of the VM, returned as number from virDomainState enum |
| | state.reason | reason | reason for entering given state, returned as int from virDomain*Reason enum corresponding to given state |
| cpu_total | cpu.time | time | total cpu time spent for this domain in nanoseconds |
| | cpu.user | user | user cpu time spent in nanoseconds |
| | cpu.system | system | system cpu time spent in nanoseconds |
| | cpu.haltpoll.success.time | haltpoll_success_time | cpu halt polling success time spent in nanoseconds |
| | cpu.haltpoll.fail.time | haltpoll_fail_time | cpu halt polling fail time spent in nanoseconds |
| | cpu.cache.monitor.count | count | the number of cache monitors for this domain |
| | cpu.cache.monitor.<num>.name | name | the name of cache monitor <num>, not available for kernels from 4.14 upwards |
| | cpu.cache.monitor.<num>.vcpus | vcpus | vcpu list of cache monitor <num>, not available for kernels from 4.14 upwards |
| | cpu.cache.monitor.<num>.bank.count | bank_count | the number of cache banks in cache monitor <num>, not available for kernels from 4.14 upwards |
| | cpu.cache.monitor.<num>.bank.<index>.id | id | host allocated cache id for bank <index> in cache monitor <num>, not available for kernels from 4.14 upwards |
| | cpu.cache.monitor.<num>.bank.<index>.bytes | bytes | the number of bytes of last level cache that the domain is using on cache bank <index>, not available for kernels from 4.14 upwards |
| balloon | balloon.current | current | the memory in KiB currently used |
| | balloon.maximum | maximum | the maximum memory in KiB allowed |
| | balloon.swap_in | swap_in | the amount of data read from swap space (in KiB) |
| | balloon.swap_out | swap_out | the amount of memory written out to swap space (in KiB) |
| | balloon.major_fault | major_fault | the number of page faults when disk IO was required |
| | balloon.minor_fault | minor_fault | the number of other page faults |
| | balloon.unused | unused | the amount of memory left unused by the system (in KiB) |
| | balloon.available | available | the amount of usable memory as seen by the domain (in KiB) |
| | balloon.rss | rss | Resident Set Size of running domain's process (in KiB) |
| | balloon.usable | usable | the amount of memory which can be reclaimed by balloon without causing host swapping (in KiB) |
| | balloon.last-update | last_update | timestamp of the last update of statistics (in seconds) |
| | balloon.disk_caches | disk_caches | the amount of memory that can be reclaimed without additional I/O, typically disk caches (in KiB) |
| | balloon.hugetlb_pgalloc | hugetlb_pgalloc | the number of successful huge page allocations from inside the domain via virtio balloon |
| | balloon.hugetlb_pgfail | hugetlb_pgfail | the number of failed huge page allocations from inside the domain via virtio balloon |
| vcpu | vcpu.current | current | current number of online virtual CPUs |
| | vcpu.maximum | maximum | maximum number of online virtual CPUs |
| | vcpu.<num>.state | state | state of the virtual CPU <num>, as number from virVcpuState enum |
| | vcpu.<num>.time | time | virtual cpu time spent by virtual CPU <num> (in microseconds) |
| | vcpu.<num>.wait | wait | virtual cpu time spent by virtual CPU <num> waiting on I/O (in microseconds) |
| | vcpu.<num>.halted | halted | virtual CPU <num> is halted: yes or no (may indicate the processor is idle or even disabled, depending on the architecture) |
| | vcpu.<num>.halted | halted_i | virtual CPU <num> is halted: 1 (for "yes") or 0 (for other values) (may indicate the processor is idle or even disabled, depending on the architecture) |
| | vcpu.<num>.delay | delay | time the vCPU <num> thread was enqueued by the host scheduler, but was waiting in the queue instead of running; exposed to the VM as steal time |
| | --- | cpu_id | information about mapping vcpu_id to cpu_id (id of physical cpu); only exposed when statistics_group contains vcpu and additional_statistics contains vcpu_mapping (in config) |
| interface | net.count | count | number of network interfaces on this domain |
| | net.<num>.name | name | name of the interface <num> |
| | net.<num>.rx.bytes | rx_bytes | number of bytes received |
| | net.<num>.rx.pkts | rx_pkts | number of packets received |
| | net.<num>.rx.errs | rx_errs | number of receive errors |
| | net.<num>.rx.drop | rx_drop | number of receive packets dropped |
| | net.<num>.tx.bytes | tx_bytes | number of bytes transmitted |
| | net.<num>.tx.pkts | tx_pkts | number of packets transmitted |
| | net.<num>.tx.errs | tx_errs | number of transmission errors |
| | net.<num>.tx.drop | tx_drop | number of transmit packets dropped |
| perf | perf.cmt | cmt | the cache usage in Byte currently used, not available for kernels from 4.14 upwards |
| | perf.mbmt | mbmt | total system bandwidth from one level of cache, not available for kernels from 4.14 upwards |
| | perf.mbml | mbml | bandwidth of memory traffic for a memory controller, not available for kernels from 4.14 upwards |
| | perf.cpu_cycles | cpu_cycles | the count of cpu cycles (total/elapsed) |
| | perf.instructions | instructions | the count of instructions |
| | perf.cache_references | cache_references | the count of cache hits |
| | perf.cache_misses | cache_misses | the count of cache misses |
| | perf.branch_instructions | branch_instructions | the count of branch instructions |
| | perf.branch_misses | branch_misses | the count of branch misses |
| | perf.bus_cycles | bus_cycles | the count of bus cycles |
| | perf.stalled_cycles_frontend | stalled_cycles_frontend | the count of stalled frontend cpu cycles |
| | perf.stalled_cycles_backend | stalled_cycles_backend | the count of stalled backend cpu cycles |
| | perf.ref_cpu_cycles | ref_cpu_cycles | the count of ref cpu cycles |
| | perf.cpu_clock | cpu_clock | the count of cpu clock time |
| | perf.task_clock | task_clock | the count of task clock time |
| | perf.page_faults | page_faults | the count of page faults |
| | perf.context_switches | context_switches | the count of context switches |
| | perf.cpu_migrations | cpu_migrations | the count of cpu migrations |
| | perf.page_faults_min | page_faults_min | the count of minor page faults |
| | perf.page_faults_maj | page_faults_maj | the count of major page faults |
| | perf.alignment_faults | alignment_faults | the count of alignment faults |
| | perf.emulation_faults | emulation_faults | the count of emulation faults |
| block | block.count | count | number of block devices being listed |
| | block.<num>.name | name | name of the target of the block device <num> (the same name for multiple entries if --backing is present) |
| | block.<num>.backingIndex | backingIndex | when --backing is present, matches up with the <backingStore> index listed in domain XML for backing files |
| | block.<num>.path | path | file source of block device <num>, if it is a local file or block device |
| | block.<num>.rd.reqs | rd_reqs | number of read requests |
| | block.<num>.rd.bytes | rd_bytes | number of read bytes |
| | block.<num>.rd.times | rd_times | total time (ns) spent on reads |
| | block.<num>.wr.reqs | wr_reqs | number of write requests |
| | block.<num>.wr.bytes | wr_bytes | number of written bytes |
| | block.<num>.wr.times | wr_times | total time (ns) spent on writes |
| | block.<num>.fl.reqs | fl_reqs | total flush requests |
| | block.<num>.fl.times | fl_times | total time (ns) spent on cache flushing |
| | block.<num>.errors | errors | Xen only: the 'oo_req' value |
| | block.<num>.allocation | allocation | offset of highest written sector in bytes |
| | block.<num>.capacity | capacity | logical size of source file in bytes |
| | block.<num>.physical | physical | physical size of source file in bytes |
| | block.<num>.threshold | threshold | threshold (in bytes) for delivering the VIR_DOMAIN_EVENT_ID_BLOCK_THRESHOLD event. See domblkthreshold |
| iothread | iothread.count | count | maximum number of IOThreads in the subsequent list as unsigned int. Each IOThread in the list will use its iothread_id value as the <id>. There may be fewer <id> entries than the iothread.count value if the polling values are not supported |
| | iothread.<id>.poll-max-ns | poll_max_ns | maximum polling time in nanoseconds used by the <id> IOThread. A value of 0 (zero) indicates polling is disabled |
| | iothread.<id>.poll-grow | poll_grow | polling time grow value. A value of 0 (zero) indicates growth is managed by the hypervisor |
| | iothread.<id>.poll-shrink | poll_shrink | polling time shrink value. A value of 0 (zero) indicates shrink is managed by the hypervisor |
| memory | memory.bandwidth.monitor.count | count | the number of memory bandwidth monitors for this domain, not available for kernels from 4.14 upwards |
| | memory.bandwidth.monitor.<num>.name | name | the name of monitor <num>, not available for kernels from 4.14 upwards |
| | memory.bandwidth.monitor.<num>.vcpus | vcpus | the vcpu list of monitor <num>, not available for kernels from 4.14 upwards |
| | memory.bandwidth.monitor.<num>.node.count | node_count | the number of memory controllers in monitor <num>, not available for kernels from 4.14 upwards |
| | memory.bandwidth.monitor.<num>.node.<index>.id | id | host allocated memory controller id for controller <index> of monitor <num>, not available for kernels from 4.14 upwards |
| | memory.bandwidth.monitor.<num>.node.<index>.bytes.local | bytes_local | the accumulative bytes consumed by @vcpus that pass through the memory controller in the same processor that the scheduled host CPU belongs to, not available for kernels from 4.14 upwards |
| | memory.bandwidth.monitor.<num>.node.<index>.bytes.total | bytes_total | the total bytes consumed by @vcpus that pass through all memory controllers, either local or remote, not available for kernels from 4.14 upwards |
| dirtyrate | dirtyrate.calc_status | calc_status | the status of last memory dirty rate calculation, returned as number from virDomainDirtyRateStatus enum |
| | dirtyrate.calc_start_time | calc_start_time | the start time of last memory dirty rate calculation |
| | dirtyrate.calc_period | calc_period | the period of last memory dirty rate calculation |
| | dirtyrate.megabytes_per_second | megabytes_per_second | the calculated memory dirty rate in MiB/s |
| | dirtyrate.calc_mode | calc_mode | the calculation mode used in the last measurement (page-sampling/dirty-bitmap/dirty-ring) |
| | dirtyrate.vcpu.<num>.megabytes_per_second | megabytes_per_second | the calculated memory dirty rate for a virtual cpu in MiB/s |
And additional statistics:
| Statistics group | Exposed Telegraf tag | Exposed Telegraf field | Description |
|---|---|---|---|
| vcpu_mapping | vcpu_id | --- | ID of Virtual CPU |
| | --- | cpu_id | Comma separated list (exposed as a string) of Physical CPU IDs |
Hi @p-zak, this looks pretty good. Thanks!
I have a question about future metric format changes. I assume libvirt's data model doesn't change very often, but if it does, how will it affect the metric format of this plugin? I don't see a hard-coded mapping in the code like the table in the description, so I assume there's a pattern mapping and a change in libvirt would change the metric format.
I would like to avoid the situation where a user starts using telegraf + inputs.libvirt with one version of libvirt, then upgrades libvirt to a version that removes or renames a field. Telegraf would then produce metrics of a slightly different format, which, depending on the outputs being used, can cause write errors or query errors downstream (see the format changes doc).
@reimda I believe that indeed libvirt's data model doesn't change very often (that's probably why most of the metrics are in snake_case format, though there are a few in dash-case or camelCase which weren't corrected).
That's why there is a mapping from the source metrics (from libvirt) to the metrics exposed by this plugin. You can find it in libvirt_metric_format.go. It exposes only metrics which are known (up to libvirt 8.7.0). If something changes (removal, addition, renaming), it will need to be adjusted in this plugin.
I hope that this is the approach you want to achieve? :)
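For illustration, a hypothetical sketch of that kind of lookup (not the actual contents of libvirt_metric_format.go); the example name pairs come from the table above:

```go
// Hypothetical sketch of the name mapping, not the real libvirt_metric_format.go.
package main

import "fmt"

// knownFields maps libvirt statistic names (with <num>/<index> parts stripped)
// to the field names the plugin exposes; example entries taken from the table
// in the PR description. The real file enumerates every metric known up to
// libvirt 8.7.0.
var knownFields = map[string]string{
	"balloon.last-update":       "last_update",           // dash-case normalized
	"iothread.poll-max-ns":      "poll_max_ns",           // dash-case normalized
	"cpu.haltpoll.success.time": "haltpoll_success_time", // dots flattened
}

// fieldName returns the exposed field name for a libvirt statistic, or
// ok=false for statistics the plugin does not know, so an added or renamed
// libvirt field is skipped instead of silently changing the output format.
func fieldName(libvirtName string) (name string, ok bool) {
	name, ok = knownFields[libvirtName]
	return name, ok
}

func main() {
	if name, ok := fieldName("balloon.last-update"); ok {
		fmt.Println(name) // prints: last_update
	}
}
```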
I like how the current mapping code makes the names uniform. The only potential problem I see is that since it is pattern based, if the data from libvirt changes, it will change the metrics telegraf produces. If someone stores the metrics in a database and builds an application or dashboard that queries the database, then a change in the metrics has the potential to break queries and break the downstream application.
This is a problem that some other telegraf plugins have. We don't need to prevent it from happening here. I am ok with relying on libvirt's data model not changing often, but maybe we should put something in the docs that lets users know to expect changes in the metric format depending on which version of libvirt they use. What do you think? Could you add a note to README.md?
:relaxed: This pull request doesn't significantly change the Telegraf binary size (less than 1%)
Thanks Paweł!