LXD cluster does not return metrics for a project if no instance from that project is running on the queried node.
Required information
- Distribution: Ubuntu
- Distribution version: 23.10 (Mantic) (arm64)
- The output of "lxc info" or if that fails:
- Kernel version: 6.5.0-1009-raspi #12-Ubuntu SMP PREEMPT_DYNAMIC Wed Jan 17 11:45:08 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
- LXD version: 5.19
- Storage backend in use: microceph
Issue description
The LXD metrics API does not return metrics for a project if no instance from that project is running on the queried node. The query returns metrics only for projects whose instances actually run on the queried node.
Steps to reproduce
- Create an LXD cluster with 3+ nodes.
- Create one or more additional project(s) in the cluster
- Deploy and start several instances in each project. Make sure that each node hosts instances from every project.
- Run the
lxc query /1.0/metrics
command on all nodes and ensure that the query returns metrics for all instances in all projects in the cluster.
- Stop the instances from project "default" hosted on one of the nodes (making sure that other instances from project "default" keep running on other nodes), then run the
lxc query /1.0/metrics
command on that node. The query returns metrics for all instances from all projects except project "default".
- Run the
lxc query /1.0/metrics
command on the other nodes and ensure that the query returns metrics for all instances in all projects in the cluster.
This behavior garbles the metrics collected by external scrapers and dashboards such as Prometheus and Grafana.
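To illustrate what a scraper sees, here is a minimal sketch (not part of the original report; metric payloads and instance names are invented) that parses the Prometheus exposition text from two nodes and diffs the set of `project` labels, showing how the "default" project silently vanishes from one node's scrape:

```python
import re

# Regex for the project="..." label in a Prometheus exposition line.
PROJECT_RE = re.compile(r'project="([^"]*)"')

def projects_in(metrics_text: str) -> set:
    """Return the set of project labels present in an exposition payload."""
    return set(PROJECT_RE.findall(metrics_text))

# Invented payloads: node1 runs instances from both projects; node2 has
# no running instance from project "default", so those series are absent.
node1 = '''lxd_procs_total{name="c1",project="default",type="container"} 12
lxd_procs_total{name="web1",project="webapps",type="container"} 34'''
node2 = 'lxd_procs_total{name="web2",project="webapps",type="container"} 7'

missing = projects_in(node1) - projects_in(node2)
print(missing)  # → {'default'}
```

A dashboard aggregating per-project usage across the cluster would undercount the "default" project without any error being raised.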
@tregubovav-dev is this still an issue with LXD 5.20?
@simondeziel would you mind seeing if you can validate if this remains an issue?
Since the introduction of the metrics_instances_count extension, this bug is fixed. Here's how I did the initial reproduction with 5.19/stable:
$ lxc launch ubuntu-daily:22.04 c1 -c security.nesting=true -c security.devlxd.images=true
$ lxc shell c1
# snap refresh lxd --channel 5.19/stable
lxd (5.19/stable) 5.19-8635f82 from Canonical✓ refreshed
# lxd init --auto
# lxc init ubuntu-minimal-daily:22.04 c2
# lxc query /1.0/metrics | grep -v ^lxd_go | grep -v ^#
lxd_operations_total 1
lxd_warnings_total 3
lxd_uptime_seconds 65.457576337
This confirms stopped instances are not reported on. Now with 5.21/edge, which includes the metrics_instances_count extension, offline instances are reported:
# snap refresh lxd --channel 5.21/edge
# lxc query /1.0/metrics | grep -v ^lxd_go | grep -v ^# | grep -wF c2
lxd_cpu_seconds_total{cpu="0",mode="system",name="c2",project="default",state="STOPPED",type="container"} 0
lxd_cpu_seconds_total{cpu="0",mode="user",name="c2",project="default",state="STOPPED",type="container"} 0
lxd_cpu_effective_total{name="c2",project="default",state="STOPPED",type="container"} -1
lxd_filesystem_avail_bytes{device="",fstype="zfs",mountpoint="/",name="c2",project="default",state="STOPPED",type="container"} 1.5333982208e+11
lxd_filesystem_free_bytes{device="",fstype="zfs",mountpoint="/",name="c2",project="default",state="STOPPED",type="container"} 1.5333982208e+11
lxd_filesystem_size_bytes{device="",fstype="zfs",mountpoint="/",name="c2",project="default",state="STOPPED",type="container"} 1.54700218368e+11
lxd_memory_Active_bytes{name="c2",project="default",state="STOPPED",type="container"} 0
lxd_memory_Inactive_bytes{name="c2",project="default",state="STOPPED",type="container"} 0
lxd_memory_MemAvailable_bytes{name="c2",project="default",state="STOPPED",type="container"} 3.1642516001e+10
lxd_memory_MemFree_bytes{name="c2",project="default",state="STOPPED",type="container"} 3.1642516001e+10
lxd_memory_MemTotal_bytes{name="c2",project="default",state="STOPPED",type="container"} 3.1642516e+10
lxd_memory_Swap_bytes{name="c2",project="default",state="STOPPED",type="container"} -1
lxd_memory_OOM_kills_total{name="c2",project="default",state="STOPPED",type="container"} -1
lxd_procs_total{name="c2",project="default",state="STOPPED",type="container"} 0
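With the new state label, a dashboard can distinguish a stopped instance from missing data. A small sketch (the sample line is copied from the output above) of parsing the labels out of one series:

```python
import re

# Split a Prometheus series line into metric name, labels, and value.
LINE_RE = re.compile(r'^(\w+)\{(.*)\} (\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_series(line: str):
    """Return (metric, labels_dict, value) for one exposition line."""
    name, labels, value = LINE_RE.match(line).groups()
    return name, dict(LABEL_RE.findall(labels)), float(value)

line = 'lxd_procs_total{name="c2",project="default",state="STOPPED",type="container"} 0'
metric, labels, value = parse_series(line)
print(metric, labels["state"], value)  # → lxd_procs_total STOPPED 0.0
```

Filtering on state="STOPPED" (or excluding it) lets external scrapers keep per-project inventories consistent even when instances are down.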
So I believe your specific bug is fixed, but since I did not use the exact same reproduction steps (cluster setup), please do re-open the bug if it is not fixed in 5.21 or later.