
LXD cluster does not return metrics for a project if no instance from that project is running on the queried node.

Open tregubovav-dev opened this issue 1 year ago • 1 comments

Required information

  • Distribution: Ubuntu
  • Distribution version: 23.10 (Mantic) (arm64)
  • The output of "lxc info" or if that fails:
    • Kernel version: 6.5.0-1009-raspi # 12-Ubuntu SMP PREEMPT_DYNAMIC Wed Jan 17 11:45:08 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
    • LXD version: 5.19
    • Storage backend in use: microceph

Issue description

The LXD metrics API does not return metrics for a project if no instance from that project is running on the queried node. The query returns metrics only for projects that have instances running on the queried node.

Steps to reproduce

  1. Create an LXD cluster with 3+ nodes.
  2. Create one or more additional project(s) in the cluster
  3. Deploy and start several instances in all of the projects. Make sure that each node hosts instances from every project.
  4. Run lxc query /1.0/metrics command on all nodes and ensure that the query returns metrics for all instances in all projects in the cluster.
  5. Stop the instances from project "default" hosted on one of the nodes (make sure that other instances from project "default" keep running on other nodes), then run the lxc query /1.0/metrics command on that node. The query returns metrics for all instances from all projects except project "default".
  6. Run the lxc query /1.0/metrics command on the other nodes and ensure that the query returns metrics for all instances in all projects in the cluster.
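
To compare nodes in steps 4 to 6, it can help to reduce the metrics output to the list of projects it mentions. The following is a minimal sketch, not part of LXD: `projects_in_metrics` is a hypothetical helper name, and it assumes every per-instance metric line carries a `project="..."` label, as in the Prometheus-style output of /1.0/metrics.

```shell
# Hypothetical helper (not an LXD command): read Prometheus-style metrics
# on stdin and print each distinct value of the project label, sorted.
projects_in_metrics() {
    grep -o 'project="[^"]*"' | sort -u | cut -d'"' -f2
}
```

Running `lxc query /1.0/metrics | projects_in_metrics` on each node should then print the same project list everywhere; a node missing "default" after step 5 shows the behavior described here.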

This behavior garbles the metrics collected by external scrapers and dashboards such as Prometheus + Grafana.

tregubovav-dev avatar Jan 26 '24 06:01 tregubovav-dev

@tregubovav-dev is this still an issue with LXD 5.20?

@simondeziel would you mind seeing if you can validate if this remains an issue?

tomponline avatar Feb 21 '24 13:02 tomponline

Since the introduction of the metrics_instances_count extension, this bug is fixed. Here's how I did the initial reproduction with 5.19/stable:

$ lxc launch ubuntu-daily:22.04 c1 -c security.nesting=true -c security.devlxd.images=true
$ lxc shell c1
# snap refresh lxd --channel 5.19/stable
lxd (5.19/stable) 5.19-8635f82 from Canonical✓ refreshed
# lxd init --auto
# lxc init ubuntu-minimal-daily:22.04 c2
# lxc query /1.0/metrics | grep -v ^lxd_go | grep -v ^#
lxd_operations_total 1
lxd_warnings_total 3
lxd_uptime_seconds 65.457576337

This confirms that stopped instances are not reported. Now with 5.21/edge, which includes the metrics_instances_count extension, offline instances are reported:

# snap refresh lxd --channel 5.21/edge
# lxc query /1.0/metrics | grep -v ^lxd_go | grep -v ^# | grep -wF c2
lxd_cpu_seconds_total{cpu="0",mode="system",name="c2",project="default",state="STOPPED",type="container"} 0
lxd_cpu_seconds_total{cpu="0",mode="user",name="c2",project="default",state="STOPPED",type="container"} 0
lxd_cpu_effective_total{name="c2",project="default",state="STOPPED",type="container"} -1
lxd_filesystem_avail_bytes{device="",fstype="zfs",mountpoint="/",name="c2",project="default",state="STOPPED",type="container"} 1.5333982208e+11
lxd_filesystem_free_bytes{device="",fstype="zfs",mountpoint="/",name="c2",project="default",state="STOPPED",type="container"} 1.5333982208e+11
lxd_filesystem_size_bytes{device="",fstype="zfs",mountpoint="/",name="c2",project="default",state="STOPPED",type="container"} 1.54700218368e+11
lxd_memory_Active_bytes{name="c2",project="default",state="STOPPED",type="container"} 0
lxd_memory_Inactive_bytes{name="c2",project="default",state="STOPPED",type="container"} 0
lxd_memory_MemAvailable_bytes{name="c2",project="default",state="STOPPED",type="container"} 3.1642516001e+10
lxd_memory_MemFree_bytes{name="c2",project="default",state="STOPPED",type="container"} 3.1642516001e+10
lxd_memory_MemTotal_bytes{name="c2",project="default",state="STOPPED",type="container"} 3.1642516e+10
lxd_memory_Swap_bytes{name="c2",project="default",state="STOPPED",type="container"} -1
lxd_memory_OOM_kills_total{name="c2",project="default",state="STOPPED",type="container"} -1
lxd_procs_total{name="c2",project="default",state="STOPPED",type="container"} 0
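
To list only the instances reported as stopped, the same output can be filtered on the `state` label. This is a rough sketch with a hypothetical helper name (`stopped_instances`), assuming the `name` and `state` labels appear exactly as in the output above:

```shell
# Hypothetical helper (not an LXD command): read Prometheus-style metrics
# on stdin and print the distinct names of instances with state="STOPPED".
stopped_instances() {
    grep -F 'state="STOPPED"' \
        | grep -o '[{,]name="[^"]*"' \
        | sort -u \
        | cut -d'"' -f2
}
```

For example, `lxc query /1.0/metrics | stopped_instances` would be expected to include c2 in the session above.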

So I believe your specific bug is fixed, but since I have not used the exact same reproduction steps (cluster setup), please re-open the bug if it is not fixed in 5.21 or later.

simondeziel avatar Mar 13 '24 18:03 simondeziel