
Metrics endpoint reporting MaxUint64 value for free/available memory on frozen VM

Open edlerd opened this issue 7 months ago • 13 comments

Please confirm

  • [x] I have searched existing issues to check if an issue already exists for the bug I encountered.

Distribution

ubuntu

Distribution version

snap

Output of "snap list --all lxd core20 core22 core24 snapd"

Name    Version         Rev    Tracking       Publisher   Notes
core20  20250213        2501   latest/stable  canonical✓  base,disabled
core20  20250407        2571   latest/stable  canonical✓  base
core22  20250315        1908   latest/stable  canonical✓  base,disabled
core22  20250408        1963   latest/stable  canonical✓  base
core24  20250318        888    latest/stable  canonical✓  base,disabled
core24  20250504        988    latest/stable  canonical✓  base
lxd     5.21.3-c5ae129  33110  5.21/stable    canonical✓  -
snapd   2.67.1          23771  latest/stable  canonical✓  snapd,disabled
snapd   2.68.4          24505  latest/stable  canonical✓  snapd

Output of "lxc info" or system info if it fails

-

Issue description

The GET /1.0/metrics endpoint sometimes reports wrong values for memory. See the example below, where free and available memory are unreasonably high, far exceeding the total memory.

lxd_memory_MemAvailable_bytes{name="nms",project="default",state="RUNNING",type="virtual-machine"} 1.844674407367206e+19
lxd_memory_MemFree_bytes{name="nms",project="default",state="RUNNING",type="virtual-machine"} 1.844674407367206e+19
lxd_memory_MemTotal_bytes{name="nms",project="default",state="RUNNING",type="virtual-machine"} 1.073741824e+09
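
For context, 1.844674407367206e+19 is just below 2^64; as noted further down in the thread, this is essentially MaxUint64, which points at either a MaxUint64 default/sentinel or an unsigned underflow rather than a real measurement. A minimal Go sketch (not LXD's actual code path; the zero total and the 37 MiB "used" figure are purely illustrative assumptions) of how values in this range can arise:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// MaxUint64 rendered as a float64, the way the Prometheus text format shows it:
	fmt.Println(float64(math.MaxUint64)) // 1.8446744073709552e+19

	// Unsigned subtraction wraps around instead of going negative, so a
	// "total - used" computed against a zero (or missing) total lands just
	// below MaxUint64.
	var total, used uint64 = 0, 37 << 20 // 37 MiB of "used" memory, purely illustrative
	free := total - used
	fmt.Println(float64(free)) // ~1.8446744e+19, the same ballpark as the metric above
}
```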

Full instance configuration is

name: nms
description: ''
status: Frozen
status_code: 110
created_at: '2025-05-22T07:20:35.086464209Z'
last_used_at: '2025-05-23T09:06:39.579476605Z'
location: lxd0
type: virtual-machine
project: default
architecture: x86_64
ephemeral: false
stateful: false
profiles:
  - default
config:
  image.architecture: amd64
  image.description: ubuntu 24.04 LTS amd64 (release) (20250516)
  image.label: release
  image.os: ubuntu
  image.release: noble
  image.serial: '20250516'
  image.type: disk1.img
  image.version: '24.04'
  volatile.base_image: 114a1bc50c4d10b31da8c9fc91c181713acf0ce37eee13521dcfa3325e02ab84
  volatile.cloud-init.instance-id: a3feb58a-239d-4142-9ba1-d818bd7bc2c8
  volatile.eth-1.host_name: tap198af091
  volatile.eth-1.hwaddr: 00:16:3e:99:12:2e
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: 'false'
  volatile.uuid: a36842fb-d08b-4cdb-b056-1e5852fde1fd
  volatile.uuid.generation: a36842fb-d08b-4cdb-b056-1e5852fde1fd
  volatile.vsock_id: '2720916272'
devices: {}

Steps to reproduce

Sadly, I have no reproducer.

Information to attach

  • [ ] Any relevant kernel output (dmesg)
  • [ ] Instance log (lxc info NAME --show-log)
  • [ ] Instance configuration (lxc config show NAME --expanded)
  • [ ] Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)
  • [ ] Output of the client with --debug
  • [ ] Output of the daemon with --debug (or use lxc monitor while reproducing the issue)

edlerd avatar May 23 '25 10:05 edlerd

@edlerd where did you get the report from?

Please can you ask them to post the output of free -m inside the VMs at the time they are getting this output from the /1.0/metrics endpoint?

tomponline avatar May 23 '25 10:05 tomponline

@edlerd where did you get the report from?

Please can you ask them to post the output of free -m inside the VMs at the time they are getting this output from the /1.0/metrics endpoint?

The VM is frozen at the time this occurs. When it is running, the values are correct.

edlerd avatar May 23 '25 10:05 edlerd

This might be related to https://github.com/canonical/lxd-ui/issues/1229 recently also observed by our colleague @lorumic

edlerd avatar May 23 '25 10:05 edlerd

@edlerd if you can try and confirm if the lxd-agent is running in the guest or not that would also be useful

tomponline avatar May 23 '25 10:05 tomponline

@edlerd where did you get the report from? Please can you ask them to post the output of free -m inside the VMs at the time they are getting this output from the /1.0/metrics endpoint?

The VM is frozen at the time this occurs. When it is running, the values are correct.

Oh that's a super important detail - because the memory info won't be available from the lxd-agent

tomponline avatar May 23 '25 10:05 tomponline

Steps to reproduce

Sadly, I have no reproducer.

I do:

https://github.com/user-attachments/assets/868f0e76-51a8-49c8-9958-5411dfb5702f

lorumic avatar May 23 '25 10:05 lorumic

@lorumic please can you see if you can reproduce using the CLI to isolate the UI out of the equation

You can use lxc query /1.0/metrics to access the metrics that the UI is consuming.

Ta

tomponline avatar May 23 '25 10:05 tomponline

I suspect this is the metrics endpoint and/or the UI misinterpreting "no value" as an extremely small value, as you can't get memory usage when the VM is frozen.

tomponline avatar May 23 '25 10:05 tomponline

@lorumic please can you see if you can reproduce using the CLI to isolate the UI out of the equation

You can use lxc query /1.0/metrics to access the metrics that the UI is consuming.

$ lxc query /1.0/metrics | grep 'lxd_memory_MemAvailable_bytes{name="nms",project="default"'
lxd_memory_MemAvailable_bytes{name="nms",project="default",state="RUNNING",type="virtual-machine"} 1.8446744073672397e+19

It's the same value shown in the bug report above (under "Issue description"). It's not a "no value" but rather an excessively high one that overflows to a negative number in JS.

Some new findings (or possibly mere speculation): I suddenly could not reproduce the issue anymore when memory usage was around 20% (~200M out of a 1G total). I then stopped and started the instance, and memory usage went up to 80%, which (as can also be seen in the screen capture above) was the value it had when the issue could be reproduced reliably. At that point I could reproduce it again. So maybe high memory usage is somehow related to reproducing the issue?
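
One way to read the "overflows to a negative number in JS" remark above: if the UI derives used memory as total minus available (an assumption about the UI code, not something confirmed in this thread), these metric values produce a hugely negative result. A quick Go sketch with the reported numbers (Go's float64 and JavaScript's Number are both IEEE-754 doubles, so the arithmetic matches):

```go
package main

import "fmt"

func main() {
	// Values reported by /1.0/metrics for the frozen VM.
	total := 1.073741824e9             // lxd_memory_MemTotal_bytes
	available := 1.8446744073672397e19 // lxd_memory_MemAvailable_bytes

	// Hypothetical "used = total - available" calculation, as a UI might do it.
	used := total - available
	fmt.Println(used) // roughly -1.84e+19, i.e. wildly negative
}
```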

lorumic avatar May 23 '25 10:05 lorumic

I suspect this is the metrics endpoint and/or the UI misinterpreting "no value" as an extremely small value, as you can't get memory usage when the VM is frozen.

The UI bug in the video was analyzed, and the root problem is the free value being reported as huge while the total is much smaller, as stated in the description of the original issue above. We now have a workaround in the UI that sanitizes the memory values -- this specific case will be handled as "no memory information available" to avoid showing the bogus values seen in the video.
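
For illustration, the sanitization described above boils down to a simple plausibility check. Here is a sketch of that check in Go (the real workaround lives in the lxd-ui TypeScript code; the helper name and signature are hypothetical):

```go
package main

import "fmt"

// sanitizeMemory treats free/available values that exceed the total as
// "no memory information available" instead of displaying them.
// Hypothetical helper, not taken from the lxd-ui source.
func sanitizeMemory(total, free, available uint64) (uint64, uint64, bool) {
	if total == 0 || free > total || available > total {
		return 0, 0, false
	}
	return free, available, true
}

func main() {
	total := uint64(1073741824)           // 1 GiB, from lxd_memory_MemTotal_bytes
	bogus := uint64(18446744073672060000) // near-MaxUint64 value reported while frozen
	if _, _, ok := sanitizeMemory(total, bogus, bogus); !ok {
		fmt.Println("no memory information available")
	}
}
```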

edlerd avatar May 23 '25 10:05 edlerd

@lorumic @edlerd please can we have reproducer steps without the UI, e.g.

lxc launch foo ...
lxc freeze foo
lxc query /1.0/metrics | grep ...

This makes it much easier for the engineers to reproduce and fix.

Thanks

tomponline avatar May 23 '25 10:05 tomponline

@lorumic @edlerd please can we have reproducer steps without the UI, e.g.

lxc launch foo ...
lxc freeze foo
lxc query /1.0/metrics | grep ...

This makes it much easier for the engineers to reproduce and fix.

And add test so it never regresses once fixed :)

simondeziel avatar May 23 '25 13:05 simondeziel

@escabo connected the dots: 1.8446744073672397e+19 is MaxUint64

simondeziel avatar May 23 '25 13:05 simondeziel