Metrics endpoint reporting MaxUint64 value for free/available memory on frozen VM
Please confirm
- [x] I have searched existing issues to check if an issue already exists for the bug I encountered.
Distribution
ubuntu
Distribution version
snap
Output of "snap list --all lxd core20 core22 core24 snapd"
Name Version Rev Tracking Publisher Notes
core20 20250213 2501 latest/stable canonical✓ base,disabled
core20 20250407 2571 latest/stable canonical✓ base
core22 20250315 1908 latest/stable canonical✓ base,disabled
core22 20250408 1963 latest/stable canonical✓ base
core24 20250318 888 latest/stable canonical✓ base,disabled
core24 20250504 988 latest/stable canonical✓ base
lxd 5.21.3-c5ae129 33110 5.21/stable canonical✓ -
snapd 2.67.1 23771 latest/stable canonical✓ snapd,disabled
snapd 2.68.4 24505 latest/stable canonical✓ snapd
Output of "lxc info" or system info if it fails
-
Issue description
The GET /1.0/metrics endpoint sometimes reports wrong values for memory. See the example below, where the reported free and available memory are unreasonably high, far exceeding the total memory.
lxd_memory_MemAvailable_bytes{name="nms",project="default",state="RUNNING",type="virtual-machine"} 1.844674407367206e+19
lxd_memory_MemFree_bytes{name="nms",project="default",state="RUNNING",type="virtual-machine"} 1.844674407367206e+19
lxd_memory_MemTotal_bytes{name="nms",project="default",state="RUNNING",type="virtual-machine"} 1.073741824e+09
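As a quick sanity check (illustrative Go only, not LXD code): free or available memory can never exceed the total, so these samples are implausible on their face. The values below are copied from the metrics output above.

```go
package main

import "fmt"

func main() {
	// Values copied from the metrics output above.
	memTotal := 1.073741824e+09
	samples := map[string]float64{
		"MemFree":      1.844674407367206e+19,
		"MemAvailable": 1.844674407367206e+19,
	}
	for name, v := range samples {
		if v > memTotal {
			fmt.Printf("implausible %s: %g exceeds MemTotal %g\n", name, v, memTotal)
		}
	}
}
```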
Full instance configuration is:
name: nms
description: ''
status: Frozen
status_code: 110
created_at: '2025-05-22T07:20:35.086464209Z'
last_used_at: '2025-05-23T09:06:39.579476605Z'
location: lxd0
type: virtual-machine
project: default
architecture: x86_64
ephemeral: false
stateful: false
profiles:
- default
config:
image.architecture: amd64
image.description: ubuntu 24.04 LTS amd64 (release) (20250516)
image.label: release
image.os: ubuntu
image.release: noble
image.serial: '20250516'
image.type: disk1.img
image.version: '24.04'
volatile.base_image: 114a1bc50c4d10b31da8c9fc91c181713acf0ce37eee13521dcfa3325e02ab84
volatile.cloud-init.instance-id: a3feb58a-239d-4142-9ba1-d818bd7bc2c8
volatile.eth-1.host_name: tap198af091
volatile.eth-1.hwaddr: 00:16:3e:99:12:2e
volatile.last_state.power: RUNNING
volatile.last_state.ready: 'false'
volatile.uuid: a36842fb-d08b-4cdb-b056-1e5852fde1fd
volatile.uuid.generation: a36842fb-d08b-4cdb-b056-1e5852fde1fd
volatile.vsock_id: '2720916272'
devices: {}
Steps to reproduce
Sadly, I have no reproducer.
Information to attach
- [ ] Any relevant kernel output (`dmesg`)
- [ ] Instance log (`lxc info NAME --show-log`)
- [ ] Instance configuration (`lxc config show NAME --expanded`)
- [ ] Main daemon log (at `/var/log/lxd/lxd.log` or `/var/snap/lxd/common/lxd/logs/lxd.log`)
- [ ] Output of the client with `--debug`
- [ ] Output of the daemon with `--debug` (or use `lxc monitor` while reproducing the issue)
@edlerd where did you get the report from?
Please can you ask them to post the output of free -m inside the VMs at the time they are getting this output from the /1.0/metrics endpoint?
> @edlerd where did you get the report from?
> Please can you ask them to post the output of `free -m` inside the VMs at the time they are getting this output from the /1.0/metrics endpoint?
The VM is frozen at the time this occurs. When it is running, the values are correct.
This might be related to https://github.com/canonical/lxd-ui/issues/1229 recently also observed by our colleague @lorumic
@edlerd if you can try and confirm whether the lxd-agent is running in the guest or not, that would also be useful
> @edlerd where did you get the report from? Please can you ask them to post the output of `free -m` inside the VMs at the time they are getting this output from the /1.0/metrics endpoint?

> The VM is frozen at the time this occurs. When it is running, the values are correct.

Oh that's a super important detail - because the memory info won't be available from the lxd-agent
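A minimal sketch of that failure mode, with hypothetical names (readGuestMemFree and errAgentUnavailable are stand-ins, not LXD's actual API): when the guest is frozen and the agent is unreachable, the memory series should be omitted rather than filled with a sentinel.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// errAgentUnavailable and readGuestMemFree are hypothetical stand-ins
// for the agent transport; this is not LXD's actual code.
var errAgentUnavailable = errors.New("lxd-agent unreachable")

func readGuestMemFree(frozen bool) (uint64, error) {
	if frozen {
		// A frozen guest cannot answer, so there is no real value here.
		return math.MaxUint64, errAgentUnavailable
	}
	return 512 * 1024 * 1024, nil
}

func main() {
	free, err := readGuestMemFree(true)
	if err != nil {
		// Safer behaviour: drop the sample entirely rather than
		// exporting the sentinel as if it were a measurement.
		fmt.Println("skipping lxd_memory_MemFree_bytes:", err)
		return
	}
	fmt.Printf("lxd_memory_MemFree_bytes %d\n", free)
}
```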
> Steps to reproduce
> Sadly, I have no reproducer.

I do:

https://github.com/user-attachments/assets/868f0e76-51a8-49c8-9958-5411dfb5702f
@lorumic please can you see if you can reproduce using the CLI to take the UI out of the equation
You can use `lxc query /1.0/metrics` to access the metrics that the UI is consuming.
Ta
I suspect this is the metrics endpoint and/or the UI misinterpreting "no value" as an extremely small value, as you can't get memory usage when the VM is frozen.
> @lorumic please can you see if you can reproduce using the CLI to take the UI out of the equation
> You can use `lxc query /1.0/metrics` to access the metrics that the UI is consuming.
$ lxc query /1.0/metrics | grep 'lxd_memory_MemAvailable_bytes{name="nms",project="default"'
lxd_memory_MemAvailable_bytes{name="nms",project="default",state="RUNNING",type="virtual-machine"} 1.8446744073672397e+19
It's the same value shown in the bug report above (under "Issue description"). It's not a "no value" but rather an excessively high one that overflows to a negative number in JS.
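To illustrate that point (Go's float64 is the same IEEE-754 double a JS Number uses): deriving a "used" figure from the reported samples goes hugely negative, which is what the UI ended up charting. A minimal sketch, with both values copied from the report:

```go
package main

import "fmt"

func main() {
	// Both values copied from the metrics output; Go's float64 is the
	// same IEEE-754 double that a JS Number uses.
	memTotal := 1.073741824e+09
	memAvailable := 1.8446744073672397e+19

	used := memTotal - memAvailable
	fmt.Printf("used = %g bytes\n", used) // ≈ -1.8446744e+19, hugely negative
}
```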
Some new findings (or possibly mere speculation): I suddenly could not reproduce the issue anymore when memory usage was around 20% (~200M out of a 1G total). I then stopped and started the instance, and memory usage went up to 80%, which I remembered (and can also be seen in the screen capture above) was the value it had when the issue could be reproduced reliably. And indeed I could reproduce it again. So maybe high memory usage is somehow related to reproducing the issue?
> I suspect this is the metrics endpoint and/or the UI misinterpreting "no value" as an extremely small value, as you can't get memory usage when the VM is frozen.
The UI bug in the video was analyzed, and the root problem is the free value being reported as huge while the total is much smaller, as stated in the description of the original issue above. We have a workaround in the UI now that sanitizes the memory values: this specific case is handled as "no memory information available" to avoid showing the broken chart seen in the video.
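The actual workaround lives in lxd-ui and is written in TypeScript; here is a rough sketch of the same rule in Go for consistency with the other examples, assuming the check is simply free/available against total:

```go
package main

import "fmt"

// sanitizeMem mirrors the rule described above: if any figure is
// impossible, treat the whole sample as "no memory information".
func sanitizeMem(total, free, available float64) bool {
	if total <= 0 || free > total || available > total {
		return false
	}
	return true
}

func main() {
	// The frozen-VM sample from this report is rejected...
	fmt.Println(sanitizeMem(1.073741824e+09, 1.844674407367206e+19, 1.844674407367206e+19)) // false
	// ...while a normal running-VM sample passes.
	fmt.Println(sanitizeMem(1.073741824e+09, 2.0e+08, 3.1e+08)) // true
}
```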
@lorumic @edlerd please can we have reproducer steps without the UI, e.g.
lxc launch foo ...
lxc freeze foo
lxc query /1.0/metrics | grep ...
This makes it much easier for the engineers to reproduce and fix.
Thanks
> @lorumic @edlerd please can we have reproducer steps without the UI, e.g.
> lxc launch foo ...
> lxc freeze foo
> lxc query /1.0/metrics | grep ...
> This makes it much easier for the engineers to reproduce and fix.
And add a test so it never regresses once fixed :)
@escabo connected the dots: 1.8446744073672397e+19 is MaxUint64 (2^64 - 1), which strongly suggests an unsigned integer underflow or an "unknown" sentinel leaking into the metrics.
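A quick check supports this: MaxUint64 rendered as a float64 is 1.8446744073709552e+19, the same magnitude as the reported value, and a uint64 subtraction that underflows lands just below it. The subtrahend below is chosen purely to illustrate the wraparound, not taken from LXD's code:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// 2^64 - 1 rendered as a double: 1.8446744073709552e+19.
	fmt.Printf("MaxUint64 as float64: %g\n", float64(math.MaxUint64))

	// Hypothetical underflow: subtracting ~37 MB from zero wraps a
	// uint64 around to just below MaxUint64, reproducing the
	// magnitude reported above (1.8446744073672...e+19).
	var total, used uint64 = 0, 37_155_000
	fmt.Printf("underflowed value:    %g\n", float64(total-used))
}
```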