`deferring g.e. 0x... (pfn 0x...)` messages flooding all VMs
Qubes OS release
R4.1
Brief summary
After the recent Qubes updates (to dom0 and all TemplateVMs), I see a lot of `deferring g.e. 0x... (pfn 0x...)` messages in `dmesg`. Not sure if these messages relate to the fact that my Qubes system becomes unusable after a few (~4) hours. While everything works fine for a few hours after a restart, Qubes ultimately slows down to the point where it's no longer usable.
Steps to reproduce
- Update qubes-* packages in dom0 and a TemplateVM to the latest stable versions
- Restart the system
- Run `dmesg` in dom0 and any AppVM (see the sketch below)
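For example, a quick way to spot the flood (a sketch only; `work` is a placeholder AppVM name, and `sudo` inside the VM assumes the default passwordless-sudo Qubes setup):

```
# In dom0: count the messages in dom0's own kernel log
sudo dmesg | grep -c 'deferring g.e.'

# Still in dom0: count them inside a running AppVM via qrexec
qvm-run --pass-io work 'sudo dmesg | grep -c "deferring g.e."'
```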
Expected behavior
System runs as before, no Xen warnings flooding dmesg
Actual behavior
Multiple `deferring g.e. 0x... (pfn 0x...)` messages per second in dom0 and all AppVMs
Same, except no slowdown occurs, and I also saw the following flood of messages in an AppVM:
[130669.067576] xen:grant_table: g.e. 0x679a still pending
> Multiple `deferring g.e. 0x... (pfn 0x...)` messages per second in dom0 and all AppVMs
This looks like it is related to the recent frontend security fixes. I suspect the backend is not releasing grant references quickly enough.
Running `dmesg -w` in the VM, I see a ton of those messages whenever any of its windows is resized, opened, or closed. This also consumes more and more memory, and eventually the VM slows down to a crawl as it starts swapping.
Latest current-testing everywhere, but IIRC it's been this way since I switched to R4.1 a couple of weeks ago. (Back then I hadn't yet tried to really trigger it in the worst way by resizing windows.)
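For anyone wanting to reproduce that observation, a minimal sketch for watching the flood live (run inside the affected AppVM; `sudo` may or may not be required depending on `dmesg_restrict`):

```
# Follow the kernel log and show only the grant messages; then resize/open/close a window
sudo dmesg -w | grep --line-buffered 'deferring g.e.'
```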
> Running `dmesg -w` in the VM, I see a ton of those messages whenever any of its windows is resized, opened, or closed. [...]
This looks like a bug in the GUI daemon and/or the dom0 kernel: it needs to release the grant tables when it is done using them.
Can confirm this here on an AMD Renoir system (ThinkPad P14s Gen2). These are found in dom0 as well as in AppVMs. I am happy to provide further debug info if needed. The problem is that log auditing gets really annoying in Qubes due to this.
I can confirm this as well.
This message flooding is happening in the latest R4.0 also, in dom0 and AppVMs. Currently I'm not seeing a noticeable slowdown on any R4.0 or R4.1 machines, and the message rate seems to be lower in R4.0.
A small note: this also results in the journal occupying a lot of space. It shouldn't be a concern for AppVMs, but I just recovered several GBs in my dom0 by vacuuming the journal.
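For reference, a sketch of the vacuuming mentioned above (the size cap is an arbitrary example; `--vacuum-time` is an alternative):

```
# In dom0: shrink the persistent journal down to roughly 500 MB
sudo journalctl --vacuum-size=500M
```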
Just to add my voice that I'm experiencing this issue also, as laid out in #7456.
Automated announcement from builder-github
The component `linux-kernel-latest` (including package `kernel-latest-5.17.4-2.fc25.qubes`) has been pushed to the r4.0 testing repository for dom0.
To test this update, please install it with the following command:
sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing
Automated announcement from builder-github
The component `linux-kernel-latest` (including package `kernel-latest-5.17.4-2.fc32.qubes`) has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:
sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing
Automated announcement from builder-github
The component `linux-kernel-5-4` (including package `kernel-5.4.190-1.fc25.qubes`) has been pushed to the r4.0 testing repository for dom0.
To test this update, please install it with the following command:
sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing
Automated announcement from builder-github
The component `linux-kernel` (including package `kernel-5.10.112-1.fc32.qubes`) has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:
sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing
Automated announcement from builder-github
The component `linux-kernel-4-19` (including package `kernel-419-4.19.239-1.pvops.qubes`) has been pushed to the r4.0 testing repository for dom0.
To test this update, please install it with the following command:
sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing
Automated announcement from builder-github
The component `linux-kernel` (including package `kernel-5.10.112-1.fc32.qubes`) has been pushed to the r4.1 stable repository for dom0.
To install this update, please use the standard update command:
sudo qubes-dom0-update
Or update dom0 via Qubes Manager.
Automated announcement from builder-github
The component `linux-kernel-5-4` (including package `kernel-5.4.190-1.fc25.qubes`) has been pushed to the r4.0 stable repository for dom0.
To install this update, please use the standard update command:
sudo qubes-dom0-update
Or update dom0 via Qubes Manager.
Automated announcement from builder-github
The component `linux-kernel-4-19` (including package `kernel-419-4.19.245-1.pvops.qubes`) has been pushed to the r4.0 stable repository for dom0.
To install this update, please use the standard update command:
sudo qubes-dom0-update
Or update dom0 via Qubes Manager.
Automated announcement from builder-github
The component `linux-kernel-latest` (including package `kernel-latest-5.18.9-1.fc32.qubes`) has been pushed to the r4.1 stable repository for dom0.
To install this update, please use the standard update command:
sudo qubes-dom0-update
Or update dom0 via Qubes Manager.
I fear there may have been a regression somewhere, as today I re-experienced this issue on kernel 5.15.52 and was unable to `qvm-run` on any AppVM, even after restarting it, effectively locking up my system until a reboot. Errors noted included `deferring g.e. 0x...` as well as `gntshr: error: ioctl failed: No space left on device`.
Can you upgrade to a newer kernel and see if this fixes the problem?
Hi @DemiMarie,
To confirm reproducibility: on kernel 5.15.52 (for both dom0 and AppVMs), running `qvm-run {appvm} echo` causes `deferring g.e. 0x...` in dmesg for dom0. Also, resizing a window in said AppVM causes `deferring g.e. 0x...` messages in dmesg for that AppVM.
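For completeness, the dom0 side of that reproduction as a sketch (`myappvm` is a placeholder for any running AppVM):

```
# Trigger some qrexec traffic, then look for freshly logged grant messages in dom0
qvm-run myappvm echo
sudo dmesg | grep 'deferring g.e.' | tail -n 5
```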
I can confirm that using kernel 5.18.9 (for both dom0 and AppVMs) makes these error messages stop. However, resizing an AppVM window causes a trace in dmesg for dom0. Ultimately I would prefer to continue using `kernel` as opposed to `kernel-latest`.
The trace contains a reference to `WARNING: CPU: 0 PID: 5123 at drivers/xen/gntdev.c:399 __unmap_grant_pages_done+0xfe/0x110 [xen_gntdev]`.
I can provide a full trace, however I would want to PGP-encrypt it to you as it contains hardware identifiers. I can't find your PGP key in qubes-secpack though? Alternatively, I know you're on Matrix, so I can send it there.
Those `deferring` messages are kind of expected; only the `still pending` ones are worrying. It seems journald logs `printk(KERN_DEBUG ...)` messages just like any others, and it isn't easy for a normal user to distinguish them from actual errors. @DemiMarie can you change that to `pr_debug`, so it would be disabled by default? The `leaking` and `still pending` messages should still be logged by default.
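If the message is converted to `pr_debug` as suggested, it would be hidden by default but could still be turned on when needed via dynamic debug. A sketch, assuming the message lives in the grant-table reclaim path touched by the Qubes patch referenced later in this thread (the file name is an assumption):

```
# Re-enable pr_debug output for that file at runtime (requires CONFIG_DYNAMIC_DEBUG)
echo 'file drivers/xen/grant-table.c +p' | sudo tee /sys/kernel/debug/dynamic_debug/control
```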
As for the `WARNING`, that's fixed in a newer kernel already (there is https://github.com/QubesOS/updates-status/issues/3030 in the testing repo).
Thanks @marmarek, good to know.
Having a look at the logs from yesterday when the lockup happened, I also noticed the following; is this any concern?
xen:grant_table: xen/grant-table: max_grant_frames reached cur=2048 extra=1 limit=2048 gnttab_free_count=0 req_entries=1
If I recall correctly, a hotfix for this issue used to be to increase /sys/module/xen_gntalloc/parameters/limit. So I'm not sure if that log message from yesterday is a symptom of something?
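For reference, a sketch of that old hotfix (the new value is only an example, it is not persistent across reboots, and it assumes the parameter is writable on the running kernel):

```
# Check the current gntalloc grant limit
cat /sys/module/xen_gntalloc/parameters/limit

# Temporarily raise it
echo 1048576 | sudo tee /sys/module/xen_gntalloc/parameters/limit
```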
I currently don't see it in dom0 on 5.10.128-1, but a lot in domU on 5.15.57-1.
On Tue, Aug 30, 2022 at 08:02:43AM -0700, 3hhh wrote:
> I currently don't see it in dom0 on 5.10.128-1, but a lot in domU on 5.15.57-1.

Don't see it at all with 5.15.63-1.
> Those `deferring` messages are kind of expected; only the `still pending` ones are worrying.
@marmarek When you say they are kind of expected, why are they thrown in the first place? This is flooding kmsg to the point where it becomes unusable (e.g., for debugging). Is there anything we can do (e.g., a kernel cmdline option) to suppress them?
@ctr49: they should be suppressed by default (that they are not is a bug).
Right, and AFAICT it's caused by https://github.com/QubesOS/qubes-linux-kernel/blob/master/increase-reclaim-speed.patch#L124 (a bare `printk`, without `pr_debug`/`pr_warn`).
see https://github.com/QubesOS/qubes-linux-kernel/pull/682
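Until that fix reaches the installed kernel, one way to keep the journal readable is to filter the flood out at read time (a sketch; it does not stop the messages from being written):

```
# In dom0 or an AppVM: kernel messages from the current boot, minus the deferring flood
sudo journalctl -k -b | grep -v 'deferring g.e\.'
```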
> but a lot in domU on 5.15.57-1.
This is a rather old 5.15.x kernel. The bug was fixed in >=5.15.61 already. https://github.com/QubesOS/updates-status/issues/3037 was uploaded to the stable repo over 2 months ago.
I'm closing this now as resolved. Please let me know if you can still reproduce the issue with an up-to-date kernel.
@marmarek I'm still seeing this on the very latest 6.0.8-1.fc32.qubes from current-testing