qubes-issues icon indicating copy to clipboard operation
qubes-issues copied to clipboard

`deferring g.e. 0x... (pfn 0x...)` messages flooding all VMs

Open SaswatPadhi opened this issue 2 years ago • 25 comments

Qubes OS release

R4.1

Brief summary

After the recent Qubes updates (to dom0 and all templateVMs), I see a lot of deferring g.e. 0x... (pfn 0x...) messages in dmesg. Not sure if these message relate to the fact that my Qubes system becomes unusable after a few (~4) hours. While everything works fine for a few hours after restart, Qubes ultimately slows down to the point where it's no longer usable.

Steps to reproduce

  1. Update qubes-* packages in dom0 and a TemplateVM to latest stable versions
  2. Restart the system
  3. Run dmesg in dom0 and any AppVM

Expected behavior

System runs as before, no Xen warnings flooding dmesg

Actual behavior

Multiple deferring g.e. 0x... (pfn 0x...) messages per second in dom0 and all AppVMs

SaswatPadhi avatar Mar 18 '22 19:03 SaswatPadhi

Same, except no slowdown occurs, and I also saw the following flood of messages at AppVM. [130669.067576] xen:grant_table: g.e. 0x679a still pending

noskb avatar Mar 18 '22 22:03 noskb

Multiple deferring g.e. 0x... (pfn 0x...) messages per second in dom0 and all AppVMs

This looks like it is related to the recent frontend security fixes. I suspect the backend is not releasing grant references quickly enough.

DemiMarie avatar Mar 19 '22 14:03 DemiMarie

Running dmesg -w in the VM, I see a ton of those messages whenever any of its windows is resized, opened, or closed. This also consumes more and more memory, and eventually the VM slows down to a crawl as it starts swapping.

Latest current-testing everywhere, but IIRC it's been this way since I switched to R4.1 a couple of weeks ago. (Back then I hadn't yet tried to really trigger it in the worst way by resizing windows.)

rustybird avatar Apr 02 '22 17:04 rustybird

Running dmesg -w in the VM, I see a ton of those messages whenever any of its windows is resized, opened, or closed. This also consumes more and more memory, and eventually things slow down to a crawl due to the VM swapping.

Latest current-testing everywhere, but IIRC it's been this way since I switched to R4.1 a couple of weeks ago. (Back then I hadn't yet tried to really trigger it in the worst way by resizing windows.)

This looks like a bug in the GUI daemon and/or the dom0 kernel: it needs to release the grant tables when it is done using them.

DemiMarie avatar Apr 02 '22 18:04 DemiMarie

Can Confirm this here on and AMD Renoir System (Thinkpad P14s Gen2). These are found in dom0 aswell as in AppVMs. I am happy to provide further debug info if needed. Problem is that log auditing gets really annyoing in qubes due to this

kr4t0 avatar Apr 02 '22 20:04 kr4t0

I can confirm this as well.

3hhh avatar Apr 03 '22 17:04 3hhh

This message flooding is happening in latest R4.0 also, in dom0 and AppVMs. Currently I'm not seeing noticable slowdown on any R4.0 or R4.1 machines, and message rate seems to be lower in R4.0.

noptys avatar Apr 04 '22 22:04 noptys

A small note: this also results in the journal occupying a lot of space. Shouldn't be a concern for AppVMs, but I just recovered several GBs in my dom0 by vacuuming the journal

SaswatPadhi avatar Apr 06 '22 00:04 SaswatPadhi

Just to add my voice that I'm experiencing this issue also, as laid out in #7456.

Tonux599 avatar Apr 19 '22 22:04 Tonux599

Automated announcement from builder-github

The component linux-kernel-latest (including package kernel-latest-5.17.4-2.fc25.qubes) has been pushed to the r4.0 testing repository for dom0. To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

qubesos-bot avatar Apr 22 '22 05:04 qubesos-bot

Automated announcement from builder-github

The component linux-kernel-latest (including package kernel-latest-5.17.4-2.fc32.qubes) has been pushed to the r4.1 testing repository for dom0. To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

qubesos-bot avatar Apr 22 '22 05:04 qubesos-bot

Automated announcement from builder-github

The component linux-kernel-5-4 (including package kernel-5.4.190-1.fc25.qubes) has been pushed to the r4.0 testing repository for dom0. To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

qubesos-bot avatar Apr 22 '22 12:04 qubesos-bot

Automated announcement from builder-github

The component linux-kernel (including package kernel-5.10.112-1.fc32.qubes) has been pushed to the r4.1 testing repository for dom0. To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

qubesos-bot avatar Apr 22 '22 12:04 qubesos-bot

Automated announcement from builder-github

The component linux-kernel-4-19 (including package kernel-419-4.19.239-1.pvops.qubes) has been pushed to the r4.0 testing repository for dom0. To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

qubesos-bot avatar Apr 22 '22 13:04 qubesos-bot

Automated announcement from builder-github

The component linux-kernel (including package kernel-5.10.112-1.fc32.qubes) has been pushed to the r4.1 stable repository for dom0. To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

qubesos-bot avatar May 12 '22 23:05 qubesos-bot

Automated announcement from builder-github

The component linux-kernel-5-4 (including package kernel-5.4.190-1.fc25.qubes) has been pushed to the r4.0 stable repository for dom0. To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

qubesos-bot avatar May 19 '22 18:05 qubesos-bot

Automated announcement from builder-github

The component linux-kernel-4-19 (including package kernel-419-4.19.245-1.pvops.qubes) has been pushed to the r4.0 stable repository for dom0. To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

qubesos-bot avatar Jun 27 '22 10:06 qubesos-bot

Automated announcement from builder-github

The component linux-kernel-latest (including package kernel-latest-5.18.9-1.fc32.qubes) has been pushed to the r4.1 stable repository for dom0. To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update

qubesos-bot avatar Jul 19 '22 03:07 qubesos-bot

I fear there may have been a regression somewhere as today I reexperienced this issue. On kernel 5.15.52 and was unable to qvm-run on any AppVM even if restarting it. Effectively locking up my system until a reboot.

Errors noted included the deferring g.e. 0x... as well as gntshr: error: ioctl failed: No space left on device.

Tonux599 avatar Aug 13 '22 17:08 Tonux599

I fear there may have been a regression somewhere as today I reexperienced this issue. On kernel 5.15.52 and was unable to qvm-run on any AppVM even if restarting it. Effectively locking up my system until a reboot.

Errors noted included the deferring g.e. 0x... as well as gntshr: error: ioctl failed: No space left on device.

Can you upgrade to a newer kernel and see if this fixes the problem?

DemiMarie avatar Aug 13 '22 23:08 DemiMarie

Hi @DemiMarie,

To confirm reproducibility, on kernel 5.15.52 (for both dom0 and AppVM's) running qvm-run {appvm} echo causes deferring g.e. 0x... in dmesg for dom0. Also resizing a window in said AppVM causes deferring g.e. 0x... messages in dmesg for that AppVM.

I can confirm that using kernel 5.18.9 (for both dom0 and AppVM's) causes these error messages to stop. However, resizing an AppVM window causes a trace in dmesg for dom0. Ultimately I would prefer to continue using kernel as opposed to kernel-latest.

The trace contains reference to WARNING: CPU: 0 PID: 5123 at drivers/xen/gntdev.c:399 __unmap_grant_pages_done+0xfe/0x110 [xen_gntdev]

I can provide a full trace, however I would want to PGP encrypt it to you as it contains hardware identifiers. I can't find your PGP key in qubes-secpack though? Alternatively I know your on matrix so I can send it there.

Tonux599 avatar Aug 14 '22 11:08 Tonux599

Those deferring messages are kind of expected, only still pending are worrying. It seems journald logs printk(KERN_DEBUG ...) messages just like any other, and it isn't obvious to distinguish them from actual errors by normal user. @DemiMarie can you change that to pr_debug, so it would be disabled by default? The leaking and still pending should be still logged by default.

As for WARNING, that's fixed in newer kernel already (there is https://github.com/QubesOS/updates-status/issues/3030 in testing repo).

marmarek avatar Aug 14 '22 15:08 marmarek

Thanks @marmarek good to know.

Having a look at the logs from yesterday when the lockup happened and noticed also the following, is this any concern?

xen:grant_table: xen/grant-table: max_grant_frames reached cur=2048 extra=1 limit=2048 gnttab_free_count=0 req_entries=1

If I recall correctly a hotfix for this issue used to be to increase /sys/module/xen_gntalloc/parameters/limit. So not sure if that log message from yesterday is a symptom of something?

Tonux599 avatar Aug 14 '22 15:08 Tonux599

I currently don't see it in dom0 on 5.10.128-1, but a lot in domU on 5.15.57-1.

3hhh avatar Aug 30 '22 15:08 3hhh

On Tue, Aug 30, 2022 at 08:02:43AM -0700, 3hhh wrote:

I currently don't see it in dom0 on 5.10.128-1, but a lot in domU on 5.15.57-1.

Dont see it at all with 5.15.63-1

unman avatar Aug 30 '22 15:08 unman

Those deferring messages are kind of expected, only still pending are worrying.

@marmarek When you say they are kind of expected, why are they thrown in the first place? This is flooding kmsg up to a point where it becomes unusable (i.e. for debugging). Is there anything we can do (i.e. kernel cmdline option) to suppress them?

ctr49 avatar Nov 19 '22 16:11 ctr49

@ctr49: they should be suppressed by default (that they are not is a bug).

DemiMarie avatar Nov 19 '22 20:11 DemiMarie

Right, and afaict it's caused by https://github.com/QubesOS/qubes-linux-kernel/blob/master/increase-reclaim-speed.patch#L124 (a bare printk, without pr_debug/pr_warn)

see https://github.com/QubesOS/qubes-linux-kernel/pull/682

ctr49 avatar Nov 20 '22 12:11 ctr49

but a lot in domU on 5.15.57-1.

This is rather old 5.15.x kernel. The bug was fixed in >=5.15.61 already. https://github.com/QubesOS/updates-status/issues/3037 was uploaded to the stable repo over 2 months ago.

I'm closing this now as resolved. Please let me know if you still can reproduce issue with up to date kernel.

marmarek avatar Nov 20 '22 14:11 marmarek

@marmarek I'm still seeing this on the very latest 6.0.8-1.fc32.qubes from current-testing

ctr49 avatar Nov 20 '22 17:11 ctr49