Qubes *without* a GUI qube also has issues with granted pages and periodic crashes
(The title references the similar issue #7631.)
Qubes OS release
Qubes 4.1.1
Brief summary
When I open new windows, or resize existing ones (rarer, as I use i3wm), I sometimes get errors in my syslogs about Xorg being tainted and about granted pages.
I have noticed this issue while watching videos in Firefox and mpv, but the errors also appear when simply spawning new terminal windows for the same AppVM using i3's mod+return hotkey.
I have also been experiencing complete system freezes and reboots, which unfortunately have not produced any log results, so I do not know whether the video playback is responsible or whether it is solely the window spawning.
These issues started only recently, after some dom0 updates.
Steps to reproduce
- Open a new AppVM GUI window program under i3wm
- Create new GUI terminal windows for that same AppVM.
- In dom0, run
journalctl -S yesterday
and scroll towards the end.
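For quicker triage, a filter like the following in dom0 narrows the journal to the relevant kernel messages (just a sketch; the pattern matches the errors quoted further below):
# In dom0: kernel messages since yesterday, filtered to the errors in question
journalctl -k -S yesterday | grep -iE 'bad page map|gntdev'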
Expected behavior
No errors should appear in the logs, and no crashes should occur.
Actual behavior
Errors appear in the logs, although not every time a new window is spawned.
Complete system crashes occur from time to time.
Logs
GPU
Intel integrated GPU, not PCI-forwarded and presumably unused.
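For completeness, the GPU's driver binding and any qube assignment can be checked from dom0 like this (lspci and qvm-pci are standard tools on 4.1):
# In dom0: which kernel driver is bound to the integrated GPU
lspci -k | grep -iA 3 'vga\|display'
# In dom0: PCI devices known to Qubes and the qubes they are assigned to
qvm-pci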
Hi, the issues discussed in #7539 and #7631 do not involve a dedicated GUI qube/VM either. I reproduce those issues with a "normal" Qubes OS v4.1 installation that uses dom0 as the GUI host.
I was going to suggest marking this issue a duplicate of #7631, but I then realized that I have not encountered the kind of backtraces in the logs attached by @auroraanon38. Quoting from the attachment linked from this bug's (i.e., #7664) description:
BUG: Bad page map in process Xorg pte:8000000472962365 pmd:102338067
page:000000008f004d64 refcount:1025 mapcount:-1 mapping:0000000000000000 index:0x0 pfn:0xf5a1d
flags: 0x27ffffc0003408(dirty|owner_priv_1|reserved|private|node=0|zone=4|lastcpupid=0x1fffff)
raw: 0027ffffc0003408 ffff8881735fca80 ffffea0003d68780 0000000000000000
raw: 0000000000000000 0001954700000007 00000401fffffffe 0000000000000000
page dumped because: bad pte
addr:00007204e3867000 vm_flags:1c0600f9 anon_vma:0000000000000000 mapping:ffff888100f8b468 index:2e9e4
file:gntdev fault:0x0 mmap:gntdev_mmap [xen_gntdev] readpage:0x0
CPU: 5 PID: 3263 Comm: Xorg Tainted: G B W 5.15.52-1.fc32.qubes.x86_64 #1
Hardware name: [Redacted]
Call Trace:
<TASK>
dump_stack_lvl+0x46/0x5a
print_bad_pte.cold+0x6a/0xc5
zap_pte_range+0x388/0x7d0
? __raw_callee_save_xen_pmd_val+0x11/0x1e
zap_pmd_range.isra.0+0x1cc/0x2d0
zap_pud_range.isra.0+0xaa/0x1e0
unmap_page_range+0x17a/0x210
unmap_vmas+0x83/0x100
unmap_region+0xbd/0x120
__do_munmap+0x1f5/0x4e0
__vm_munmap+0x75/0x120
__x64_sys_munmap+0x28/0x30
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
These backtraces are not similar to the ones discussed in #7539 and #7631. Hmm...
I can confirm the above, I'm also affected by this.
I can also confirm this.
My journalctl is scattered with these bad page map errors from Xorg, and this highly correlates with dom0 freezes / crashes. Most crashes occur when opening a new i3wm window, but throughout a given day I can have anywhere between 1 and 10 bad page map events in my logs and generally no more than 1 dom0 crash, so this does not always take down dom0 apparently.
I can trace back through journalctl when this issue first started appearing, and it lines up with a dom0 upgrade in my dnf history just a few hours earlier. At first I assumed it was related to the kernel upgrade that day, and I have been trying various downgrades since, but without much luck. However, xen was also updated, and based on this issue and various related xen_gntdev issues I now also suspect it is related to xen.
These are the relevant xen package upgrades before these errors started happening:
...
Upgrade xen-2001:4.14.5-6.fc32.x86_64 @qubes-dom0-cached
Upgraded xen-2001:4.14.5-5.fc32.x86_64 @@System
Upgrade xen-hypervisor-2001:4.14.5-6.fc32.x86_64 @qubes-dom0-cached
Upgraded xen-hypervisor-2001:4.14.5-5.fc32.x86_64 @@System
Upgrade xen-libs-2001:4.14.5-6.fc32.x86_64 @qubes-dom0-cached
Upgraded xen-libs-2001:4.14.5-5.fc32.x86_64 @@System
Upgrade xen-licenses-2001:4.14.5-6.fc32.x86_64 @qubes-dom0-cached
Upgraded xen-licenses-2001:4.14.5-5.fc32.x86_64 @@System
Upgrade xen-runtime-2001:4.14.5-6.fc32.x86_64 @qubes-dom0-cached
Upgraded xen-runtime-2001:4.14.5-5.fc32.x86_64 @@System
...
I have since downgraded to 4.14.5-5 again and will report back if this has helped and if there was indeed a regression in the upgrade.
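For anyone wanting to try the same, roughly how a downgrade like that can be done in dom0 (a sketch; dnf downgrade only works if the older builds are still available in the local cache or an enabled repository):
# In dom0: locate the transaction that pulled in the xen 4.14.5-6 build
sudo dnf history
# Roll the xen packages back to the previous build
sudo dnf downgrade xen xen-hypervisor xen-libs xen-licenses xen-runtime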
Note that I also found this thread (https://forum.qubes-os.org/t/qubesos-freeze-crash-and-reboots/12851) on the forum, with lots of other users who started noticing similar dom0 freezes around the same time as me. Up until this issue my Qubes installation had been extremely stable, and I can personally trace it back exactly to the dom0 update above.
Hopefully https://lore.kernel.org/xen-devel/[email protected]/t/#u will fix this. Marek, should Qubes take that?
Hey all, while downgrading xen* appeared to have solved the issue initially, I did spot another set of bad page map errors after some more extended use. I have since downgraded everything back to the state before the upgrade where I first started being impacted by this issue, which makes it quite odd that the issue is still here, as I don't see how else it could have appeared this suddenly.
Perhaps there was a backported patch that introduced this issue at one point?
Fingers crossed on the upcoming patch resolving the issue.
Hopefully https://lore.kernel.org/xen-devel/[email protected]/t/#u will fix this. Marek, should Qubes take that?
These patches were merged into Linux before 6.1-rc1 and backported to 5.19.17 and 6.0.3.
They are also in Greg’s patch queues for 5.15, but did not apply to older versions :(
5.15.75 is out, which has these fixes.
@DemiMarie, I just updated to 6.0.7-1 and noticed a lot of new warnings triggered at https://github.com/torvalds/linux/blob/f5020a08b2b371162a4a16ef97694cac3980397e/drivers/xen/gntdev.c#L406. I saw your patch trying to fix this problem (https://github.com/torvalds/linux/commit/166d3863231667c4f64dee72b77d1102cdfad11f), but unfortunately it didn't catch all cases. Seems to be related to this issue.
@mtdcr Whoops, sorry. @m-v-b do you have suggestions? The only one I can think of is to use WARN_ON_ONCE, but that is not a fix.
Hi @DemiMarie and @mtdcr,
I installed the kernel-latest-6.0.7-1.fc32.qubes.x86_64 and kernel-latest-qubes-vm-6.0.7-1.fc32.qubes.x86_64 packages in dom0 and started using them to see if I would encounter any warnings triggered by the line mentioned by @mtdcr, but in my (admittedly short) use of the 6.0.7 kernel, I did not encounter any gntdev driver warnings.
@mtdcr, are there specific actions that reliably reproduce the warnings for you?
I'm unsure about the reliability aspect, but I reproduced it without causing much load. The warning appeared when I started my 10th VM (running simultaneously) after a reboot.
Just a wild guess: My machine uses 40 GiB of RAM. Eight VMs and dom0 each report a memory limit of around 4 GB. The other two use less than 0.5 GB each. Memory balancing is enabled for most VMs, though. I don't know how much memory Xen reserves for itself. Some additional memory may be used by the Intel GPU. The sum of it is close to the physical memory size. So maybe try to start enough VMs to reach your physical memory limit.
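For what it's worth, the per-domain memory and the hypervisor's remaining free memory can be checked from dom0 like this (xl list and xl info are standard Xen tools):
# In dom0: memory currently assigned to each running domain (in MiB)
sudo xl list
# In dom0: total and free memory as seen by the hypervisor
sudo xl info | grep -E 'total_memory|free_memory'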
Any chance this could be due to fragmentation, perhaps interaction between grants and ballooning?
I wonder what part of the kernel is asking for physically-contiguous memory, as opposed to virtually-contiguous.
I'm a little puzzled by the current situation with this bug. I can still reproduce it with kernel-latest 6.0.8-1. Is the suggestion that we need that, plus a Xen patch that's not yet available? Or that even those together will not yet solve this problem?
I took a look at issue #7785 after reading @dmoerner's comment today, and I saw that the i3 window manager was in use. This might be the reason why I have not been able to reproduce the issue on my system with the kernel-latest packages (which I have been using since my last comment in this issue).
@mtdcr, are you also using a window manager other than xfwm4 (i.e., XFCE's window manager)?
In general, it might help to have the dmesg snippets from the affected VMs for triage. Log snippets acquired from dom0 with "xl dmesg" might also help, in case there are warning or error messages.
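Roughly how those snippets can be gathered, as a sketch (<vmname> is a placeholder for an affected qube):
# In dom0: hypervisor log
sudo xl dmesg | tail -n 100
# In dom0: recent kernel messages from an affected qube
qvm-run -p <vmname> 'sudo dmesg | tail -n 100'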
I'm unsure about the reliability aspect, but I reproduced it without causing much load. The warning appeared when I started my 10th VM (running simultaneously) after a reboot.
Just a wild guess: My machine uses 40 GiB of RAM. Eight VMs and dom0 each report a memory limit of around 4 GB. The other two use less than 0.5 GB each. Memory balancing is enabled for most VMs, though. I don't know how much memory Xen reserves for itself. Some additional memory may be used by the Intel GPU. The sum of it is close to the physical memory size. So maybe try to start enough VMs to reach your physical memory limit.
In the meantime, I've observed these warnings with only 5 VMs after a reboot, so my guess was wrong.
@mtdcr, are you also using a window manager other than xfwm4 (i.e., XFCE's window manager)?
No, I'm using Qubes' default WM.
In general, it might help to have the dmesg snippets from the affected VMs for triage. Log snippets acquired from dom0 with "xl dmesg" might also help, in case there are warning or error messages.
The warning from https://github.com/torvalds/linux/blob/f5020a08b2b371162a4a16ef97694cac3980397e/drivers/xen/gntdev.c#L406 appears in dom0's kernel messages. xl dmesg doesn't contain any warnings or errors.
Since the original issue was triggered by Xorg, likely with Intel GPUs, I looked up which driver I'm using and it's not the default one. Maybe this helps reproducing the issue.
Section "Device"
Identifier "Intel Graphics"
Driver "intel"
Option "TearFree" "true"
EndSection
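(For anyone comparing their own setup: the DDX driver that Xorg actually loaded can be read from its log in dom0; the path below is the usual location but may differ.)
# In dom0: list the driver modules Xorg loaded at startup
grep -iE 'loadmodule|_drv\.so' /var/log/Xorg.0.log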
mtdcr writes:
Since the original issue was triggered by Xorg, likely with Intel GPUs, I looked up which driver I'm using and it's not the default one. Maybe this helps reproducing the issue.
Section "Device" Identifier "Intel Graphics" Driver "intel" Option "TearFree" "true" EndSection
That aspect is missing from my original issue filing (I didn't think about it).
I have also set the driver explicitly in the same way, though without the Option line.
I had to set it originally because there was heavy visual artifacting, and this was recommended to me as the solution: https://github.com/QubesOS/qubes-issues/issues/4782#issuecomment-975920223 .
Section "Device" Identifier "Intel Graphics" Driver "intel" Option "TearFree" "true" EndSection
Removing this section made the warning disappear. I haven't noticed any artifacts since then.
mtdcr writes:
Section "Device" Identifier "Intel Graphics" Driver "intel" Option "TearFree" "true" EndSection
Removing this section made the warning disappear. I haven't noticed any artifacts since then.
I just tested, but unfortunately I do not get the same outcome. If I comment out the configuration & restart, the artifacting comes back.
I'll have to make another test later despite the artifacts to see if the grant issues do still happen when not using the intel driver.
I've since also removed the intel driver (by removing the section mentioned above) and gone back to the fbdev driver. I was already running the picom compositor anyway, and I can confirm that the fbdev driver with a compositing manager only produces very brief artifacts during startup, when I assume picom has not yet started.
Since disabling the intel driver, the "Bad page map" errors have disappeared, and Qubes is extremely stable once again.
I just tested, but unfortunately I do not get the same outcome. If I comment out the configuration & restart, the artifacting comes back.
I'll have to make another test later despite the artifacts to see if the grant issues do still happen when not using the intel driver.
Do you have a compositing manager (such as picom) installed? They generally fix all artifacting and tearing issues.
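If it helps anyone checking their dom0: a quick way to see whether picom is already running (its -b flag simply starts it in the background):
# In dom0: is picom already running?
pgrep -a picom
# If installed but not running, start it in the background
picom -b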
I'm also affected by this issue, and it is a very serious one: Qubes hard-crashes (kernel panic?) in the middle of work. The issue began on an older kernel, and I have since updated to 6.1. I'm using Qubes 4.1 and tried updating to the testing repository, but updating does not resolve the issue.
Section "OutputClass"
Identifier "intel"
MatchDriver "i915"
Driver "intel"
Option "Accelethod" "sna"
Option "TearFree" "true"
Option "DRI" "3"
EndSection
Linux dom0 6.1.7-1.fc32.qubes.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jan 20 20:56:29 CET 2023 x86_64 x86_64 x86_64 GNU/Linux
Feb 21 21:21:17 dom0 kernel: BUG: Bad page map in process Xorg pte:80000007b42b8365 pmd:103359067
Feb 21 21:21:17 dom0 kernel: page:000000006ad760d6 refcount:1 mapcount:-1 mapping:0000000000000000 index:0x0 pfn:0x192c93
Feb 21 21:21:17 dom0 kernel: flags: 0x27ffffc000340a(referenced|dirty|owner_priv_1|reserved|private|node=0|zone=4|lastcpupid=0x1fffff)
Feb 21 21:21:17 dom0 kernel: raw: 0027ffffc000340a ffff888109771480 ffffea00064b2500 0000000000000000
Feb 21 21:21:17 dom0 kernel: raw: 0000000000000000 000030420000000a 00000001fffffffe 0000000000000000
Feb 21 21:21:17 dom0 kernel: page dumped because: bad pte
Feb 21 21:21:17 dom0 kernel: addr:0000763977266000 vm_flags:1c0600f9 anon_vma:0000000000000000 mapping:ffff888108fcc208 index:7
Feb 21 21:21:17 dom0 kernel: file:gntdev fault:0x0 mmap:gntdev_mmap [xen_gntdev] read_folio:0x0
Feb 21 21:21:17 dom0 kernel: CPU: 3 PID: 9310 Comm: Xorg Tainted: G B W 6.1.7-1.fc32.qubes.x86_64 #1
Feb 21 21:21:17 dom0 kernel: Hardware name: [Redacted]
Feb 21 21:21:17 dom0 kernel: Call Trace:
Feb 21 21:21:17 dom0 kernel: <TASK>
Feb 21 21:21:17 dom0 kernel: dump_stack_lvl+0x45/0x5e
Feb 21 21:21:17 dom0 kernel: print_bad_pte.cold+0x61/0xb9
Feb 21 21:21:17 dom0 kernel: zap_pte_range+0x556/0xa40
Feb 21 21:21:17 dom0 kernel: zap_pmd_range.isra.0+0x1b9/0x2f0
Feb 21 21:21:17 dom0 kernel: zap_pud_range.isra.0+0xb0/0x200
Feb 21 21:21:17 dom0 kernel: unmap_page_range+0x1d2/0x320
Feb 21 21:21:17 dom0 kernel: unmap_vmas+0xea/0x180
Feb 21 21:21:17 dom0 kernel: unmap_region+0xb9/0x120
Feb 21 21:21:17 dom0 kernel: do_mas_align_munmap+0x332/0x4b0
Feb 21 21:21:17 dom0 kernel: do_mas_munmap+0xd9/0x130
Feb 21 21:21:17 dom0 kernel: __vm_munmap+0xb8/0x170
Feb 21 21:21:17 dom0 kernel: __x64_sys_munmap+0x17/0x20
Feb 21 21:21:17 dom0 kernel: do_syscall_64+0x59/0x90
Feb 21 21:21:17 dom0 kernel: ? syscall_exit_to_user_mode+0x17/0x40
Feb 21 21:21:17 dom0 kernel: ? do_syscall_64+0x69/0x90
Feb 21 21:21:17 dom0 kernel: ? do_syscall_64+0x69/0x90
Regarding the Xorg configuration: I fixed the visual artifacts with that section, as presented in https://github.com/QubesOS/qubes-issues/issues/4782#issuecomment-1064478081 . That's why I have this Xorg configuration.
@blackandred can you go back to modesetting and see if you still have problems?
@DemiMarie I commented out the Xorg settings and I'm watching the systemd logs in dom0. Do you mean switching to modesetting, or something else?
@DemiMarie I commented out the Xorg settings and I'm watching the systemd logs in dom0. Do you mean switching to modesetting, or something else?
What if you explicitly enable the modesetting driver?
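(For reference, a minimal explicit modesetting section could look roughly like this; the drop-in path and Identifier are only examples:)
# In dom0: force the modesetting driver via a minimal xorg.conf.d drop-in
sudo tee /etc/X11/xorg.conf.d/20-modesetting.conf <<'EOF'
Section "Device"
    Identifier "Intel Graphics"
    Driver "modesetting"
EndSection
EOF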
Yesterday morning I commented out the Xorg config related to the Intel driver. I don't have an explicit modesetting configuration yet; I'm testing a blank config for now. At this moment, after a day, I do not see any issues, just two possibly unrelated stack traces about an IRQ and systemd-sleep, while running at full capacity (11+ qubes, Xen memory warning in the dom0 journal). I will keep watching for a few days and let you know, because the issue was happening randomly: sometimes after an hour, sometimes after a few days.