Qubes *without* a GUI qube also has issues with granted pages and periodic crashes

Open · auroraanon38 opened this issue

The title references the similar issue #7631.

Qubes OS release

Qubes 4.1.1

Brief summary

When I open new windows, or resize existing ones (rarer, since I use i3wm), I sometimes get errors in my syslogs regarding Xorg tainting & granted pages.

I have noticed this issue while watching videos in Firefox and mpv, but the errors also appear when simply spawning new terminal windows for the same AppVM using i3's mod+return hotkey.

I have also been experiencing complete system freezes & reboots, which unfortunately have left nothing in the logs, so I do not know whether the video playback is responsible or whether it is solely the window spawning.

These issues started only recently, after some dom0 updates.

Steps to reproduce

  • Open a GUI program from an AppVM under i3wm.
  • Create new GUI terminal windows for that same AppVM.
  • In dom0, run journalctl -S yesterday and scroll towards the end (or filter for the relevant errors, as sketched below).
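
A quick way to do that filtering (a sketch; the exact message text can vary between kernel versions):

    # In a dom0 terminal: recent kernel messages, filtered for the
    # "Bad page map" / gntdev errors discussed in this issue
    journalctl -S yesterday -k | grep -iE 'bad page map|gntdev'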

Expected behavior

No errors should appear in the logs, and no crashes should occur.

Actual behavior

Errors appear in the logs, although not every time a new window is spawned.

Complete system crashes occur from time to time.

Logs

xorg-grant-errors.log

GPU

Intel integrated GPU, not PCI-forwarded and presumably unused.

auroraanon38 avatar Jul 30 '22 03:07 auroraanon38

Hi, the issues discussed in #7539 and #7631 do not involve a dedicated GUI qube/VM either. I reproduce those issues with a "normal" Qubes OS v4.1 installation that uses dom0 as the GUI host.

I was going to suggest marking this issue as a duplicate of #7631, but then I realized that I have not encountered the kind of backtraces present in the logs attached by @auroraanon38. Quoting from the attachment linked from this bug's (i.e., #7664) description:

BUG: Bad page map in process Xorg  pte:8000000472962365 pmd:102338067
page:000000008f004d64 refcount:1025 mapcount:-1 mapping:0000000000000000 index:0x0 pfn:0xf5a1d
flags: 0x27ffffc0003408(dirty|owner_priv_1|reserved|private|node=0|zone=4|lastcpupid=0x1fffff)
raw: 0027ffffc0003408 ffff8881735fca80 ffffea0003d68780 0000000000000000
raw: 0000000000000000 0001954700000007 00000401fffffffe 0000000000000000
page dumped because: bad pte
addr:00007204e3867000 vm_flags:1c0600f9 anon_vma:0000000000000000 mapping:ffff888100f8b468 index:2e9e4
file:gntdev fault:0x0 mmap:gntdev_mmap [xen_gntdev] readpage:0x0
CPU: 5 PID: 3263 Comm: Xorg Tainted: G    B   W         5.15.52-1.fc32.qubes.x86_64 #1
Hardware name: [Redacted]
Call Trace:
 <TASK>
 dump_stack_lvl+0x46/0x5a
 print_bad_pte.cold+0x6a/0xc5
 zap_pte_range+0x388/0x7d0
 ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 zap_pmd_range.isra.0+0x1cc/0x2d0
 zap_pud_range.isra.0+0xaa/0x1e0
 unmap_page_range+0x17a/0x210
 unmap_vmas+0x83/0x100
 unmap_region+0xbd/0x120
 __do_munmap+0x1f5/0x4e0
 __vm_munmap+0x75/0x120
 __x64_sys_munmap+0x28/0x30
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae

These backtraces are not similar to the ones discussed in #7539 and #7631. Hmm...

m-v-b avatar Aug 08 '22 07:08 m-v-b

I can confirm the above; I'm also affected by this.

TheOne-Z avatar Aug 17 '22 13:08 TheOne-Z

I can also confirm this.

My journalctl is scattered with these bad page map errors from Xorg, and they correlate strongly with dom0 freezes / crashes. Most crashes occur when opening a new i3wm window, but over a given day I can see anywhere between 1 and 10 bad page map events in my logs and generally no more than 1 dom0 crash, so apparently this does not always take down dom0.

I can trace back through journalctl when this issue first started appearing, and it lines up with a dom0 upgrade in my dnf history just a few hours earlier. At first I assumed this was related to the kernel upgrade that day, and I have been trying various downgrades since, but without much luck. However, Xen was also updated, and based on this issue and various related xen_gntdev issues I now also suspect Xen.

These are the relevant xen package upgrades before these errors started happening:

    ...
    Upgrade  xen-2001:4.14.5-6.fc32.x86_64                    @qubes-dom0-cached
    Upgraded xen-2001:4.14.5-5.fc32.x86_64                    @@System
    Upgrade  xen-hypervisor-2001:4.14.5-6.fc32.x86_64         @qubes-dom0-cached
    Upgraded xen-hypervisor-2001:4.14.5-5.fc32.x86_64         @@System
    Upgrade  xen-libs-2001:4.14.5-6.fc32.x86_64               @qubes-dom0-cached
    Upgraded xen-libs-2001:4.14.5-5.fc32.x86_64               @@System
    Upgrade  xen-licenses-2001:4.14.5-6.fc32.x86_64           @qubes-dom0-cached
    Upgraded xen-licenses-2001:4.14.5-5.fc32.x86_64           @@System
    Upgrade  xen-runtime-2001:4.14.5-6.fc32.x86_64            @qubes-dom0-cached
    Upgraded xen-runtime-2001:4.14.5-5.fc32.x86_64            @@System
    ...

I have since downgraded to 4.14.5-5 again and will report back on whether this has helped and whether there was indeed a regression in the upgrade.
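
For reference, the downgrade can be done in dom0 roughly like this (a sketch: it assumes the 4.14.5-5 packages are still available locally, and the hypervisor downgrade only takes effect after a reboot):

    # In dom0: roll the xen packages back to the previous build
    sudo dnf downgrade xen xen-hypervisor xen-libs xen-licenses xen-runtime
    # reboot so the downgraded hypervisor is actually booted
    sudo reboot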

Note that I also found this thread (https://forum.qubes-os.org/t/qubesos-freeze-crash-and-reboots/12851) on the forum, with lots of other users who started noticing similar dom0 freezes around the same time as I did. Up until this issue my Qubes installation had been extremely stable, and I can personally trace the problem back exactly to the dom0 update above.

geo-mathijs avatar Sep 16 '22 14:09 geo-mathijs

Hopefully https://lore.kernel.org/xen-devel/[email protected]/t/#u will fix this. Marek, should Qubes take that?

DemiMarie avatar Sep 16 '22 15:09 DemiMarie

Hey all, while downgrading xen* appeared to have solved the issue initially, I did spot another set of bad page map errors after some more extended use. I have since downgraded everything to the state before the upgrade at which I first started being impacted, which makes it quite odd that the issue is still here, as I don't see how else it could have appeared this suddenly.

Perhaps there was a backported patch that introduced this issue at one point?

Fingers crossed on the upcoming patch resolving the issue.

geo-mathijs avatar Sep 18 '22 15:09 geo-mathijs

Hopefully https://lore.kernel.org/xen-devel/[email protected]/t/#u will fix this. Marek, should Qubes take that?

These patches were merged into Linux before 6.1-rc1 and backported to 5.19.17 and 6.0.3.

mtdcr avatar Oct 25 '22 11:10 mtdcr

Hopefully https://lore.kernel.org/xen-devel/[email protected]/t/#u will fix this. Marek, should Qubes take that?

These patches were merged into Linux before 6.1-rc1 and backported to 5.19.17 and 6.0.3.

They are also in Greg’s patch queues for 5.15, but did not apply to older versions :(

DemiMarie avatar Oct 25 '22 18:10 DemiMarie

5.19.75 is out, which has these fixes.
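
To check whether a dom0 already runs a kernel containing the fixes, and to pull a newer one, something like this should work (a sketch; the kernel-latest packages mentioned later in this thread track more recent mainline kernels than the default dom0 kernel):

    # In dom0: show the currently running kernel version
    uname -r
    # Install/update the kernel-latest packages
    sudo qubes-dom0-update kernel-latest kernel-latest-qubes-vm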

DemiMarie avatar Oct 27 '22 16:10 DemiMarie

@DemiMarie, I just updated to 6.0.7-1 and noticed a lot of new warnings triggered at https://github.com/torvalds/linux/blob/f5020a08b2b371162a4a16ef97694cac3980397e/drivers/xen/gntdev.c#L406. I saw your patch trying to fix this problem (https://github.com/torvalds/linux/commit/166d3863231667c4f64dee72b77d1102cdfad11f), but unfortunately it didn't catch all cases. Seems to be related to this issue.

mtdcr avatar Nov 11 '22 22:11 mtdcr

@mtdcr Whoops, sorry. @m-v-b do you have suggestions? The only one I can think of is to use WARN_ON_ONCE, but that is not a fix.

DemiMarie avatar Nov 11 '22 23:11 DemiMarie

Hi @DemiMarie and @mtdcr,

I installed the kernel-latest-6.0.7-1.fc32.qubes.x86_64 and kernel-latest-qubes-vm-6.0.7-1.fc32.qubes.x86_64 packages in dom0 and started using them to see if I would encounter any warnings triggered by the line mentioned by @mtdcr, but in my (admittedly short) use of the 6.0.7 kernel, I did not encounter any gntdev driver warnings.

@mtdcr, are there specific actions that reliably reproduce the warnings for you?

m-v-b avatar Nov 12 '22 00:11 m-v-b

I'm unsure about the reliability aspect, but I reproduced it without causing much load. The warning appeared when I started my 10th VM (running simultaneously) after a reboot.

Just a wild guess: My machine uses 40 GiB of RAM. Eight VMs and dom0 each report a memory limit of around 4 GB. The other two use less than 0.5 GB each. Memory balancing is enabled for most VMs, though. I don't know how much memory Xen reserves for itself. Some additional memory may be used by the Intel GPU. The sum of it is close to the physical memory size. So maybe try to start enough VMs to reach your physical memory limit.
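
A rough way to test that guess is to start qubes one by one while watching free Xen memory and the kernel log (a sketch; "work" stands in for any existing qube name):

    # In dom0:
    xl info | grep free_memory        # memory still free from Xen's view
    qvm-start work                    # repeat with further qubes
    journalctl -kf | grep -i gntdev   # follow kernel messages for the warning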

mtdcr avatar Nov 12 '22 22:11 mtdcr

Any chance this could be due to fragmentation, perhaps interaction between grants and ballooning?

B

brendanhoar avatar Nov 12 '22 23:11 brendanhoar

I wonder what part of the kernel is asking for physically-contiguous memory, as opposed to virtually-contiguous.

DemiMarie avatar Nov 13 '22 02:11 DemiMarie

I'm a little puzzled by the current situation with this bug. I can still reproduce it with kernel-latest 6.0.8-1. Is the suggestion that we need that, plus a Xen patch that's not yet available? Or that even those together will not yet solve this problem?

dmoerner avatar Nov 18 '22 20:11 dmoerner

I took a look at issue #7785 after reading @dmoerner's comment today, and I saw that the i3 window manager was in use. This might be the reason why I have not been able to reproduce the issue on my system with the kernel-latest packages (which I have been using since my last comment in this issue).

@mtdcr, are you also using a window manager other than xfwm4 (i.e., XFCE's window manager)?


In general, it might help to have the dmesg snippets from the affected VMs for triage. Log snippets acquired from dom0 with "xl dmesg" might also help, in case there are warning or error messages.
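
Concretely, collecting those logs could look like this (a sketch):

    # Inside an affected VM:
    dmesg | grep -iE 'warn|error|gntdev'

    # In dom0, for the hypervisor's own log:
    sudo xl dmesg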

m-v-b avatar Nov 20 '22 13:11 m-v-b

I'm unsure about the reliability aspect, but I reproduced it without causing much load. The warning appeared when I started my 10th VM (running simultaneously) after a reboot.

Just a wild guess: My machine uses 40 GiB of RAM. Eight VMs and dom0 each report a memory limit of around 4 GB. The other two use less than 0.5 GB each. Memory balancing is enabled for most VMs, though. I don't know how much memory Xen reserves for itself. Some additional memory may be used by the Intel GPU. The sum of it is close to the physical memory size. So maybe try to start enough VMs to reach your physical memory limit.

In the meantime, I've observed these warnings with only 5 VMs after a reboot, so my guess was wrong.

mtdcr avatar Nov 26 '22 12:11 mtdcr

@mtdcr, are you also using a window manager other than xfwm4 (i.e., XFCE's window manager)?

No, I'm using Qubes' default WM.

mtdcr avatar Nov 26 '22 12:11 mtdcr

In general, it might help to have the dmesg snippets from the affected VMs for triage. Log snippets acquired from dom0 with "xl dmesg" might also help, in case there are warning or error messages.

The warning from https://github.com/torvalds/linux/blob/f5020a08b2b371162a4a16ef97694cac3980397e/drivers/xen/gntdev.c#L406 appears in dom0's kernel messages. xl dmesg doesn't contain any warnings or errors.

mtdcr avatar Nov 26 '22 12:11 mtdcr

Since the original issue was triggered by Xorg, likely with Intel GPUs, I looked up which driver I'm using, and it's not the default one. Maybe this helps with reproducing the issue.

Section "Device"
	Identifier "Intel Graphics"
	Driver "intel"
	Option "TearFree" "true"
EndSection
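
To confirm which driver the X server actually loaded, one can grep its log (a sketch; the log path depends on how X is started, e.g. /var/log/Xorg.0.log or ~/.local/share/xorg/Xorg.0.log):

    # In dom0:
    grep -iE 'intel|modesetting|fbdev' /var/log/Xorg.0.log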

mtdcr avatar Nov 26 '22 12:11 mtdcr

mtdcr writes:

Since the original issue was triggered by Xorg, likely with Intel GPUs, I looked up which driver I'm using and it's not the default one. Maybe this helps reproducing the issue.

Section "Device"
	Identifier "Intel Graphics"
	Driver "intel"
	Option "TearFree" "true"
EndSection

That aspect is missing from my original issue filing (I didn't think about it).

I have also set the driver explicitly in the same way, though without the Option line.

I originally had to set it because there was heavy visual artifacting, and this was recommended to me as the solution: https://github.com/QubesOS/qubes-issues/issues/4782#issuecomment-975920223.

auroraanon38 avatar Dec 09 '22 00:12 auroraanon38

Section "Device"
	Identifier "Intel Graphics"
	Driver "intel"
	Option "TearFree" "true"
EndSection

Removing this section made the warning disappear. I haven't noticed any artifacts since then.

mtdcr avatar Dec 13 '22 19:12 mtdcr

mtdcr writes:

Section "Device"
	Identifier "Intel Graphics"
	Driver "intel"
	Option "TearFree" "true"
EndSection

Removing this section made the warning disappear. I haven't noticed any artifacts since then.

I just tested, but unfortunately I do not get the same outcome. If I comment out the configuration & restart, the artifacting comes back.

I'll have to make another test later despite the artifacts to see if the grant issues do still happen when not using the intel driver.

auroraanon38 avatar Dec 15 '22 03:12 auroraanon38

I've since also removed the intel driver (by removing the section mentioned above) and gone back to the fbdev driver. I was already running the picom compositor anyway, and I can confirm that the fbdev driver with a compositing manager only produces very brief artifacts during startup, when I assume picom has not yet started.

Since disabling the intel driver the "Bad page map" errors have disappeared, and Qubes is extremely stable once again.

I just tested, but unfortunately I do not get the same outcome. If I comment out the configuration & restart, the artifacting comes back.

I'll have to make another test later despite the artifacts to see if the grant issues do still happen when not using the intel driver.

Do you have a compositing manager (such as picom) installed? They generally fix all artifacting and tearing issues.
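
For i3 users, autostarting a compositor takes one line in the i3 config (a sketch, assuming picom is installed in dom0):

    # In ~/.config/i3/config: launch picom daemonized at session start
    exec --no-startup-id picom -b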

geo-mathijs avatar Feb 20 '23 14:02 geo-mathijs

I'm also affected by this issue, and it is a very serious one: Qubes hard-crashes (kernel panic?) in the middle of work. The issue began on an older kernel; I have since updated to 6.1. I'm using Qubes 4.1 and tried updating to the testing repository, but updating does not resolve the issue.

Section "OutputClass"
    Identifier "intel"
    MatchDriver "i915"
    Driver "intel"
    Option "Accelethod" "sna"
    Option "TearFree" "true"
    Option "DRI" "3"
EndSection
Linux dom0 6.1.7-1.fc32.qubes.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jan 20 20:56:29 CET 2023 x86_64 x86_64 x86_64 GNU/Linux
Feb 21 21:21:17 dom0 kernel: BUG: Bad page map in process Xorg  pte:80000007b42b8365 pmd:103359067
Feb 21 21:21:17 dom0 kernel: page:000000006ad760d6 refcount:1 mapcount:-1 mapping:0000000000000000 index:0x0 pfn:0x192c93
Feb 21 21:21:17 dom0 kernel: flags: 0x27ffffc000340a(referenced|dirty|owner_priv_1|reserved|private|node=0|zone=4|lastcpupid=0x1fffff)
Feb 21 21:21:17 dom0 kernel: raw: 0027ffffc000340a ffff888109771480 ffffea00064b2500 0000000000000000
Feb 21 21:21:17 dom0 kernel: raw: 0000000000000000 000030420000000a 00000001fffffffe 0000000000000000
Feb 21 21:21:17 dom0 kernel: page dumped because: bad pte
Feb 21 21:21:17 dom0 kernel: addr:0000763977266000 vm_flags:1c0600f9 anon_vma:0000000000000000 mapping:ffff888108fcc208 index:7
Feb 21 21:21:17 dom0 kernel: file:gntdev fault:0x0 mmap:gntdev_mmap [xen_gntdev] read_folio:0x0
Feb 21 21:21:17 dom0 kernel: CPU: 3 PID: 9310 Comm: Xorg Tainted: G    B   W          6.1.7-1.fc32.qubes.x86_64 #1
Feb 21 21:21:17 dom0 kernel: Hardware name: [Redacted]
Feb 21 21:21:17 dom0 kernel: Call Trace:
Feb 21 21:21:17 dom0 kernel:  <TASK>
Feb 21 21:21:17 dom0 kernel:  dump_stack_lvl+0x45/0x5e
Feb 21 21:21:17 dom0 kernel:  print_bad_pte.cold+0x61/0xb9
Feb 21 21:21:17 dom0 kernel:  zap_pte_range+0x556/0xa40
Feb 21 21:21:17 dom0 kernel:  zap_pmd_range.isra.0+0x1b9/0x2f0
Feb 21 21:21:17 dom0 kernel:  zap_pud_range.isra.0+0xb0/0x200
Feb 21 21:21:17 dom0 kernel:  unmap_page_range+0x1d2/0x320
Feb 21 21:21:17 dom0 kernel:  unmap_vmas+0xea/0x180
Feb 21 21:21:17 dom0 kernel:  unmap_region+0xb9/0x120
Feb 21 21:21:17 dom0 kernel:  do_mas_align_munmap+0x332/0x4b0
Feb 21 21:21:17 dom0 kernel:  do_mas_munmap+0xd9/0x130
Feb 21 21:21:17 dom0 kernel:  __vm_munmap+0xb8/0x170
Feb 21 21:21:17 dom0 kernel:  __x64_sys_munmap+0x17/0x20
Feb 21 21:21:17 dom0 kernel:  do_syscall_64+0x59/0x90
Feb 21 21:21:17 dom0 kernel:  ? syscall_exit_to_user_mode+0x17/0x40
Feb 21 21:21:17 dom0 kernel:  ? do_syscall_64+0x69/0x90
Feb 21 21:21:17 dom0 kernel:  ? do_syscall_64+0x69/0x9

blackandred avatar Feb 21 '23 20:02 blackandred

Regarding the Xorg configuration: I added that section to fix the visual artifacts, as presented in https://github.com/QubesOS/qubes-issues/issues/4782#issuecomment-1064478081. That's why I have such an Xorg configuration.

blackandred avatar Feb 21 '23 20:02 blackandred

@blackandred can you go back to modesetting and see if you still have problems?

DemiMarie avatar Feb 22 '23 01:02 DemiMarie

@DemiMarie I commented out the Xorg settings and I'm observing the systemd logs in dom0 - is that what you mean by switching to modesetting, or something else?

blackandred avatar Feb 22 '23 07:02 blackandred

@DemiMarie I commented out the Xorg settings and I'm observing the systemd logs in dom0 - is that what you mean by switching to modesetting, or something else?

What if you explicitly enable the modesetting driver?
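
For reference, an explicit modesetting configuration would be a minimal Device section along these lines (a sketch; the Identifier string is arbitrary):

Section "Device"
	Identifier "Intel Graphics"
	Driver "modesetting"
EndSection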

DemiMarie avatar Feb 22 '23 17:02 DemiMarie

Yesterday morning I commented out the Xorg config related to the Intel driver. I don't have an explicit modesetting configuration yet - I'm testing a blank config. At this moment - after a day - I do not see any issues, apart from two possibly unrelated stack traces about irq and systemd-sleep, while running at full capacity (11+ qubes, Xen memory warning in the dom0 journal). I will keep watching for a few days and let you know, because the issue was happening randomly - sometimes after an hour, sometimes after a few days.

blackandred avatar Feb 23 '23 06:02 blackandred