Kernel version 5.18 breaks with RTX 3080 Mobile

Open Pyrestone opened this issue 3 years ago • 2 comments

Hi,

The error message I most often see associated with this issue is this one:

[   30.848138] NVRM: GPU at PCI:0000:01:00: GPU-e5a2c765-97ab-76de-eaf8-021ea4ed93bc
[   30.848143] NVRM: Xid (PCI:0000:01:00): 79, pid=2835, GPU has fallen off the bus.
[   30.848146] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.

Symptoms include Xorg crashing and the entire screen freezing permanently until (forceful) reboot.
The machine is still reachable by ssh (which is how I got any error message in the first place).
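Since the kernel log is still readable over ssh after the freeze, the Xid events can be picked out of it programmatically. The following is a minimal sketch, not part of the original report: the function name `find_xid_events` and the regular expression are illustrative assumptions, and the sample text is simply the three log lines quoted above.

```python
import re

# Matches NVRM Xid lines such as:
# [   30.848143] NVRM: Xid (PCI:0000:01:00): 79, pid=2835, GPU has fallen off the bus.
# The pattern is an assumption based only on the log lines quoted in this thread.
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \((?P<dev>PCI:[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+),\s*(?P<rest>.*)"
)


def find_xid_events(log_text):
    """Return (timestamp, pci_device, xid_code, message) for each Xid line."""
    events = []
    for line in log_text.splitlines():
        m = XID_RE.search(line)
        if m:
            events.append(
                (float(m["ts"]), m["dev"], int(m["xid"]), m["rest"].strip())
            )
    return events


# Sample input: the kernel log excerpt from this report.
SAMPLE = """\
[   30.848138] NVRM: GPU at PCI:0000:01:00: GPU-e5a2c765-97ab-76de-eaf8-021ea4ed93bc
[   30.848143] NVRM: Xid (PCI:0000:01:00): 79, pid=2835, GPU has fallen off the bus.
[   30.848146] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
"""

if __name__ == "__main__":
    for event in find_xid_events(SAMPLE):
        print(event)
```

On a live system the text passed to `find_xid_events` would presumably come from the kernel log captured over ssh rather than from a hard-coded sample.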

I had this issue before on the Ubuntu 20.04 default kernel (5.13 and 5.14 I think), and switched to liquorix (version 5.16.0-11.1) as a way to get a newer kernel where this issue was supposedly fixed.
I had an original thread for that issue in the NVIDIA developer forums. Up to and including version 5.17, the bug did not reappear, but then the liquorix package updated to 5.18.0-16 and the crashes reappeared.

I suspect that the crash-causing changes appear somewhere between 5.18.0-10 and 5.18.0-12 but I can't confirm that, since 5.18.0-12 is the latest version of liquorix that is offered on apt. 5.18.0-10 was temporarily stable for me, but I can't guarantee that since I didn't try it for that long.

Currently, 5.17.0-15 is stable for me (with both nvidia-driver-515 and nvidia-driver-510), which is what I currently use as a workaround, but I would like to get this issue re-fixed in the future.

Could you please take a look at this?

I'd be willing to install different kernel versions to bisect the error if that's necessary.
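For reference, bisecting across packaged kernel versions amounts to a binary search over an ordered version list. The sketch below only illustrates that idea under stated assumptions: the helper name `first_bad` is made up, the version list is hypothetical (5.18.0-11 in particular is assumed to exist), and the "bad" answers are simulated, not test results from this machine.

```python
def first_bad(versions, is_bad):
    """Binary-search an ordered list for the first version where is_bad() is True.

    Assumes versions[0] is known good, versions[-1] is known bad, and that
    every version after the first bad one is bad as well.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


# Hypothetical, ordered list of installable versions (assumption, see above).
VERSIONS = ["5.17.0-15", "5.18.0-10", "5.18.0-11", "5.18.0-12"]

# Simulated answers standing in for "boot this kernel and see if it crashes".
SIMULATED_BAD = {"5.18.0-11", "5.18.0-12"}

if __name__ == "__main__":
    print(first_bad(VERSIONS, lambda v: v in SIMULATED_BAD))
```

In practice `is_bad` would be a manual step (install, boot, run for a while, report), and an intermittent crash like this one would need a long enough soak time per version before calling it good.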

Pyrestone avatar Aug 04 '22 11:08 Pyrestone

Here's the output of $ inxi -bxxzG

System:    Kernel: 5.17.0-15.1-liquorix-amd64 x86_64 bits: 64 compiler: N/A Desktop: Gnome 3.36.9 wm: gnome-shell dm: GDM3 
           Distro: Ubuntu 20.04.4 LTS (Focal Fossa) 
Machine:   Type: Laptop System: GIGABYTE product: AORUS 15P YD v: 993AC serial: <filter> Chassis: SYS_CHASSIS_ type: 10 v: y.y 
           serial: <filter> 
           Mobo: GIGABYTE model: AORUS 15P YD serial: <filter> UEFI: American Megatrends LLC. v: FB07 date: 10/07/2021 
Battery:   ID-1: BAT1 charge: 99.0 Wh condition: 99.0/99.0 Wh (100%) volts: 17.1/15.2 model: GIGABYTE Aero 15 serial: <filter> 
           status: Full 
CPU:       8-Core: 11th Gen Intel Core i7-11800H type: MT MCP arch: N/A speed: 1019 MHz min/max: 800/4600 MHz 
Graphics:  Device-1: Intel vendor: Gigabyte driver: i915 v: kernel bus ID: 00:02.0 chip ID: 8086:9a60 
           Device-2: NVIDIA vendor: Gigabyte driver: nvidia v: 515.65.01 bus ID: 01:00.0 chip ID: 10de:249c 
           Display: x11 server: X.Org 1.20.13 driver: none compositor: gnome-shell resolution: 2560x1440~60Hz, 1920x1080~240Hz 
           OpenGL: renderer: NVIDIA GeForce RTX 3080 Laptop GPU/PCIe/SSE2 v: 4.6.0 NVIDIA 515.65.01 direct render: Yes 
Network:   Device-1: Realtek RTL8125 2.5GbE vendor: Gigabyte driver: r8169 v: kernel port: 3000 bus ID: 2e:00.0 
           chip ID: 10ec:8125 
           Device-2: Intel Wi-Fi 6 AX200 driver: iwlwifi v: kernel port: 3000 bus ID: 30:00.0 chip ID: 8086:2723 
Drives:    Local Storage: total: 953.87 GiB used: 1.05 TiB (113.0%) 
Info:      Processes: 386 Uptime: 29m Memory: 31.11 GiB used: 3.27 GiB (10.5%) Init: systemd v: 245 runlevel: 5 Compilers: 
           gcc: 9.4.0 alt: 10/8/9 clang: 10.0.0-4ubuntu1 Shell: bash v: 5.0.17 running in: gnome-terminal inxi: 3.0.38

Pyrestone avatar Aug 04 '22 11:08 Pyrestone

There are two things you can try, based on some quick research of mine.

  1. Try adding mitigations=off to your kernel cmdline. Kernel 5.18.14 added new CPU vulnerability mitigations and that might be affecting your system strangely.
  2. Add pcie_aspm=off to disable power management of PCIe lanes. It could be the case that power management has changed slightly and is now affecting you, causing the error of GPU has fallen off the bus.

Let me know if that helps.

EDIT: On kernel 5.18.0-12.2 / 5.18-13, the PCIe ASPM configuration was changed to leave ASPM settings at BIOS defaults. This was required to allow for deeper power states on some laptops. This puts Liquorix in line with most other kernels, including stock. https://github.com/damentz/liquorix-package/commit/e7721911392e81edfcb335f8790a5750a3a65c38

damentz avatar Aug 05 '22 16:08 damentz

Can report that the bug still exists for kernel "vmlinuz-5.19.0-12.3-liquorix-amd64" installed via apt. mitigations=off didn't help as it crashed again.

This is becoming an issue for me, as some of my colleagues and I use this laptop, for which no stable liquorix kernel now exists on apt. I only have "linux-image-5.17.0-15.1-liquorix-amd64/now 5.17-19ubuntu1~focal" locally now. If I uninstall the known working version, I have no way of getting it back since it's no longer in apt (only 5.19.x).

Pyrestone avatar Oct 04 '22 15:10 Pyrestone

Addendum: With the option pcie_aspm=off, 5.19.0 also crashes relatively quickly. This means there is currently no known stable version for me on apt.

@damentz is there a way that I can build/install the version I currently have in some way which is not via the liquorix repo (which no longer has it)?

Or can I at least export my local version of the apt package and back it up somewhere so that my colleagues and I have a stable version of the kernel lying around somewhere?

Pyrestone avatar Oct 04 '22 15:10 Pyrestone

Can you try passing pci=pci_bus_perf to your kernel cmdline? This goes back to the original change that "broke" your system. This option will technically re-enable the PCI performance override option that used to be enabled on Liquorix.

damentz avatar Oct 04 '22 16:10 damentz

will try and report back, thanks.

By the way, do you have a manual for building the deb files for older versions?

Pyrestone avatar Oct 04 '22 16:10 Pyrestone

update: pci=pci_bus_perf also crashes immediately

Pyrestone avatar Oct 04 '22 16:10 Pyrestone

tried all 3 options together, still crashed: cat /proc/cmdline yielded: audit=0 intel_pstate=disable hpet=disable rcupdate.rcu_expedited=1 BOOT_IMAGE=/boot/vmlinuz-5.19.0-12.3-liquorix-amd64 root=UUID=... ro quiet splash mitigations=off pcie_aspm=off pci=pci_bus_perf vt.handoff=7
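A quick way to confirm that all of the suggested options really reached the running kernel is to check the command line token by token. This is a minimal sketch with assumed helper names (`parse_cmdline`, `missing_params`); the sample string is the `/proc/cmdline` output quoted above, with the elided `root=UUID=...` left as-is.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: [values]} mapping.

    Bare flags such as "quiet" map to an empty list. Quoted values that
    contain spaces are not handled in this simplified sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params.setdefault(name, [])
        if sep:
            params[name].append(value)
    return params


def missing_params(cmdline, wanted):
    """Return the entries of `wanted` that do not appear verbatim as tokens."""
    tokens = set(cmdline.split())
    return [w for w in wanted if w not in tokens]


# Sample input: the command line reported in this thread.
SAMPLE = (
    "audit=0 intel_pstate=disable hpet=disable rcupdate.rcu_expedited=1 "
    "BOOT_IMAGE=/boot/vmlinuz-5.19.0-12.3-liquorix-amd64 root=UUID=... ro "
    "quiet splash mitigations=off pcie_aspm=off pci=pci_bus_perf vt.handoff=7"
)

WANTED = ["mitigations=off", "pcie_aspm=off", "pci=pci_bus_perf"]

if __name__ == "__main__":
    print(missing_params(SAMPLE, WANTED))
    print(parse_cmdline(SAMPLE)["pci"])
```

On the affected machine the string would be read from `/proc/cmdline` instead of `SAMPLE`; an empty result from `missing_params` only shows the options were passed, not that the kernel acted on them.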

I'm starting to suspect this regression might be due to a change in the underlying kernel version, not the liquorix parameters. I just have no idea how to debug/bisect this at this point, because I can't install any of the versions that worked.

Pyrestone avatar Oct 04 '22 16:10 Pyrestone

btw, adding mitigations=off pcie_aspm=off to the 5.17 cmdline also sank the boat :P

Pyrestone avatar Oct 04 '22 16:10 Pyrestone

Anything new here with the latest kernel revisions? I noticed that in one of the links you posted [1], the OP returned their Dell device and got a Lenovo instead, and has had no issues since.

[1] https://forums.developer.nvidia.com/t/device-not-found-ubuntu-20-04-dell-precision-rtx-a4000-rminitadapter-failed/202819/18

damentz avatar Feb 01 '23 03:02 damentz

Due to lack of time for experimenting with an unstable system, I have stuck with version 5.17 for now (liquorix-image-5.17.0-15.1, apt version 5.17-19ubuntu1~focal). I found a backup of it in the liquorix forum and have stored my own copy of the deb file.

I might be willing to take a weekend sometime for bisecting/trying recent versions, but I don't know how to install anything that is no longer on apt, so I can't exactly start from my working version and step forward until it breaks :/

Pyrestone avatar Feb 01 '23 11:02 Pyrestone

I'm going to close this out for now. I'm not receiving any other reports, so most likely it's some strange interaction on your specific hardware. If you are still running into issues, let me know if you find anything new that might help.

damentz avatar Mar 08 '23 23:03 damentz