NVIDIA GPU kernel modules fail to load on ARM kernels with BTI enabled due to missing BTI instructions in the compiled module.

Open arnav-kansal opened this issue 11 months ago • 1 comment

NVIDIA Open GPU Kernel Modules Version

all

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

[x] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Container-Optimized OS

Kernel Release

6.6.72

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

[x] I am running on a stable kernel release.

Hardware: GPU

B200

Describe the bug

When attempting to load the NVIDIA driver on a BTI-enabled ARM64 kernel, a kernel panic occurs with the following call trace:

[  164.260150] LoadPin: kernel-module pinning-excluded obj="/var/lib/nvidia/drivers/nvidia.ko" pid=780 cmdline="insmod /var/lib/nvidia/drivers/nvidia.ko" 
[  165.050761] nvidia-nvlink: Nvlink Core is being initialized, major device number 242 
[  165.058515] Internal error: Oops - BTI: 0000000036000002 [#1] SMP
[  165.066266] Modules linked in: nvidia(O+) nft_compat nf_tables mlx5_ib ib_uverbs ib_core mlx5_core mlxfw ptp pps_core loadpin_trigger(O) fuse configfs
[  165.079464] CPU: 1 PID: 780 Comm: insmod Tainted: G           O       6.6.72+ #1
[  165.086622] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 
[  165.095352] pstate: 23400805 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=-c)
[  165.102149] pc : _portMemAllocatorAllocNonPagedWrapper+0x0/0x10 [nvidia]     
[  165.108885] lr : _portMemAllocatorAlloc+0x2c/0x250 [nvidia]   
[  165.114484] sp : ffff800080abb7e0
[  165.117723] x29: ffff800080abb7e0 x28: ffff80008bc9af88 x27: 0000000000000000
[  165.124701] x26: ffffaececa6c52a8 x25: ffffaece94370188 x24: ffffaececa58c008
[  165.131521] x23: ffffaececa5b9ad0 x22: ffffaece946a5000 x21: ffff0000c333f000
[  165.138327] x20: ffffaece946cb7e0 x19: 0000000000000010 x18: ffff8000805e7058
[  165.145162] x17: 72643d4d45545359 x16: ffffaecec8fa58a8 x15: 0000000000000000
[  165.152147] x14: 0000000000000000 x13: 0000000000800000 x12: 000d272730000000
[  165.159075] x11: 0000000000000017 x10: 00000000000f4240 x9 : 000000000006c00d
[  165.165833] x8 : ffffaece94690b70 x7 : 4150564544006464 x6 : 0000000000000018
[  165.172672] x5 : 0000000000000000 x4 : ffff800080abb720 x3 : 0000000000000001
[  165.179452] x2 : 0000000000000000 x1 : 000000000000002c x0 : ffffaece946cb7e0
[  165.186297] Call trace:
[  165.188626]  _portMemAllocatorAllocNonPagedWrapper+0x0/0x10 [nvidia]         
[  165.194827]  portCryptoInitialize+0x2c/0xc0 [nvidia]  
[  165.199843]  portInitialize+0x30/0x40 [nvidia]  
[  165.204288]  RmInitRm+0x14/0x80 [nvidia] 
[  165.208243]  rm_init_rm+0xc/0x20 [nvidia]  
[  165.212325]  nv_module_init+0x8c/0xe8 [nvidia]
[  165.216861]  init_module+0xfc/0x2c0 [nvidia]
[  165.221121]  do_one_initcall+0x134/0x2b8
[  165.225058]  do_init_module+0x64/0x208
[  165.228648]  load_module+0x1184/0x1300 
[  165.232227]  __arm64_sys_finit_module+0x238/0x370
[  165.236748]  invoke_syscall+0x5c/0x128 
[  165.240324]  el0_svc_common+0x90/0xf8
[  165.243811]  do_el0_svc+0x2c/0x48 
[  165.247049]  el0_svc+0x40/0x98
[  165.250002]  el0t_64_sync_handler+0x8c/0x108 
[  165.254149]  el0t_64_sync+0x198/0x1a0 
[  165.257693] Code: aa0103e0 140003db 00000000 00000000 (aa0103e0) 
[  165.263492] ---[ end trace 0000000000000000 ]--- 
[  165.322068] Kernel panic - not syncing: Oops - BTI: Fatal exception
[  165.328507] SMP: stopping secondary CPUs
[  165.332519] Kernel Offset: 0x2ece48e90000 from 0xffff800080000000
[  165.338348] PHYS_OFFSET: 0x40000000  
[  165.341679] CPU features: 0x0,00000000,50024c43,15267ea7 
[  165.346968] Memory Limit: none
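
For readers who want to interpret the "Oops - BTI" value in the log above, the following is a minimal sketch of decoding it. It assumes the number printed after "Oops - BTI:" is the ESR_ELx syndrome value, where the exception class (EC) is in bits [31:26] and, for a Branch Target Exception (EC 0x0D), the low two bits carry the BTYPE that was active at the fault. The function name `decode_bti_esr` is hypothetical and only used for illustration.

```python
# Sketch: decode the syndrome value from "Oops - BTI: 0000000036000002".
# Assumption: the value is ESR_ELx, with EC in bits [31:26] and BTYPE in
# bits [1:0] when EC is the Branch Target Exception class (0x0D).

ESR_EC_SHIFT = 26
ESR_EC_MASK = 0x3F
ESR_EC_BTI = 0x0D  # Branch Target Exception


def decode_bti_esr(esr: int) -> dict:
    """Return the exception class and BTYPE fields of an ESR value."""
    ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK
    return {
        "ec": ec,
        "is_bti": ec == ESR_EC_BTI,
        "btype": esr & 0x3,  # 0b10 is the BTYPE set by an indirect call (BLR)
    }


if __name__ == "__main__":
    print(decode_bti_esr(0x36000002))
```

A BTYPE of 0b10 is consistent with the "BTYPE=-c" shown in the pstate line of the same log.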

The crash occurs at the entry of the _portMemAllocatorAllocNonPagedWrapper function, which is reached through the indirect call (blr x8) at _portMemAllocatorAlloc+0x28. Disassembly shows that neither function begins with a BTI instruction:

$ llvm-objdump nvidia.ko --disassemble-symbols=_portMemAllocatorAllocNonPagedWrapper

nvidia.ko:   file format elf64-littleaarch64

Disassembly of section .text:

000000000031af90 <_portMemAllocatorAllocNonPagedWrapper>:
  31af90: aa0103e0      mov     x0, x1
  31af94: 14000000      b       0x31af94 <_portMemAllocatorAllocNonPagedWrapper+0x4>
                ...
$ llvm-objdump nvidia.ko --disassemble-symbols=_portMemAllocatorAlloc

nvidia.ko:   file format elf64-littleaarch64

Disassembly of section .text:

000000000031b100 <_portMemAllocatorAlloc>:
  31b100: a9be7bfd      stp     x29, x30, [sp, #-0x20]!
  31b104: a9014ff4      stp     x20, x19, [sp, #0x10]
  31b108: 910003fd      mov     x29, sp
  31b10c: b4000660      cbz     x0, 0x31b1d8 <_portMemAllocatorAlloc+0xd8>
  31b110: aa0103f3      mov     x19, x1
  31b114: 91007021      add     x1, x1, #0x1c
  31b118: f100743f      cmp     x1, #0x1d
  31b11c: 540006a3      b.lo    0x31b1f0 <_portMemAllocatorAlloc+0xf0>
  31b120: f9400008      ldr     x8, [x0]
  31b124: aa0003f4      mov     x20, x0
  31b128: d63f0100      blr     x8
  31b12c: b40010a0      cbz     x0, 0x31b340 <_portMemAllocatorAlloc+0x240>
  31b130: f9400e89      ldr     x9, [x20, #0x18]
  31b134: aa0003e8      mov     x8, x0
  31b138: 91006000      add     x0, x0, #0x18
  31b13c: b4001029      cbz     x9, 0x31b340 <_portMemAllocatorAlloc+0x240>
  31b140: 9100612b      add     x11, x9, #0x18
  31b144: f9000113      str     x19, [x8]
  31b148: d503201f      nop
  31b14c: d503201f      nop
  31b150: 885ffd6a      ldaxr   w10, [x11]
  31b154: 1100054a      add     w10, w10, #0x1
  31b158: 880cfd6a      stlxr   w12, w10, [x11]
  31b15c: 35ffffac      cbnz    w12, 0x31b150 <_portMemAllocatorAlloc+0x50>
  31b160: 9100712b      add     x11, x9, #0x1c
  31b164: 885ffd6c      ldaxr   w12, [x11]
  31b168: 1100058c      add     w12, w12, #0x1
  31b16c: 880dfd6c      stlxr   w13, w12, [x11]
  31b170: 35ffffad      cbnz    w13, 0x31b164 <_portMemAllocatorAlloc+0x64>
  31b174: 9100a12c      add     x12, x9, #0x28
  31b178: d503201f      nop
  31b17c: d503201f      nop
  31b180: c85ffd8b      ldaxr   x11, [x12]
  31b184: 8b13016b      add     x11, x11, x19
  31b188: c80dfd8b      stlxr   w13, x11, [x12]
  31b18c: 35ffffad      cbnz    w13, 0x31b180 <_portMemAllocatorAlloc+0x80>
  31b190: 9100c12c      add     x12, x9, #0x30
  31b194: c85ffd8d      ldaxr   x13, [x12]
  31b198: 8b1301ad      add     x13, x13, x19
  31b19c: c80efd8d      stlxr   w14, x13, [x12]
  31b1a0: 35ffffae      cbnz    w14, 0x31b194 <_portMemAllocatorAlloc+0x94>
  31b1a4: aa0903ec      mov     x12, x9
  31b1a8: f8438d8d      ldr     x13, [x12, #0x38]!
  31b1ac: 14000003      b       0x31b1b8 <_portMemAllocatorAlloc+0xb8>
  31b1b0: d5033f5f      clrex
  31b1b4: f940018d      ldr     x13, [x12]
  31b1b8: eb0d017f      cmp     x11, x13
  31b1bc: 540001e9      b.ls    0x31b1f8 <_portMemAllocatorAlloc+0xf8>
  31b1c0: c85ffd8e      ldaxr   x14, [x12]
  31b1c4: eb0d01df      cmp     x14, x13
  31b1c8: 54ffff41      b.ne    0x31b1b0 <_portMemAllocatorAlloc+0xb0>
  31b1cc: c80efd8b      stlxr   w14, x11, [x12]
  31b1d0: 35ffff8e      cbnz    w14, 0x31b1c0 <_portMemAllocatorAlloc+0xc0>
  31b1d4: 17fffff8      b       0x31b1b4 <_portMemAllocatorAlloc+0xb4>
  31b1d8: 94000000      bl      0x31b1d8 <_portMemAllocatorAlloc+0xd8>
  31b1dc: 72001c1f      tst     w0, #0xff
  31b1e0: 54000080      b.eq    0x31b1f0 <_portMemAllocatorAlloc+0xf0>
  31b1e4: 94000000      bl      0x31b1e4 <_portMemAllocatorAlloc+0xe4>
  31b1e8: aa1f03e0      mov     x0, xzr
  31b1ec: 14000055      b       0x31b340 <_portMemAllocatorAlloc+0x240>
  31b1f0: aa1f03e0      mov     x0, xzr
  31b1f4: 14000053      b       0x31b340 <_portMemAllocatorAlloc+0x240>
  31b1f8: 9100812d      add     x13, x9, #0x20
  31b1fc: b94001ae      ldr     w14, [x13]
  31b200: f940018f      ldr     x15, [x12]
  31b204: eb0f017f      cmp     x11, x15
  31b208: 54000141      b.ne    0x31b230 <_portMemAllocatorAlloc+0x130>
  31b20c: d503201f      nop
  31b210: 885ffdaf      ldaxr   w15, [x13]
  31b214: 6b0e01ff      cmp     w15, w14
  31b218: 54000081      b.ne    0x31b228 <_portMemAllocatorAlloc+0x128>
  31b21c: 880ffdaa      stlxr   w15, w10, [x13]
  31b220: 35ffff8f      cbnz    w15, 0x31b210 <_portMemAllocatorAlloc+0x110>
  31b224: 14000003      b       0x31b230 <_portMemAllocatorAlloc+0x130>
  31b228: d5033f5f      clrex
  31b22c: 17fffff4      b       0x31b1fc <_portMemAllocatorAlloc+0xfc>
  31b230: 9000000b      adrp    x11, 0x31b000 <portMemShutdown+0x50>
  31b234: 9100016b      add     x11, x11, #0x0
  31b238: 885ffd6a      ldaxr   w10, [x11]
  31b23c: 1100054a      add     w10, w10, #0x1
  31b240: 880cfd6a      stlxr   w12, w10, [x11]
  31b244: 35ffff6c      cbnz    w12, 0x31b230 <_portMemAllocatorAlloc+0x130>
  31b248: 9000000b      adrp    x11, 0x31b000 <portMemShutdown+0x50>
  31b24c: 9100016b      add     x11, x11, #0x0
  31b250: 885ffd6c      ldaxr   w12, [x11]
  31b254: 1100058c      add     w12, w12, #0x1
  31b258: 880dfd6c      stlxr   w13, w12, [x11]
  31b25c: 35ffffad      cbnz    w13, 0x31b250 <_portMemAllocatorAlloc+0x150>
  31b260: 9000000c      adrp    x12, 0x31b000 <portMemShutdown+0x50>
  31b264: 9100018c      add     x12, x12, #0x0
  31b268: d503201f      nop
  31b26c: d503201f      nop
  31b270: c85ffd8b      ldaxr   x11, [x12]
  31b274: 8b13016b      add     x11, x11, x19
  31b278: c80dfd8b      stlxr   w13, x11, [x12]
  31b27c: 35ffffad      cbnz    w13, 0x31b270 <_portMemAllocatorAlloc+0x170>
  31b280: 9000000c      adrp    x12, 0x31b000 <portMemShutdown+0x50>
  31b284: 9100018c      add     x12, x12, #0x0
  31b288: d503201f      nop
  31b28c: d503201f      nop
  31b290: c85ffd8d      ldaxr   x13, [x12]
  31b294: 8b1301ad      add     x13, x13, x19
  31b298: c80efd8d      stlxr   w14, x13, [x12]
  31b29c: 35ffffae      cbnz    w14, 0x31b290 <_portMemAllocatorAlloc+0x190>
  31b2a0: 9000000c      adrp    x12, 0x31b000 <portMemShutdown+0x50>
  31b2a4: f940018d      ldr     x13, [x12]
  31b2a8: eb0d017f      cmp     x11, x13
  31b2ac: 540001e9      b.ls    0x31b2e8 <_portMemAllocatorAlloc+0x1e8>
  31b2b0: 9000000e      adrp    x14, 0x31b000 <portMemShutdown+0x50>
  31b2b4: 910001ce      add     x14, x14, #0x0
  31b2b8: 14000006      b       0x31b2d0 <_portMemAllocatorAlloc+0x1d0>
  31b2bc: d5033f5f      clrex
  31b2c0: f940018d      ldr     x13, [x12]
  31b2c4: eb0d017f      cmp     x11, x13
  31b2c8: 54000109      b.ls    0x31b2e8 <_portMemAllocatorAlloc+0x1e8>
  31b2cc: d503201f      nop
  31b2d0: c85ffdcf      ldaxr   x15, [x14]
  31b2d4: eb0d01ff      cmp     x15, x13
  31b2d8: 54ffff21      b.ne    0x31b2bc <_portMemAllocatorAlloc+0x1bc>
  31b2dc: c80ffdcb      stlxr   w15, x11, [x14]
  31b2e0: 35ffff8f      cbnz    w15, 0x31b2d0 <_portMemAllocatorAlloc+0x1d0>
  31b2e4: 17fffff7      b       0x31b2c0 <_portMemAllocatorAlloc+0x1c0>
  31b2e8: 9000000c      adrp    x12, 0x31b000 <portMemShutdown+0x50>
  31b2ec: 9100018c      add     x12, x12, #0x0
  31b2f0: b940018d      ldr     w13, [x12]
  31b2f4: f9400d8e      ldr     x14, [x12, #0x18]
  31b2f8: eb0e017f      cmp     x11, x14
  31b2fc: 54000121      b.ne    0x31b320 <_portMemAllocatorAlloc+0x220>
  31b300: 885ffd8e      ldaxr   w14, [x12]
  31b304: 6b0d01df      cmp     w14, w13
  31b308: 54000081      b.ne    0x31b318 <_portMemAllocatorAlloc+0x218>
  31b30c: 880efd8a      stlxr   w14, w10, [x12]
  31b310: 35ffff8e      cbnz    w14, 0x31b300 <_portMemAllocatorAlloc+0x200>
  31b314: 14000003      b       0x31b320 <_portMemAllocatorAlloc+0x220>
  31b318: d5033f5f      clrex
  31b31c: 17fffff5      b       0x31b2f0 <_portMemAllocatorAlloc+0x1f0>
  31b320: f9400129      ldr     x9, [x9]
  31b324: 528c2c8a      mov     w10, #0x6164            // =24932
  31b328: 72ad0caa      movk    w10, #0x6865, lsl #16
  31b32c: b900110a      str     w10, [x8, #0x10]
  31b330: f9000509      str     x9, [x8, #0x8]
  31b334: 528d2d88      mov     w8, #0x696c             // =26988
  31b338: 72ae8c28      movk    w8, #0x7461, lsl #16
  31b33c: b8336808      str     w8, [x0, x19]
  31b340: a9414ff4      ldp     x20, x19, [sp, #0x10]
  31b344: a8c27bfd      ldp     x29, x30, [sp], #0x20
  31b348: d65f03c0      ret
  31b34c: 00000000      udf     #265
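
The check that the disassembly above performs by eye can be sketched in a few lines. This is an illustrative helper, not part of the driver or the kernel: it pulls the faulting instruction word out of the oops "Code:" line (the word in parentheses) and tests whether it is one of the instruction encodings that a BTI-guarded page accepts as the target of a BLR. The function names `faulting_word` and `accepts_blr` are hypothetical.

```python
import re

# A64 encodings of the hint-space instructions that can act as a landing pad
# for an indirect call (BLR) into a BTI-guarded page.
BTI_C = 0xD503245F    # bti c
BTI_JC = 0xD50324DF   # bti jc
PACIASP = 0xD503233F  # paciasp (implicit "bti c" behaviour)
PACIBSP = 0xD503237F  # pacibsp (implicit "bti c" behaviour)

BLR_LANDING_PADS = {BTI_C, BTI_JC, PACIASP, PACIBSP}


def faulting_word(code_line: str) -> int:
    """Extract the parenthesised instruction word from an oops 'Code:' line."""
    match = re.search(r"\(([0-9a-fA-F]{8})\)", code_line)
    if match is None:
        raise ValueError("no parenthesised instruction word found")
    return int(match.group(1), 16)


def accepts_blr(word: int) -> bool:
    """True if the instruction word is a valid BLR target under BTI."""
    return word in BLR_LANDING_PADS


if __name__ == "__main__":
    line = "Code: aa0103e0 140003db 00000000 00000000 (aa0103e0)"
    word = faulting_word(line)
    print(hex(word), accepts_blr(word))
```

Applied to the "Code:" line from the log, the faulting word is the `mov x0, x1` at the start of _portMemAllocatorAllocNonPagedWrapper, which is not a landing pad.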

To Reproduce

We are able to reproduce this crash by setting CONFIG_ARM64_BTI and CONFIG_ARM64_BTI_KERNEL in our v6.6.72-based arm64 kernel, which is cross-compiled on an x86_64 host for an arm64 target. The crash is consistent and happens with all versions of open-gpu-kernel-modules available at this time.

Bug Incidence

Always

nvidia-bug-report.log.gz

The kernel crash prevents us from capturing the nvidia-bug-report.log.gz.

More Info

We believe the driver build system is not generating BTI instructions in all the right places. The "src" directory is not compiled with -mbranch-protection=bti, but the "kernel-open" directory is. This may be because "kernel-open" contains the Kbuild file and uses the kernel's configuration, while "src" does not appear to. Hardcoding the flags in the build, as in the patch below, generates BTI instructions in all the right places, which demonstrates the problem.

--- a/utils.mk    2025-01-30 00:38:50.222119248 -0800
+++ b/utils.mk    2025-01-30 00:38:43.586047595 -0800
@@ -167,7 +167,7 @@ ifeq ($(TARGET_ARCH),armv7l)
 endif

 ifeq ($(TARGET_ARCH),aarch64)
-  CFLAGS += -DNV_AARCH64 -DNV_ARCH_BITS=64
+  CFLAGS += -DNV_AARCH64 -DNV_ARCH_BITS=64 -mbranch-protection=pac-ret+bti -march=armv8.5-a -DARM64_ASM_ARCH='"armv8.5-a"'
 endif

 ifeq ($(TARGET_ARCH),ppc64le) 
--- a/src/nvidia-modeset/Makefile
+++ b/src/nvidia-modeset/Makefile
@@ -94,7 +94,6 @@ endif
 
 ifeq ($(TARGET_ARCH),aarch64)
   CFLAGS += -mgeneral-regs-only
-  CFLAGS += -march=armv8-a
   CFLAGS += -ffixed-x18
   CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mno-outline-atomics)
 endif
--- a/src/nvidia/Makefile
+++ b/src/nvidia/Makefile
@@ -88,7 +88,6 @@ endif
 
 ifeq ($(TARGET_ARCH),aarch64)
   CFLAGS += -mgeneral-regs-only
-  CFLAGS += -march=armv8-a
   CFLAGS += -mstrict-align
   CFLAGS += -ffixed-x18
   CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mno-outline-atomics)
--- a/kernel-open/Kbuild
+++ b/kernel-open/Kbuild
@@ -109,7 +109,7 @@ endif
 EXTRA_CFLAGS += -ffreestanding
 
 ifeq ($(ARCH),arm64)
- EXTRA_CFLAGS += -mgeneral-regs-only -march=armv8-a
+ EXTRA_CFLAGS += -mgeneral-regs-only
  EXTRA_CFLAGS += $(call cc-option,-mno-outline-atomics,)
 endif
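
As an alternative to hardcoding, the flag could in principle be derived from the target kernel's configuration. The following is only a sketch of that idea, not how the driver build works today: it scans the text of a kernel .config and, when CONFIG_ARM64_BTI_KERNEL is enabled, returns the same -mbranch-protection value that the patch above hardcodes. The function names `config_enabled` and `extra_cflags` are hypothetical, and how the resulting flags would be passed into the "src" makefiles is left open.

```python
# Sketch: pick extra CFLAGS for the non-Kbuild part from a kernel .config.
# Assumption: the flag value mirrors the one hardcoded in the patch above.


def config_enabled(config_text: str, symbol: str) -> bool:
    """True if the .config text contains 'SYMBOL=y' on its own line."""
    wanted = f"{symbol}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


def extra_cflags(config_text: str) -> list:
    """Return the branch-protection flags implied by the kernel config."""
    if config_enabled(config_text, "CONFIG_ARM64_BTI_KERNEL"):
        return ["-mbranch-protection=pac-ret+bti"]
    return []


if __name__ == "__main__":
    sample = "CONFIG_ARM64_BTI=y\nCONFIG_ARM64_BTI_KERNEL=y\n"
    print(extra_cflags(sample))
```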

Jan 31 '25 00:01 arnav-kansal

@arnav-kansal I believe this issue is along the same lines as an issue that was opened for CFI violations. @aritger has a good write-up on our build system design and what users can do to work with it.

https://github.com/NVIDIA/open-gpu-kernel-modules/issues/439#issuecomment-1378122108

Quoting the comment directly.

Thanks for the feedback. That is a fair critique.

For whatever it is worth, there are a few motivations for the split:

(1) Historically, the non-kbuild part (the part that produces nv-kernel.o) was built internally to NVIDIA and is what was distributed as binary-only. Code not built for a specific target kernel cannot use kbuild.

(2) With the advent of open-gpu-kernel-modules, we chose to retain that split so that users installing the driver wouldn't be required to build all of the kernel module when installing the driver. I.e., installing the driver from the NVIDIA .run file contains a pre-built open-gpu-kernel-modules nv-kernel.o. We can only do that because nv-kernel.o is not kernel-specific. Currently, open-gpu-kernel-modules takes about 10 minutes to build if single threaded. Much of that can be covered with a parallel build, but we didn't want to add that install time for every user installing from .run file if we didn't need to.

The big disadvantage of the split is of course that you need to match these sorts of compiler flags across the split if doing instrumentation like RAP.

Maybe the benefits of (2) are outweighed by the downsides and we should revisit that decision.

That is at least the context. So, I don't know if we can immediately move to an all kbuild-native build.

The nv_encode_caching() bug is a good catch. Thanks for that. Does nv-mmap.c not include nv-proto.h? If not, that is a bug, too. Even with the current split, I would expect the compiler to complain if the prototype and implementation mismatch.

For the near-term, would it be acceptable to pass these additional CFLAGS on the make commandline? Maybe the makefiles need more variable plumbing to facilitate that. But, I think it will be easiest to get traction with something like that, than require kbuild-ifying the entirety of the open-gpu-kernel-modules build. The code changes for that wouldn't be difficult, but the hard part would be the packaging/installation implications of that choice.

Feb 01 '25 20:02 Binary-Eater