NVIDIA GPU kernel modules fail to load on ARM kernels with BTI enabled due to missing BTI instructions in the compiled module.
NVIDIA Open GPU Kernel Modules Version
all
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- [x] I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Container-Optimized OS
Kernel Release
6.6.72
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- [x] I am running on a stable kernel release.
Hardware: GPU
B200
Describe the bug
When attempting to load the NVIDIA driver on a BTI-enabled ARM64 kernel, a kernel panic occurs with the following call trace:
[ 164.260150] LoadPin: kernel-module pinning-excluded obj="/var/lib/nvidia/drivers/nvidia.ko" pid=780 cmdline="insmod /var/lib/nvidia/drivers/nvidia.ko"
[ 165.050761] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[ 165.058515] Internal error: Oops - BTI: 0000000036000002 [#1] SMP
[ 165.066266] Modules linked in: nvidia(O+) nft_compat nf_tables mlx5_ib ib_uverbs ib_core mlx5_core mlxfw ptp pps_core loadpin_trigger(O) fuse configfs
[ 165.079464] CPU: 1 PID: 780 Comm: insmod Tainted: G O 6.6.72+ #1
[ 165.086622] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 165.095352] pstate: 23400805 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=-c)
[ 165.102149] pc : _portMemAllocatorAllocNonPagedWrapper+0x0/0x10 [nvidia]
[ 165.108885] lr : _portMemAllocatorAlloc+0x2c/0x250 [nvidia]
[ 165.114484] sp : ffff800080abb7e0
[ 165.117723] x29: ffff800080abb7e0 x28: ffff80008bc9af88 x27: 0000000000000000
[ 165.124701] x26: ffffaececa6c52a8 x25: ffffaece94370188 x24: ffffaececa58c008
[ 165.131521] x23: ffffaececa5b9ad0 x22: ffffaece946a5000 x21: ffff0000c333f000
[ 165.138327] x20: ffffaece946cb7e0 x19: 0000000000000010 x18: ffff8000805e7058
[ 165.145162] x17: 72643d4d45545359 x16: ffffaecec8fa58a8 x15: 0000000000000000
[ 165.152147] x14: 0000000000000000 x13: 0000000000800000 x12: 000d272730000000
[ 165.159075] x11: 0000000000000017 x10: 00000000000f4240 x9 : 000000000006c00d
[ 165.165833] x8 : ffffaece94690b70 x7 : 4150564544006464 x6 : 0000000000000018
[ 165.172672] x5 : 0000000000000000 x4 : ffff800080abb720 x3 : 0000000000000001
[ 165.179452] x2 : 0000000000000000 x1 : 000000000000002c x0 : ffffaece946cb7e0
[ 165.186297] Call trace:
[ 165.188626] _portMemAllocatorAllocNonPagedWrapper+0x0/0x10 [nvidia]
[ 165.194827] portCryptoInitialize+0x2c/0xc0 [nvidia]
[ 165.199843] portInitialize+0x30/0x40 [nvidia]
[ 165.204288] RmInitRm+0x14/0x80 [nvidia]
[ 165.208243] rm_init_rm+0xc/0x20 [nvidia]
[ 165.212325] nv_module_init+0x8c/0xe8 [nvidia]
[ 165.216861] init_module+0xfc/0x2c0 [nvidia]
[ 165.221121] do_one_initcall+0x134/0x2b8
[ 165.225058] do_init_module+0x64/0x208
[ 165.228648] load_module+0x1184/0x1300
[ 165.232227] __arm64_sys_finit_module+0x238/0x370
[ 165.236748] invoke_syscall+0x5c/0x128
[ 165.240324] el0_svc_common+0x90/0xf8
[ 165.243811] do_el0_svc+0x2c/0x48
[ 165.247049] el0_svc+0x40/0x98
[ 165.250002] el0t_64_sync_handler+0x8c/0x108
[ 165.254149] el0t_64_sync+0x198/0x1a0
[ 165.257693] Code: aa0103e0 140003db 00000000 00000000 (aa0103e0)
[ 165.263492] ---[ end trace 0000000000000000 ]---
[ 165.322068] Kernel panic - not syncing: Oops - BTI: Fatal exception
[ 165.328507] SMP: stopping secondary CPUs
[ 165.332519] Kernel Offset: 0x2ece48e90000 from 0xffff800080000000
[ 165.338348] PHYS_OFFSET: 0x40000000
[ 165.341679] CPU features: 0x0,00000000,50024c43,15267ea7
[ 165.346968] Memory Limit: none
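The ESR value on the "Oops - BTI" line can be decoded by hand to confirm what kind of branch faulted. The following is a minimal Python sketch, not kernel or driver code; the function name decode_bti_esr is hypothetical. It assumes the standard ESR_ELx layout, where bits 31:26 hold the exception class (0x0D for a Branch Target Exception) and, for that class, the low two bits hold the BTYPE that was in effect.

```python
ESR_EC_SHIFT = 26
ESR_EC_MASK = 0x3F
ESR_EC_BTI = 0x0D  # exception class for a Branch Target Exception


def decode_bti_esr(esr):
    """Split an ESR_ELx value into exception class and BTYPE (hypothetical helper)."""
    ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK
    return {
        "ec": ec,
        "is_bti": ec == ESR_EC_BTI,
        # For a Branch Target Exception, ISS[1:0] records PSTATE.BTYPE.
        "btype": esr & 0b11,
    }


if __name__ == "__main__":
    info = decode_bti_esr(0x36000002)  # value taken from the Oops line above
    print(hex(info["ec"]), info["is_bti"], bin(info["btype"]))
```

For 0x36000002 this yields exception class 0x0D and BTYPE 0b10, the value set by a BLR instruction, which agrees with BTYPE=-c in the pstate line.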
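The crash occurs at the entry of _portMemAllocatorAllocNonPagedWrapper (pc is at +0x0), which _portMemAllocatorAlloc reaches through an indirect call (the blr x8 at +0x28, matching lr at +0x2c). Disassembly shows that neither function begins with the BTI landing-pad instruction that a BTI-enabled kernel expects at an indirect branch target: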
$ llvm-objdump nvidia.ko --disassemble-symbols=_portMemAllocatorAllocNonPagedWrapper
nvidia.ko: file format elf64-littleaarch64
Disassembly of section .text:
000000000031af90 <_portMemAllocatorAllocNonPagedWrapper>:
31af90: aa0103e0 mov x0, x1
31af94: 14000000 b 0x31af94 <_portMemAllocatorAllocNonPagedWrapper+0x4>
...
$ llvm-objdump nvidia.ko --disassemble-symbols=_portMemAllocatorAlloc
nvidia.ko: file format elf64-littleaarch64
Disassembly of section .text:
000000000031b100 <_portMemAllocatorAlloc>:
31b100: a9be7bfd stp x29, x30, [sp, #-0x20]!
31b104: a9014ff4 stp x20, x19, [sp, #0x10]
31b108: 910003fd mov x29, sp
31b10c: b4000660 cbz x0, 0x31b1d8 <_portMemAllocatorAlloc+0xd8>
31b110: aa0103f3 mov x19, x1
31b114: 91007021 add x1, x1, #0x1c
31b118: f100743f cmp x1, #0x1d
31b11c: 540006a3 b.lo 0x31b1f0 <_portMemAllocatorAlloc+0xf0>
31b120: f9400008 ldr x8, [x0]
31b124: aa0003f4 mov x20, x0
31b128: d63f0100 blr x8
31b12c: b40010a0 cbz x0, 0x31b340 <_portMemAllocatorAlloc+0x240>
31b130: f9400e89 ldr x9, [x20, #0x18]
31b134: aa0003e8 mov x8, x0
31b138: 91006000 add x0, x0, #0x18
31b13c: b4001029 cbz x9, 0x31b340 <_portMemAllocatorAlloc+0x240>
31b140: 9100612b add x11, x9, #0x18
31b144: f9000113 str x19, [x8]
31b148: d503201f nop
31b14c: d503201f nop
31b150: 885ffd6a ldaxr w10, [x11]
31b154: 1100054a add w10, w10, #0x1
31b158: 880cfd6a stlxr w12, w10, [x11]
31b15c: 35ffffac cbnz w12, 0x31b150 <_portMemAllocatorAlloc+0x50>
31b160: 9100712b add x11, x9, #0x1c
31b164: 885ffd6c ldaxr w12, [x11]
31b168: 1100058c add w12, w12, #0x1
31b16c: 880dfd6c stlxr w13, w12, [x11]
31b170: 35ffffad cbnz w13, 0x31b164 <_portMemAllocatorAlloc+0x64>
31b174: 9100a12c add x12, x9, #0x28
31b178: d503201f nop
31b17c: d503201f nop
31b180: c85ffd8b ldaxr x11, [x12]
31b184: 8b13016b add x11, x11, x19
31b188: c80dfd8b stlxr w13, x11, [x12]
31b18c: 35ffffad cbnz w13, 0x31b180 <_portMemAllocatorAlloc+0x80>
31b190: 9100c12c add x12, x9, #0x30
31b194: c85ffd8d ldaxr x13, [x12]
31b198: 8b1301ad add x13, x13, x19
31b19c: c80efd8d stlxr w14, x13, [x12]
31b1a0: 35ffffae cbnz w14, 0x31b194 <_portMemAllocatorAlloc+0x94>
31b1a4: aa0903ec mov x12, x9
31b1a8: f8438d8d ldr x13, [x12, #0x38]!
31b1ac: 14000003 b 0x31b1b8 <_portMemAllocatorAlloc+0xb8>
31b1b0: d5033f5f clrex
31b1b4: f940018d ldr x13, [x12]
31b1b8: eb0d017f cmp x11, x13
31b1bc: 540001e9 b.ls 0x31b1f8 <_portMemAllocatorAlloc+0xf8>
31b1c0: c85ffd8e ldaxr x14, [x12]
31b1c4: eb0d01df cmp x14, x13
31b1c8: 54ffff41 b.ne 0x31b1b0 <_portMemAllocatorAlloc+0xb0>
31b1cc: c80efd8b stlxr w14, x11, [x12]
31b1d0: 35ffff8e cbnz w14, 0x31b1c0 <_portMemAllocatorAlloc+0xc0>
31b1d4: 17fffff8 b 0x31b1b4 <_portMemAllocatorAlloc+0xb4>
31b1d8: 94000000 bl 0x31b1d8 <_portMemAllocatorAlloc+0xd8>
31b1dc: 72001c1f tst w0, #0xff
31b1e0: 54000080 b.eq 0x31b1f0 <_portMemAllocatorAlloc+0xf0>
31b1e4: 94000000 bl 0x31b1e4 <_portMemAllocatorAlloc+0xe4>
31b1e8: aa1f03e0 mov x0, xzr
31b1ec: 14000055 b 0x31b340 <_portMemAllocatorAlloc+0x240>
31b1f0: aa1f03e0 mov x0, xzr
31b1f4: 14000053 b 0x31b340 <_portMemAllocatorAlloc+0x240>
31b1f8: 9100812d add x13, x9, #0x20
31b1fc: b94001ae ldr w14, [x13]
31b200: f940018f ldr x15, [x12]
31b204: eb0f017f cmp x11, x15
31b208: 54000141 b.ne 0x31b230 <_portMemAllocatorAlloc+0x130>
31b20c: d503201f nop
31b210: 885ffdaf ldaxr w15, [x13]
31b214: 6b0e01ff cmp w15, w14
31b218: 54000081 b.ne 0x31b228 <_portMemAllocatorAlloc+0x128>
31b21c: 880ffdaa stlxr w15, w10, [x13]
31b220: 35ffff8f cbnz w15, 0x31b210 <_portMemAllocatorAlloc+0x110>
31b224: 14000003 b 0x31b230 <_portMemAllocatorAlloc+0x130>
31b228: d5033f5f clrex
31b22c: 17fffff4 b 0x31b1fc <_portMemAllocatorAlloc+0xfc>
31b230: 9000000b adrp x11, 0x31b000 <portMemShutdown+0x50>
31b234: 9100016b add x11, x11, #0x0
31b238: 885ffd6a ldaxr w10, [x11]
31b23c: 1100054a add w10, w10, #0x1
31b240: 880cfd6a stlxr w12, w10, [x11]
31b244: 35ffff6c cbnz w12, 0x31b230 <_portMemAllocatorAlloc+0x130>
31b248: 9000000b adrp x11, 0x31b000 <portMemShutdown+0x50>
31b24c: 9100016b add x11, x11, #0x0
31b250: 885ffd6c ldaxr w12, [x11]
31b254: 1100058c add w12, w12, #0x1
31b258: 880dfd6c stlxr w13, w12, [x11]
31b25c: 35ffffad cbnz w13, 0x31b250 <_portMemAllocatorAlloc+0x150>
31b260: 9000000c adrp x12, 0x31b000 <portMemShutdown+0x50>
31b264: 9100018c add x12, x12, #0x0
31b268: d503201f nop
31b26c: d503201f nop
31b270: c85ffd8b ldaxr x11, [x12]
31b274: 8b13016b add x11, x11, x19
31b278: c80dfd8b stlxr w13, x11, [x12]
31b27c: 35ffffad cbnz w13, 0x31b270 <_portMemAllocatorAlloc+0x170>
31b280: 9000000c adrp x12, 0x31b000 <portMemShutdown+0x50>
31b284: 9100018c add x12, x12, #0x0
31b288: d503201f nop
31b28c: d503201f nop
31b290: c85ffd8d ldaxr x13, [x12]
31b294: 8b1301ad add x13, x13, x19
31b298: c80efd8d stlxr w14, x13, [x12]
31b29c: 35ffffae cbnz w14, 0x31b290 <_portMemAllocatorAlloc+0x190>
31b2a0: 9000000c adrp x12, 0x31b000 <portMemShutdown+0x50>
31b2a4: f940018d ldr x13, [x12]
31b2a8: eb0d017f cmp x11, x13
31b2ac: 540001e9 b.ls 0x31b2e8 <_portMemAllocatorAlloc+0x1e8>
31b2b0: 9000000e adrp x14, 0x31b000 <portMemShutdown+0x50>
31b2b4: 910001ce add x14, x14, #0x0
31b2b8: 14000006 b 0x31b2d0 <_portMemAllocatorAlloc+0x1d0>
31b2bc: d5033f5f clrex
31b2c0: f940018d ldr x13, [x12]
31b2c4: eb0d017f cmp x11, x13
31b2c8: 54000109 b.ls 0x31b2e8 <_portMemAllocatorAlloc+0x1e8>
31b2cc: d503201f nop
31b2d0: c85ffdcf ldaxr x15, [x14]
31b2d4: eb0d01ff cmp x15, x13
31b2d8: 54ffff21 b.ne 0x31b2bc <_portMemAllocatorAlloc+0x1bc>
31b2dc: c80ffdcb stlxr w15, x11, [x14]
31b2e0: 35ffff8f cbnz w15, 0x31b2d0 <_portMemAllocatorAlloc+0x1d0>
31b2e4: 17fffff7 b 0x31b2c0 <_portMemAllocatorAlloc+0x1c0>
31b2e8: 9000000c adrp x12, 0x31b000 <portMemShutdown+0x50>
31b2ec: 9100018c add x12, x12, #0x0
31b2f0: b940018d ldr w13, [x12]
31b2f4: f9400d8e ldr x14, [x12, #0x18]
31b2f8: eb0e017f cmp x11, x14
31b2fc: 54000121 b.ne 0x31b320 <_portMemAllocatorAlloc+0x220>
31b300: 885ffd8e ldaxr w14, [x12]
31b304: 6b0d01df cmp w14, w13
31b308: 54000081 b.ne 0x31b318 <_portMemAllocatorAlloc+0x218>
31b30c: 880efd8a stlxr w14, w10, [x12]
31b310: 35ffff8e cbnz w14, 0x31b300 <_portMemAllocatorAlloc+0x200>
31b314: 14000003 b 0x31b320 <_portMemAllocatorAlloc+0x220>
31b318: d5033f5f clrex
31b31c: 17fffff5 b 0x31b2f0 <_portMemAllocatorAlloc+0x1f0>
31b320: f9400129 ldr x9, [x9]
31b324: 528c2c8a mov w10, #0x6164 // =24932
31b328: 72ad0caa movk w10, #0x6865, lsl #16
31b32c: b900110a str w10, [x8, #0x10]
31b330: f9000509 str x9, [x8, #0x8]
31b334: 528d2d88 mov w8, #0x696c // =26988
31b338: 72ae8c28 movk w8, #0x7461, lsl #16
31b33c: b8336808 str w8, [x0, x19]
31b340: a9414ff4 ldp x20, x19, [sp, #0x10]
31b344: a8c27bfd ldp x29, x30, [sp], #0x20
31b348: d65f03c0 ret
31b34c: 00000000 udf #265
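The same check can be automated over llvm-objdump output. The following is a rough Python sketch under stated assumptions: the helper names are hypothetical, the parser only understands the "address: word mnemonic" layout shown above, and it only tests for the instructions that are valid targets of a BLR (bti c, bti jc, and the paciasp/pacibsp hints, which act as implicit landing pads).

```python
import re

# Encodings of instructions that may legally be the target of a BLR
# when the page is guarded (all are HINT-space instructions).
BLR_LANDING_PADS = {
    0xD503245F: "bti c",
    0xD50324DF: "bti jc",
    0xD503233F: "paciasp",
    0xD503237F: "pacibsp",
}

SYMBOL_RE = re.compile(r"^[0-9a-f]+ <(?P<name>[^>]+)>:$")
INSN_RE = re.compile(r"^[0-9a-f]+:\s+(?P<word>[0-9a-f]{8})\b")


def first_instruction_words(disassembly):
    """Map each symbol in objdump-style text to its first instruction word."""
    words = {}
    current = None
    for raw in disassembly.splitlines():
        line = raw.strip()
        sym = SYMBOL_RE.match(line)
        if sym:
            current = sym.group("name")
            continue
        insn = INSN_RE.match(line)
        if insn and current is not None and current not in words:
            words[current] = int(insn.group("word"), 16)
    return words


def accepts_blr(word):
    """True if the instruction word is a valid BLR landing pad."""
    return word in BLR_LANDING_PADS


if __name__ == "__main__":
    sample = """\
000000000031af90 <_portMemAllocatorAllocNonPagedWrapper>:
31af90: aa0103e0 mov x0, x1
31af94: 14000000 b 0x31af94 <_portMemAllocatorAllocNonPagedWrapper+0x4>
"""
    for name, word in first_instruction_words(sample).items():
        print(name, hex(word), "ok" if accepts_blr(word) else "missing landing pad")
```

Run against the two listings above, both functions are reported as missing a landing pad: their first words are aa0103e0 (mov) and a9be7bfd (stp).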
To Reproduce
We can reproduce this crash by setting CONFIG_ARM64_BTI and CONFIG_ARM64_BTI_KERNEL in our v6.6.72-based arm64 kernel, which is cross-compiled on an x86_64 host for an arm64 target. The crash is consistent and happens with every version of open-gpu-kernel-modules available at this time.
Bug Incidence
Always
nvidia-bug-report.log.gz
The kernel crash prevents us from capturing nvidia-bug-report.log.gz.
More Info
We believe the driver build system does not emit BTI instructions everywhere they are needed. The "src" directory is not compiled with -mbranch-protection=bti, but the "kernel-open" directory is. This may be because "kernel-open" contains the Kbuild file and inherits the kernel's configuration, while "src" does not appear to. Hardcoding the flags in the build, as in the patch below, produces BTI instructions in all the right places, which demonstrates the problem.
--- a/utils.mk 2025-01-30 00:38:50.222119248 -0800
+++ b/utils.mk 2025-01-30 00:38:43.586047595 -0800
@@ -167,7 +167,7 @@ ifeq ($(TARGET_ARCH),armv7l)
endif
ifeq ($(TARGET_ARCH),aarch64)
- CFLAGS += -DNV_AARCH64 -DNV_ARCH_BITS=64
+ CFLAGS += -DNV_AARCH64 -DNV_ARCH_BITS=64 -mbranch-protection=pac-ret+bti -march=armv8.5-a -DARM64_ASM_ARCH='"armv8.5-a"'
endif
ifeq ($(TARGET_ARCH),ppc64le)
--- a/src/nvidia-modeset/Makefile
+++ b/src/nvidia-modeset/Makefile
@@ -94,7 +94,6 @@ endif
ifeq ($(TARGET_ARCH),aarch64)
CFLAGS += -mgeneral-regs-only
- CFLAGS += -march=armv8-a
CFLAGS += -ffixed-x18
CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mno-outline-atomics)
endif
--- a/src/nvidia/Makefile
+++ b/src/nvidia/Makefile
@@ -88,7 +88,6 @@ endif
ifeq ($(TARGET_ARCH),aarch64)
CFLAGS += -mgeneral-regs-only
- CFLAGS += -march=armv8-a
CFLAGS += -mstrict-align
CFLAGS += -ffixed-x18
CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mno-outline-atomics)
--- a/kernel-open/Kbuild
+++ b/kernel-open/Kbuild
@@ -109,7 +109,7 @@ endif
EXTRA_CFLAGS += -ffreestanding
ifeq ($(ARCH),arm64)
- EXTRA_CFLAGS += -mgeneral-regs-only -march=armv8-a
+ EXTRA_CFLAGS += -mgeneral-regs-only
EXTRA_CFLAGS += $(call cc-option,-mno-outline-atomics,)
endif
@arnav-kansal I believe this issue is along the same lines as an issue that was opened for CFI violations. @aritger has a good write-up on our build system design and what users can do to work with it.
https://github.com/NVIDIA/open-gpu-kernel-modules/issues/439#issuecomment-1378122108
Quoting the comment directly:
Thanks for the feedback. That is a fair critique.
For whatever it is worth, there are a few motivations for the split:
(1) Historically, the non-kbuild part (the part that produces nv-kernel.o) was built internally to NVIDIA and is what was distributed as binary-only. Code not built for a specific target kernel cannot use kbuild.
(2) With the advent of open-gpu-kernel-modules, we chose to retain that split so that users installing the driver wouldn't be required to build all of the kernel module when installing the driver. I.e., installing the driver from the NVIDIA .run file contains a pre-built open-gpu-kernel-modules nv-kernel.o. We can only do that because nv-kernel.o is not kernel-specific. Currently, open-gpu-kernel-modules takes about 10 minutes to build if single threaded. Much of that can be covered with a parallel build, but we didn't want to add that install time for every user installing from .run file if we didn't need to.
The big disadvantage of the split is of course that you need to match these sorts of compiler flags across the split if doing instrumentation like RAP.
Maybe the benefits of (2) are outweighed by the downsides and we should revisit that decision.
That is at least the context. So, I don't know if we can immediately move to an all kbuild-native build.
The nv_encode_caching() bug is a good catch. Thanks for that. Does nv-mmap.c not include nv-proto.h? If not, that is a bug, too. Even with the current split, I would expect the compiler to complain if the prototype and implementation mismatch.
For the near-term, would it be acceptable to pass these additional CFLAGS on the make commandline? Maybe the makefiles need more variable plumbing to facilitate that. But, I think it will be easiest to get traction with something like that, than require kbuild-ifying the entirety of the open-gpu-kernel-modules build. The code changes for that wouldn't be difficult, but the hard part would be the packaging/installation implications of that choice.