open-gpu-kernel-modules

Nvidia driver fails to build on Linux kernels with CONFIG_DEBUG_INFO_BTF_MODULES

Open ryao opened this issue 1 year ago • 11 comments

NVIDIA Open GPU Kernel Modules Version

565.77

Operating System and Version

Gentoo Linux

Kernel Release

6.12.6-gentoo-x86_64

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Build Command

emerge x11-drivers/nvidia-drivers

Terminal output/Build Log

build.log

More Info

That build used MAKEOPTS=-j16, so the command that caused the failure is not obvious. The command that fails is this:

# LLVM_OBJCOPY="/usr/bin/x86_64-pc-linux-gnu-objcopy" pahole -J -j --btf_features=encode_force,var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func,decl_tag_kfuncs --lang_exclude=rust --btf_features=distilled_base --btf_base vmlinux /var/tmp/portage/x11-drivers/nvidia-drivers-565.77/work/kernel-module-source/kernel-open/nvidia-modeset.ko
dwarf_expr: unhandled 0x12 DW_OP_ operation
die__process_function: tag not supported 0x2f (template_type_parameter)!
dwarf_expr: unhandled 0x12 DW_OP_ operation
Unsupported DW_TAG_reference_type(0x10): type: 0x5fb
Encountered error while encoding BTF.

This appears to be related to #149. The reporter there claimed using !buildflags on Arch avoided the problem, although I am not sure how that is possible because the kernel build system does not use regular build flags.

In any case, there is something in the module's DWARF information that pahole does not like. Upgrading to pahole 1.28, which is the latest version, does not resolve the issue.
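
For reference, the numeric codes in the pahole output above are standard DWARF constants. The sketch below (Python, with a hypothetical helper name `decode`) maps just those three codes to their symbolic names. Neither reference types nor template type parameters exist in C, which suggests that the offending compile units were built from C++ sources.

```python
# DWARF constants for the codes that appear in the pahole output above.
# Values are taken from the DWARF standard; only these three are listed.
DW_OP = {0x12: "DW_OP_dup"}
DW_TAG = {
    0x10: "DW_TAG_reference_type",           # C++ reference (T&)
    0x2F: "DW_TAG_template_type_parameter",  # C++ template parameter
}


def decode(kind: str, code: int) -> str:
    """Return the symbolic name for a DWARF 'op' or 'tag' code, or a hex fallback."""
    table = DW_OP if kind == "op" else DW_TAG
    return table.get(code, f"unknown {kind} {code:#x}")


if __name__ == "__main__":
    # The three codes reported by pahole in the log above.
    for kind, code in (("op", 0x12), ("tag", 0x2F), ("tag", 0x10)):
        print(f"{kind} {code:#04x}: {decode(kind, code)}")
```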

ryao avatar Dec 26 '24 04:12 ryao

Renaming my /usr/src/linux/vmlinux binary will allow the build to complete, but this is a hack, as it prevents the BTF debuginfo from being generated.

ryao avatar Dec 26 '24 05:12 ryao

It probably should be noted that reproduction likely requires building against a locally built kernel. I assume that a packaged one won't include the vmlinux binary in /usr/src/linux.

ryao avatar Dec 26 '24 06:12 ryao

I made an internal fix for this that has not made its way into a public release. With that fix, you will not need to move vmlinux, and BTF will generate.

Binary-Eater avatar Dec 30 '24 22:12 Binary-Eater

Here is a commit with our internal fix for this issue: https://github.com/Binary-Eater/open-gpu-kernel-modules/commit/854449a7b76cdb4ad17919a1c8a662a4ff5b943d

Binary-Eater avatar Jan 03 '25 21:01 Binary-Eater

> Upgrading to pahole 1.28, which is the latest version, does not resolve the issue.

This issue occurs only with newer pahole versions. Older pahole versions do not generate BTF for these modules at all, so an update is unlikely to solve it.

As the reporter is using a locally built kernel, they can also skip BTF generation in the meantime:

diff --git a/scripts/Makefile.modfinal b/scripts/Makefile.modfinal
--- a/scripts/Makefile.modfinal
+++ b/scripts/Makefile.modfinal
@@ -41,6 +41,8 @@
       cmd_btf_ko = 							\
 	if [ ! -f vmlinux ]; then					\
 		printf "Skipping BTF generation for %s due to unavailability of vmlinux\n" $@ 1>&2; \
+	elif echo $@ | grep -q "nvidia"; then \
+		printf "Skipping BTF generation for %s because it's an Nvidia module (C++)\n" $@ 1>&2; \
 	else								\
 		LLVM_OBJCOPY="$(OBJCOPY)" $(PAHOLE) -J $(PAHOLE_FLAGS) --btf_base vmlinux $@; \
 		$(RESOLVE_BTFIDS) -b vmlinux $@; 			\

arnav-kansal avatar Jan 31 '25 01:01 arnav-kansal

> Here is a commit with our internal fix for this issue: Binary-Eater@854449a

Thanks again for this. Unfortunately, I did not get around to applying this until today, when I updated to 570.86.16. Without this patch, the build process hangs. With this patch, the build process also hangs. It also does not apply cleanly to 565.77. This issue made troubleshooting #773 "interesting".

ryao avatar Feb 01 '25 23:02 ryao

In Debian I'm also experiencing hanging module builds. It's pahole (1.29) somehow deadlocking. Debian bug report for hanging pahole: https://bugs.debian.org/1100503

Adding --lang_exclude=c++ did not make any difference for me, but Binary-Eater's patch allowed me to append -j1 to the command line, cancelling the preceding -j options and forcing pahole to run sequentially.

anbe42 avatar Mar 14 '25 14:03 anbe42

Seems like I need --lang_exclude=c++11 [1] for the Debian kernel which is compiled with GCC 14. So maybe append both ,c++,c++11 to cover more compiler versions and distributions.

[1] https://lore.kernel.org/dwarves/Z-JzFrXaopQCYd6h@localhost/T/#m7d3a6baed86ac6def78ee45a0d554d4487f84305

anbe42 avatar Apr 01 '25 12:04 anbe42

@anbe42 I am having the exact same issue as you with Debian unstable but I'm afraid I am not quite as capable as you are :-) can you please provide the full path of the makefile you changed, as well as the diff please? Thanks.

mcraveiro avatar Apr 05 '25 21:04 mcraveiro

Actually, I managed to get the "rename vmlinux" workaround to work for me on Debian unstable. For posterity, here's what I did:

# find . -type f -name vmlinux
./sys/kernel/btf/vmlinux
./usr/src/linux-headers-6.12.16-amd64/vmlinux

I ignored the vmlinux under /sys and renamed the one under /usr/src:

# cd /usr/src/linux-headers-6.12.16-amd64/
# mv vmlinux vmlinux.old
# dpkg --configure -a 

Build worked after that.
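
The rename works because, as the scripts/Makefile.modfinal excerpt earlier in the thread shows, BTF generation is skipped when no vmlinux file is present in the kernel build directory. To avoid leaving the headers tree modified, the rename can be wrapped so that it is undone afterwards. A minimal Python sketch, with a hypothetical helper name `hidden_vmlinux`; it needs the same root privileges as the commands above:

```python
import contextlib
import os
from pathlib import Path


@contextlib.contextmanager
def hidden_vmlinux(kernel_dir):
    """Temporarily rename <kernel_dir>/vmlinux to vmlinux.old, then restore it.

    Yields True if a vmlinux file was found and moved, False otherwise.
    Note: an existing vmlinux.old in the same directory is overwritten.
    """
    vmlinux = Path(kernel_dir) / "vmlinux"
    backup = vmlinux.with_name("vmlinux.old")
    moved = vmlinux.exists()
    if moved:
        os.replace(vmlinux, backup)
    try:
        yield moved
    finally:
        # Put the file back even if the build inside the block failed.
        if moved and backup.exists():
            os.replace(backup, vmlinux)


# Example use (path and command are the ones from the post above):
#
#   import subprocess
#   with hidden_vmlinux("/usr/src/linux-headers-6.12.16-amd64"):
#       subprocess.run(["dpkg", "--configure", "-a"], check=True)
```

As with the manual rename, no BTF is generated for the modules built inside the block.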

mcraveiro avatar Apr 05 '25 22:04 mcraveiro

> @anbe42 I am having the exact same issue as you with Debian unstable but I'm afraid I am not quite as capable as you are :-) can you please provide the full path of the makefile you changed, as well as the diff please? Thanks.

https://salsa.debian.org/nvidia-team/nvidia-open-gpu-kernel-modules/-/blob/main/debian/patches/module/0062-Support-BTF-generation-for-non-release-builds.patch?ref_type=heads

anbe42 avatar Apr 09 '25 16:04 anbe42

@Binary-Eater This is still an issue as of 575.57.08. Did your patch never get merged internally?

ryao avatar Jun 07 '25 19:06 ryao

@ryao the patches are present in that release. Are you using the runfile or the repository for building the modules?

open-gpu-kernel-modules on  HEAD (30e15d7) [?] 
❯ rg pahole
kernel-open/Makefile
79:  # propagating pahole's return status (with 'exit system(pahole_cmd)'), to
84:  #     pahole_cmd = "pahole"
87:  #             pahole_cmd = pahole_cmd sprintf(" %s,c++", ARGV[i])
89:  #             pahole_cmd = pahole_cmd sprintf(" %s", ARGV[i])
92:  #     system(pahole_cmd)
94:  PAHOLE_AWK_PROGRAM = BEGIN { pahole_cmd = \"pahole\"; for (i = 1; i < ARGC; i++) { if (ARGV[i] ~ /--lang_exclude=/) { pahole_cmd = pahole_cmd sprintf(\" %s,c++\", ARGV[i]); } else { pahole_cmd = pahole_cmd sprintf(\" %s\", ARGV[i]); } } system(pahole_cmd); }
95:  # If scripts/pahole-flags.sh is not present in the kernel tree, add PAHOLE and
98:  PAHOLE_VARIABLES=$(if $(wildcard $(KERNEL_SOURCES)/scripts/pahole-flags.sh),,"PAHOLE=$(AWK) '$(PAHOLE_AWK_PROGRAM)'")
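
The one-line awk program on line 94 of kernel-open/Makefile is dense. Read as a sketch, it rebuilds the pahole command line and appends ,c++ to any --lang_exclude= option it finds. A rough Python rendering of that logic follows. The function names and the `extra_langs` parameter are illustrative and not part of the driver, and unlike the awk version it matches the option by prefix and runs pahole without going through a shell.

```python
import subprocess


def build_pahole_argv(args, extra_langs=("c++",)):
    """Return a pahole argv with extra languages appended to each --lang_exclude= option."""
    suffix = "".join(f",{lang}" for lang in extra_langs)
    argv = ["pahole"]
    for arg in args:
        if arg.startswith("--lang_exclude="):
            # e.g. --lang_exclude=rust becomes --lang_exclude=rust,c++
            argv.append(arg + suffix)
        else:
            argv.append(arg)
    return argv


def run_pahole(args, extra_langs=("c++",)):
    """Run pahole with the rewritten arguments and return its exit status."""
    return subprocess.run(build_pahole_argv(args, extra_langs)).returncode


if __name__ == "__main__":
    # Shortened version of the failing command from the original report.
    print(build_pahole_argv(["-J", "--lang_exclude=rust", "nvidia-modeset.ko"]))
```

Passing `extra_langs=("c++", "c++11")` would cover the Debian case raised earlier in the thread. Like the awk program, this leaves the command unchanged when no --lang_exclude= option is present.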

If using the repository, can you also confirm what commit you are using?

commit 30e15d79de62e8955eb8b77a6e292ab9b87f52b0 (HEAD, tag: 575.57.08, nvidia/main)
Author: Maneet Singh <[email protected]>
Date:   Thu May 29 10:58:21 2025 -0700

    575.57.08

The issue is resolved for me on Linux kernel 6.12.10-arch1-1, gcc version 14.2.1 20240910 (GCC), and pahole v1.28 on Arch Linux using the repository.

  LD [M]  /home/binary-eater/Documents/open-gpu-kernel-modules/kernel-open/nvidia-drm.ko
  BTF [M] /home/binary-eater/Documents/open-gpu-kernel-modules/kernel-open/nvidia-peermem.ko
  BTF [M] /home/binary-eater/Documents/open-gpu-kernel-modules/kernel-open/nvidia-drm.ko
  BTF [M] /home/binary-eater/Documents/open-gpu-kernel-modules/kernel-open/nvidia-modeset.ko
  BTF [M] /home/binary-eater/Documents/open-gpu-kernel-modules/kernel-open/nvidia.ko
  BTF [M] /home/binary-eater/Documents/open-gpu-kernel-modules/kernel-open/nvidia-uvm.ko
make[2]: Leaving directory '/usr/lib/modules/6.12.10-arch1-1/build'
make[1]: Leaving directory '/home/binary-eater/Documents/open-gpu-kernel-modules/kernel-open'
[binary-eater@BINARY-EATER-TEST open-gpu-kernel-modules]$ 

Binary-Eater avatar Jun 07 '25 20:06 Binary-Eater

> Seems like I need --lang_exclude=c++11 [1] for the Debian kernel which is compiled with GCC 14. So maybe append both ,c++,c++11 to cover more compiler versions and distributions.
>
> [1] https://lore.kernel.org/dwarves/Z-JzFrXaopQCYd6h@localhost/T/#m7d3a6baed86ac6def78ee45a0d554d4487f84305

Thanks @anbe42 for the feedback. Feel free to mention me if you would like me to take it back up in our implementation. Are you saying you needed c++ and c++11 for --lang_exclude? That makes sense since DW_LANG_C_plus_plus and DW_LANG_C_plus_plus_11 are distinct language codes (refer to https://dwarfstd.org/languages.html). In our internal builds, we don't run into this issue with gcc. I also have not seen this problem with Arch Linux. That said, I assume the DWARF generation can depend heavily on how a distribution chooses to build NVIDIA/open-gpu-kernel-modules from source. I'll go ahead and make an update to be more verbose with --lang_exclude.

Binary-Eater avatar Jun 07 '25 20:06 Binary-Eater

@Binary-Eater Which -std=c++XX is used by default for C++ compilation probably depends on the distribution. So you seem to get only DW_LANG_C_plus_plus, while on Debian I see DW_LANG_C_plus_plus_11 instead (IIRC Debian defaults to C++20 on GCC 14, and post-trixie will switch the default to C++23 on GCC 15). If my memory is correct, I only needed to exclude c++11 to make pahole succeed on Debian, but my suggestion would be to pass both to --lang_exclude to be more distribution-agnostic.
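
One way to check which DW_LANG values a given build actually produces is to count the DW_AT_language attributes in the module's DWARF. A minimal Python sketch, assuming binutils readelf is installed and prints lines of the form `DW_AT_language : 4 (C++)`, optionally with a form name such as `(data1)` before the number. The function names are made up for illustration, and the table lists only the C and C++ codes from the DWARF language list linked above.

```python
import re
import subprocess
from collections import Counter

# C and C++ language codes from the DWARF language list.
DW_LANG = {
    0x01: "DW_LANG_C89",
    0x02: "DW_LANG_C",
    0x04: "DW_LANG_C_plus_plus",
    0x0C: "DW_LANG_C99",
    0x19: "DW_LANG_C_plus_plus_03",
    0x1A: "DW_LANG_C_plus_plus_11",
    0x1D: "DW_LANG_C11",
    0x21: "DW_LANG_C_plus_plus_14",
}

# Assumed readelf line shape: "DW_AT_language : [(form)] <number> (<name>)".
_LANG_RE = re.compile(r"DW_AT_language\s*:\s*(?:\(\w+\)\s*)?(0x[0-9a-fA-F]+|\d+)")


def count_languages(readelf_text):
    """Count DW_AT_language values found in `readelf --debug-dump=info` output."""
    counts = Counter()
    for match in _LANG_RE.finditer(readelf_text):
        token = match.group(1)
        code = int(token, 16) if token.lower().startswith("0x") else int(token)
        counts[DW_LANG.get(code, f"unknown ({code:#x})")] += 1
    return counts


def module_languages(path):
    """Run readelf on a module or object file and count its compile-unit languages."""
    result = subprocess.run(
        ["readelf", "--debug-dump=info", path],
        capture_output=True, text=True, check=True,
    )
    return count_languages(result.stdout)
```

Calling `module_languages("kernel-open/nvidia-modeset.ko")` on a built module would then show whether its C++ compile units are tagged DW_LANG_C_plus_plus, DW_LANG_C_plus_plus_11, or something else, which indicates which names need to be passed to --lang_exclude.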

anbe42 avatar Jun 07 '25 23:06 anbe42

@Binary-Eater It turns out that this is no longer reproducible on my machine. The issue appears to be fixed. My apologies for the noise. I am closing this as fixed.

For the long version, I encountered this when updating many packages on my machine while building a new kernel at the same time. I suspect that the genkernel build scripts had attempted to rebuild an older driver that was missing your patch, rather than the latest driver that I had expected it to build. That caused me to hit this again, yet think that I had hit it on the latest driver.

ryao avatar Jun 10 '25 00:06 ryao