dkms icon indicating copy to clipboard operation
dkms copied to clipboard

dkms: fix building nvidia open kernel modules against clang and thin/full lto compiled kernel

Open ltsdw opened this issue 1 year ago • 12 comments

Right now there's some issues when trying to compile open-gpu-kernel-modules against a kernel compiled using Clang and Thin/Full LTO.

Also the script will pick the wrong version of the strip command otherwise the module will fail to compile successfully:

strip: BFD (GNU Binutils) 2.42.0 assertion fail /usr/src/debug/binutils/binutils-gdb/bfd/elf.c:4131
strip: BFD (GNU Binutils) 2.42.0 assertion fail /usr/src/debug/binutils/binutils-gdb/bfd/elf.c:4131
strip: BFD (GNU Binutils) 2.42.0 assertion fail /usr/src/debug/binutils/binutils-gdb/bfd/elf.c:4131
...

Is also necessary to set the OBJCOPY environment variable, otherwise even though the modules will compile the system will refuse to boot on it.

Fixes #416.

ltsdw avatar May 21 '24 01:05 ltsdw

Massive thanks for the fix. Can I trouble you for a simple test?

Say, for any/all of the Archlinux targets:

  • fetch and install the custom clang/lto kernel
  • run the tests

evelikov avatar Jun 04 '24 19:06 evelikov

Thank you @evelikov .

Besides all the external modules I have compiling and installing normally on my system. I ran the available dkms tests, this were the results:

Using kernel 6.9.1-2/x86_64
Checking module compression ...
config: CONFIG_MODULE_COMPRESS_ZSTD=y
files: /lib/modules/6.9.1-2/kernel/arch/x86/crypto/aesni-intel.ko.zst
Expected extension: .zst
Preparing a clean test environment
Test framework file hijacking
Adding the test module by version (expected error)
Adding the test module by directory
Adding the test module again (expected error)
Adding the test module by version (expected error)
Building the test module
--- test_cmd_expected_output.log	2024-06-04 16:42:18.060006565 -0300
+++ test_cmd_output.log	2024-06-04 16:42:19.721350689 -0300
@@ -1,6 +1,6 @@

 Building module:
 Cleaning build area...
-Building module(s)...
+make -j1 KERNELRELEASE=6.9.1-2 -C /usr/lib/modules/6.9.1-2/build M=/var/lib/dkms/dkms_test/1.0/build CC=clang OBJCOPY=llvm-objcopy LD=ld.lld...
 Signing module /var/lib/dkms/dkms_test/1.0/build/dkms_test.ko
 Cleaning build area...
Error: unexpected output from: dkms build -k 6.9.1-2 -m dkms_test -v 1.0

It builds and install the dkms_test normally but differs only by outputting the command used to build the module.

All seems fine.

ltsdw avatar Jun 04 '24 19:06 ltsdw

Something is causing the exact build command to be printed, instead of the pretty message. This should only happen when the verbose flag is set. Either by:

  • calling "dkms --verbose ...", or
  • having "verbose" set in the /etc/dkms/framework* files (check also subfolders)
  • ... we might also be inheriting the existing shell env. bits - perhaps "unset verbose" beforehand, just in case?

I'm leaning that there's a config file enabling this.

Alternatively... does the test suit work on your end, with normal (non clang/lto) kernel?

evelikov avatar Jun 04 '24 20:06 evelikov

Alternatively... does the test suit work on your end, with normal (non clang/lto) kernel?

Nope, so I don't think it's related to the setup to build the kernel.

...
Building the test module
--- test_cmd_expected_output.log	2024-06-04 17:23:35.389147611 -0300
+++ test_cmd_output.log	2024-06-04 17:23:37.280020621 -0300
@@ -1,6 +1,6 @@
 
 Building module:
 Cleaning build area...
-Building module(s)...
+make -j1 KERNELRELEASE=6.9.1-arch1-2 -C /usr/lib/modules/6.9.1-arch1-2/build M=/var/lib/dkms/dkms_test/1.0/build...
 Signing module /var/lib/dkms/dkms_test/1.0/build/dkms_test.ko
 Cleaning build area...
Error: unexpected output from: dkms build -k 6.9.1-arch1-2 -m dkms_test -v 1.0

Checking /etc/dkms/framework:

# Verbosity setting, will be active if set to a non-null value:
# verbose=""

Actually I've set the verbosity /etc/dkms/framework just for testing, and it becomes way more verbose:

...
Building the test module
--- test_cmd_expected_output.log	2024-06-04 17:37:15.738406354 -0300
+++ test_cmd_output.log	2024-06-04 17:37:17.378414527 -0300
@@ -1,6 +1,14 @@
 
 Building module:
-Cleaning build area...
-Building module(s)...
+make -C /usr/lib/modules/6.9.1-arch1-2/build M=/var/lib/dkms/dkms_test/1.0/build clean
+make: Entering directory '/usr/lib/modules/6.9.1-arch1-2/build'
+make: Leaving directory '/usr/lib/modules/6.9.1-arch1-2/build'
+
+{ make -j1 KERNELRELEASE=6.9.1-arch1-2 -C /usr/lib/modules/6.9.1-arch1-2/build M=/var/lib/dkms/dkms_test/1.0/build; } >> /var/lib/dkms/dkms_test/1.0/build/make.log 2>&1
+
 Signing module /var/lib/dkms/dkms_test/1.0/build/dkms_test.ko
-Cleaning build area...
+make -C /usr/lib/modules/6.9.1-arch1-2/build M=/var/lib/dkms/dkms_test/1.0/build clean
+make: Entering directory '/usr/lib/modules/6.9.1-arch1-2/build'
+  CLEAN   /var/lib/dkms/dkms_test/1.0/build/Module.symvers
+make: Leaving directory '/usr/lib/modules/6.9.1-arch1-2/build'
+
Error: unexpected output from: dkms build -k 6.9.1-arch1-2 -m dkms_test -v 1.0

I think it isn't related to the verbosity in this case.

ltsdw avatar Jun 04 '24 20:06 ltsdw

This is the line in the dkms.in file that outputs the command:

...
# $2 = Description of command to run
...
[[ $verbose ]] && echo -e "$1" || echo -en "$2..."
...

Exactly because $verbose is null, the second output is given echo -en "$2...". Somehow the actual command is getting inside the $2 instead of only the description of what it's being done.

ltsdw avatar Jun 04 '24 20:06 ltsdw

First, sorry about that, right now the arch linux package is sitting on version 3.0.12, I had to build 3.0.13 myself. Now the tests passes but with some caveats:

Tests output: dkms_tests_output.txt

Here are the changes I made to the run_test.sh (it isn't a suggestion to change to exact this, as some things I had to hardcode because I couldn't get around it):

@@ -496,9 +496,9 @@
 echo 'Checking modinfo'
 run_with_expected_output sh -c "modinfo /lib/modules/${KERNEL_VER}/${expected_dest_loc}/dkms_test.ko${mod_compression_ext} | head -n 4" << EOF
 filename:       /lib/modules/${KERNEL_VER}/${expected_dest_loc}/dkms_test.ko${mod_compression_ext}
-version:        1.0
-description:    A Simple dkms test module
 license:        GPL
+description:    A Simple dkms test module
+version:        1.0
 EOF
 
 if (( NO_SIGNING_TOOL == 0 )); then
@@ -621,9 +621,9 @@
 echo 'Checking modinfo'
 run_with_expected_output sh -c "modinfo /lib/modules/${KERNEL_VER}/${expected_dest_loc}/dkms_test.ko${mod_compression_ext} | head -n 4" << EOF
 filename:       /lib/modules/${KERNEL_VER}/${expected_dest_loc}/dkms_test.ko${mod_compression_ext}
-version:        1.0
-description:    A Simple dkms test module
 license:        GPL
+description:    A Simple dkms test module
+version:        1.0
 EOF
 
 if (( NO_SIGNING_TOOL == 0 )); then
@@ -1419,7 +1419,7 @@
 Cleaning build area...
 Building module(s)...(bad exit status: 2)
 Failed command:
-make -j1 KERNELRELEASE=${KERNEL_VER} all
+make -j1 KERNELRELEASE=${KERNEL_VER} all CC=clang OBJCOPY=llvm-objcopy LD=ld.lld
 Error! Bad return status for module build on kernel: ${KERNEL_VER} (${KERNEL_ARCH})
 Consult /var/lib/dkms/dkms_failing_test/1.0/build/make.log for more information.
 dkms autoinstall on ${KERNEL_VER}/${KERNEL_ARCH} failed for dkms_failing_test(10)
@@ -1438,7 +1438,7 @@
 Cleaning build area...
 Building module(s)...(bad exit status: 2)
 Failed command:
-make -j1 KERNELRELEASE=${KERNEL_VER} all
+make -j1 KERNELRELEASE=${KERNEL_VER} all CC=clang OBJCOPY=llvm-objcopy LD=ld.lld
 Error! Bad return status for module build on kernel: ${KERNEL_VER} (${KERNEL_ARCH})
 Consult /var/lib/dkms/dkms_failing_test/1.0/build/make.log for more information.
 dkms autoinstall on ${KERNEL_VER}/${KERNEL_ARCH} failed for dkms_failing_test(10)
@@ -1587,7 +1587,7 @@
 Cleaning build area...
 Building module(s)...(bad exit status: 2)
 Failed command:
-make -j1 KERNELRELEASE=${KERNEL_VER} all
+make -j1 KERNELRELEASE=${KERNEL_VER} all CC=clang OBJCOPY=llvm-objcopy LD=ld.lld
 Error! Bad return status for module build on kernel: ${KERNEL_VER} (${KERNEL_ARCH})
 Consult /var/lib/dkms/dkms_failing_test/1.0/build/make.log for more information.

The explanation, my modinfo shows first the license, then the description and only then the version:

$ modinfo dkms_test.ko.zst
filename:       /lib/modules/6.9.1-2/updates/dkms/dkms_test.ko.zst
license:        GPL
description:    A Simple dkms test module
version:        1.0
vermagic:       6.9.1-2 SMP preempt mod_unload 
name:           dkms_test
...

And about the variables CC, LD and OBJCOPY, right now the tests doesn't expect to encounter them, this isn't inherently from my PR, as we already been exposing the CC, and LD before (so the tests will fail on a clang compiled kernel even without my patch, so we may have to fix that. I even tried, but the said variables aren't exposed so we can't do something like CC=${CC} and so on, because CC isn't defined).

One suggestion would be to use in the run_test.sh a mechanism similar to what is done in the dkms.in right now to tell if the kernel was compiled using clang. Or we could change the dkms.in make_commands to include the CC, LD, etc. even when compiling using gcc, that would make the changes needed in the run_test.sh minimal.

Other than that the tests passes normally.

ltsdw avatar Jun 04 '24 22:06 ltsdw

Don't recall personally seeing varying order of the modinfo output... even tough it was reported before. Let me apply a quick hack for dkms and will set a fix for kmod upstream in a bit.

Looking at the make messages, I think we can strip the extra CC/LD/OBJDUMP entries in genericize_expected_output().

Let me prep a PR with the above two fixes, then you can rebase (+ hopefully add some tests) your work on top.

evelikov avatar Jun 06 '24 16:06 evelikov

PR https://github.com/dell/dkms/pull/422 should address the issues - please give it a test. Thanks in advance.

Looking at the kernel documentation - I wonder if we shouldn't just use the LLVM=... variable all together. I'm not sure if tracking each and every variable, is going to be portable across multiple kernel releases.

[1] https://www.kernel.org/doc/html/latest/kbuild/llvm.html

evelikov avatar Jun 06 '24 18:06 evelikov

Yes, #422 fixes the failing tests for clang.

Looking at the kernel documentation - I wonder if we shouldn't just use the LLVM=... variable all together. I'm not sure if tracking each and every variable, is going to be portable across multiple kernel releases.

[1] https://www.kernel.org/doc/html/latest/kbuild/llvm.html

Yes that would be better indeed (also no change in the tests would be needed, unless you want to remove the CC, LD, bits). Would you like me to make this changes reflect this PR? We would still have to check for clang to use the correct strip though (even though LLVM=1 sets the STRIP envar).

ltsdw avatar Jun 06 '24 19:06 ltsdw

Would you like me to make this changes reflect this PR?

Whichever you're comfortable with, really.

We would still have to check for clang to use the correct strip though (even though LLVM=1 sets the STRIP envar).

We do check the STRIP variable (array really) from the module's dkms.conf and make it an in-dkms variable. Even so we can technically honour the STRIP set by LLVM=1. If you choose to do this (again up-to you), please update the man page.

evelikov avatar Jun 06 '24 20:06 evelikov

Sorry about that, I was trying not to create a bunch of commits while rebasing my local branch lol 😅

We do check the STRIP variable (array really) from the module's dkms.conf and make it an in-dkms variable. Even so we can technically honour the STRIP set by LLVM=1.

I didn't understand this part, for example, without these changes:

-        [[ ${strip[$count]} != no ]] && strip -g "$built_module"
+        if [[ ${strip[$count]} != no ]] && [[ ${CC} == "clang" ]]; then
+            llvm-strip -g "$built_module"
+        elif [[ ${strip[$count]} != no ]]; then
+            strip -g "$built_module"
+        fi

Even when using the LLVM=1 the wrong strip command will be picked. That's what I meant by still have to check for clang to use the right strip command.

If you choose to do this (again up-to you), please update the man page.

If I understood the above incorrectly or you have any suggestion I'll be more than happy to implement and update this PR 😄

ltsdw avatar Jun 06 '24 22:06 ltsdw

OK, nvm, I understand now:

        case ${STRIP[$index]} in
            [nN]*)
                strip[$index]="no"
                ;;
            [yY]*)
                strip[$index]="yes"
                ;;
            '')
                strip[$index]=${strip[0]:-yes}
                ;;
        esac

The STRIP envar and the dkms.conf's STRIP array are two different things (got a little confused there). I get it now.

Well, if you don't like the approach of using the same logic as before (checking if the module should be stripped):

        if [[ ${strip[$count]} != no ]] && [[ ${CC} == "clang" ]]; then
            llvm-strip -g "$built_module"
        elif [[ ${strip[$count]} != no ]]; then
            strip -g "$built_module"
        fi

I can think of something else, but really I don't find it deemed necessary. The result would be the same, if the STRIP of dkms.conf is set "yes" the modules will be stripped, if set to "no", no stripping, and if empty the modules will be stripped, all paths leads to the right strip command being used and the LLVM=1 STRIP being honored no matter what.

Now, if I understood all wrong and by "honoring the STRIP set by LLVM=1" you meant that if clang is being used, automatically we will use LLVM=1and inherently STRIP=llvm-strip will be set, meaning that the modules should be stripped by llvm-strip no matter what, I don't think it's a good idea, having control of when to strip modules in dkms.conf is better. (basically the way it's right now).

ltsdw avatar Jun 06 '24 23:06 ltsdw

Sorry for the radio silence - I've been offline for the last few weeks. Will have a look later today/tomorrow and if there's only minor nits, will amend and push.

evelikov avatar Jul 05 '24 12:07 evelikov

@evelikov Would be cool, if you could review this in the near future. 560 Driver will use the open modules as default and this fixed an outstanding issue with them.

ptr1337 avatar Aug 04 '24 17:08 ptr1337

@evelikov friendly ping 😅

ltsdw avatar Sep 11 '24 03:09 ltsdw

Were using this now too on CachyOS as default, mainly due the nvidia-open-dkms issue. Has been used for 1 months and no issues reported.

ptr1337 avatar Sep 11 '24 18:09 ptr1337

Thanks for the poke, I was taking a short holiday from dkms. Checking through now, will merge later tonight/early tomorrow....

I'm thinking how we can easily test this. If anyone has patches that would be welcome.

evelikov avatar Sep 11 '24 19:09 evelikov

This patch certainly works for me. I think it should be fixed asap as lto-ed kernels are quite unusable with dkms without it

xuanruiqi avatar Oct 16 '24 16:10 xuanruiqi