dkms: fix building nvidia open kernel modules against clang and thin/full lto compiled kernel
Right now there's some issues when trying to compile open-gpu-kernel-modules against a kernel compiled using Clang and Thin/Full LTO.
Also the script will pick the wrong version of the strip command otherwise the module will fail to compile successfully:
strip: BFD (GNU Binutils) 2.42.0 assertion fail /usr/src/debug/binutils/binutils-gdb/bfd/elf.c:4131
strip: BFD (GNU Binutils) 2.42.0 assertion fail /usr/src/debug/binutils/binutils-gdb/bfd/elf.c:4131
strip: BFD (GNU Binutils) 2.42.0 assertion fail /usr/src/debug/binutils/binutils-gdb/bfd/elf.c:4131
...
Is also necessary to set the OBJCOPY environment variable, otherwise even though the modules will compile the system will refuse to boot on it.
Fixes #416.
Massive thanks for the fix. Can I trouble you for a simple test?
Say, for any/all of the Archlinux targets:
- fetch and install the custom clang/lto kernel
- run the tests
Thank you @evelikov .
Besides all the external modules I have compiling and installing normally on my system. I ran the available dkms tests, this were the results:
Using kernel 6.9.1-2/x86_64
Checking module compression ...
config: CONFIG_MODULE_COMPRESS_ZSTD=y
files: /lib/modules/6.9.1-2/kernel/arch/x86/crypto/aesni-intel.ko.zst
Expected extension: .zst
Preparing a clean test environment
Test framework file hijacking
Adding the test module by version (expected error)
Adding the test module by directory
Adding the test module again (expected error)
Adding the test module by version (expected error)
Building the test module
--- test_cmd_expected_output.log 2024-06-04 16:42:18.060006565 -0300
+++ test_cmd_output.log 2024-06-04 16:42:19.721350689 -0300
@@ -1,6 +1,6 @@
Building module:
Cleaning build area...
-Building module(s)...
+make -j1 KERNELRELEASE=6.9.1-2 -C /usr/lib/modules/6.9.1-2/build M=/var/lib/dkms/dkms_test/1.0/build CC=clang OBJCOPY=llvm-objcopy LD=ld.lld...
Signing module /var/lib/dkms/dkms_test/1.0/build/dkms_test.ko
Cleaning build area...
Error: unexpected output from: dkms build -k 6.9.1-2 -m dkms_test -v 1.0
It builds and install the dkms_test normally but differs only by outputting the command used to build the module.
All seems fine.
Something is causing the exact build command to be printed, instead of the pretty message. This should only happen when the verbose flag is set. Either by:
- calling "dkms --verbose ...", or
- having "verbose" set in the
/etc/dkms/framework*files (check also subfolders) - ... we might also be inheriting the existing shell env. bits - perhaps "unset verbose" beforehand, just in case?
I'm leaning that there's a config file enabling this.
Alternatively... does the test suit work on your end, with normal (non clang/lto) kernel?
Alternatively... does the test suit work on your end, with normal (non clang/lto) kernel?
Nope, so I don't think it's related to the setup to build the kernel.
...
Building the test module
--- test_cmd_expected_output.log 2024-06-04 17:23:35.389147611 -0300
+++ test_cmd_output.log 2024-06-04 17:23:37.280020621 -0300
@@ -1,6 +1,6 @@
Building module:
Cleaning build area...
-Building module(s)...
+make -j1 KERNELRELEASE=6.9.1-arch1-2 -C /usr/lib/modules/6.9.1-arch1-2/build M=/var/lib/dkms/dkms_test/1.0/build...
Signing module /var/lib/dkms/dkms_test/1.0/build/dkms_test.ko
Cleaning build area...
Error: unexpected output from: dkms build -k 6.9.1-arch1-2 -m dkms_test -v 1.0
Checking /etc/dkms/framework:
# Verbosity setting, will be active if set to a non-null value:
# verbose=""
Actually I've set the verbosity /etc/dkms/framework just for testing, and it becomes way more verbose:
...
Building the test module
--- test_cmd_expected_output.log 2024-06-04 17:37:15.738406354 -0300
+++ test_cmd_output.log 2024-06-04 17:37:17.378414527 -0300
@@ -1,6 +1,14 @@
Building module:
-Cleaning build area...
-Building module(s)...
+make -C /usr/lib/modules/6.9.1-arch1-2/build M=/var/lib/dkms/dkms_test/1.0/build clean
+make: Entering directory '/usr/lib/modules/6.9.1-arch1-2/build'
+make: Leaving directory '/usr/lib/modules/6.9.1-arch1-2/build'
+
+{ make -j1 KERNELRELEASE=6.9.1-arch1-2 -C /usr/lib/modules/6.9.1-arch1-2/build M=/var/lib/dkms/dkms_test/1.0/build; } >> /var/lib/dkms/dkms_test/1.0/build/make.log 2>&1
+
Signing module /var/lib/dkms/dkms_test/1.0/build/dkms_test.ko
-Cleaning build area...
+make -C /usr/lib/modules/6.9.1-arch1-2/build M=/var/lib/dkms/dkms_test/1.0/build clean
+make: Entering directory '/usr/lib/modules/6.9.1-arch1-2/build'
+ CLEAN /var/lib/dkms/dkms_test/1.0/build/Module.symvers
+make: Leaving directory '/usr/lib/modules/6.9.1-arch1-2/build'
+
Error: unexpected output from: dkms build -k 6.9.1-arch1-2 -m dkms_test -v 1.0
I think it isn't related to the verbosity in this case.
This is the line in the dkms.in file that outputs the command:
...
# $2 = Description of command to run
...
[[ $verbose ]] && echo -e "$1" || echo -en "$2..."
...
Exactly because $verbose is null, the second output is given echo -en "$2...". Somehow the actual command is getting inside the $2 instead of only the description of what it's being done.
First, sorry about that, right now the arch linux package is sitting on version 3.0.12, I had to build 3.0.13 myself. Now the tests passes but with some caveats:
Tests output: dkms_tests_output.txt
Here are the changes I made to the run_test.sh (it isn't a suggestion to change to exact this, as some things I had to hardcode because I couldn't get around it):
@@ -496,9 +496,9 @@
echo 'Checking modinfo'
run_with_expected_output sh -c "modinfo /lib/modules/${KERNEL_VER}/${expected_dest_loc}/dkms_test.ko${mod_compression_ext} | head -n 4" << EOF
filename: /lib/modules/${KERNEL_VER}/${expected_dest_loc}/dkms_test.ko${mod_compression_ext}
-version: 1.0
-description: A Simple dkms test module
license: GPL
+description: A Simple dkms test module
+version: 1.0
EOF
if (( NO_SIGNING_TOOL == 0 )); then
@@ -621,9 +621,9 @@
echo 'Checking modinfo'
run_with_expected_output sh -c "modinfo /lib/modules/${KERNEL_VER}/${expected_dest_loc}/dkms_test.ko${mod_compression_ext} | head -n 4" << EOF
filename: /lib/modules/${KERNEL_VER}/${expected_dest_loc}/dkms_test.ko${mod_compression_ext}
-version: 1.0
-description: A Simple dkms test module
license: GPL
+description: A Simple dkms test module
+version: 1.0
EOF
if (( NO_SIGNING_TOOL == 0 )); then
@@ -1419,7 +1419,7 @@
Cleaning build area...
Building module(s)...(bad exit status: 2)
Failed command:
-make -j1 KERNELRELEASE=${KERNEL_VER} all
+make -j1 KERNELRELEASE=${KERNEL_VER} all CC=clang OBJCOPY=llvm-objcopy LD=ld.lld
Error! Bad return status for module build on kernel: ${KERNEL_VER} (${KERNEL_ARCH})
Consult /var/lib/dkms/dkms_failing_test/1.0/build/make.log for more information.
dkms autoinstall on ${KERNEL_VER}/${KERNEL_ARCH} failed for dkms_failing_test(10)
@@ -1438,7 +1438,7 @@
Cleaning build area...
Building module(s)...(bad exit status: 2)
Failed command:
-make -j1 KERNELRELEASE=${KERNEL_VER} all
+make -j1 KERNELRELEASE=${KERNEL_VER} all CC=clang OBJCOPY=llvm-objcopy LD=ld.lld
Error! Bad return status for module build on kernel: ${KERNEL_VER} (${KERNEL_ARCH})
Consult /var/lib/dkms/dkms_failing_test/1.0/build/make.log for more information.
dkms autoinstall on ${KERNEL_VER}/${KERNEL_ARCH} failed for dkms_failing_test(10)
@@ -1587,7 +1587,7 @@
Cleaning build area...
Building module(s)...(bad exit status: 2)
Failed command:
-make -j1 KERNELRELEASE=${KERNEL_VER} all
+make -j1 KERNELRELEASE=${KERNEL_VER} all CC=clang OBJCOPY=llvm-objcopy LD=ld.lld
Error! Bad return status for module build on kernel: ${KERNEL_VER} (${KERNEL_ARCH})
Consult /var/lib/dkms/dkms_failing_test/1.0/build/make.log for more information.
The explanation, my modinfo shows first the license, then the description and only then the version:
$ modinfo dkms_test.ko.zst
filename: /lib/modules/6.9.1-2/updates/dkms/dkms_test.ko.zst
license: GPL
description: A Simple dkms test module
version: 1.0
vermagic: 6.9.1-2 SMP preempt mod_unload
name: dkms_test
...
And about the variables CC, LD and OBJCOPY, right now the tests doesn't expect to encounter them, this isn't inherently from my PR, as we already been exposing the CC, and LD before (so the tests will fail on a clang compiled kernel even without my patch, so we may have to fix that. I even tried, but the said variables aren't exposed so we can't do something like CC=${CC} and so on, because CC isn't defined).
One suggestion would be to use in the run_test.sh a mechanism similar to what is done in the dkms.in right now to tell if the kernel was compiled using clang. Or we could change the dkms.in make_commands to include the CC, LD, etc. even when compiling using gcc, that would make the changes needed in the run_test.sh minimal.
Other than that the tests passes normally.
Don't recall personally seeing varying order of the modinfo output... even tough it was reported before. Let me apply a quick hack for dkms and will set a fix for kmod upstream in a bit.
Looking at the make messages, I think we can strip the extra CC/LD/OBJDUMP entries in genericize_expected_output().
Let me prep a PR with the above two fixes, then you can rebase (+ hopefully add some tests) your work on top.
PR https://github.com/dell/dkms/pull/422 should address the issues - please give it a test. Thanks in advance.
Looking at the kernel documentation - I wonder if we shouldn't just use the LLVM=... variable all together. I'm not sure if tracking each and every variable, is going to be portable across multiple kernel releases.
[1] https://www.kernel.org/doc/html/latest/kbuild/llvm.html
Yes, #422 fixes the failing tests for clang.
Looking at the kernel documentation - I wonder if we shouldn't just use the LLVM=... variable all together. I'm not sure if tracking each and every variable, is going to be portable across multiple kernel releases.
[1] https://www.kernel.org/doc/html/latest/kbuild/llvm.html
Yes that would be better indeed (also no change in the tests would be needed, unless you want to remove the CC, LD, bits). Would you like me to make this changes reflect this PR? We would still have to check for clang to use the correct strip though (even though LLVM=1 sets the STRIP envar).
Would you like me to make this changes reflect this PR?
Whichever you're comfortable with, really.
We would still have to check for clang to use the correct strip though (even though LLVM=1 sets the STRIP envar).
We do check the STRIP variable (array really) from the module's dkms.conf and make it an in-dkms variable. Even so we can technically honour the STRIP set by LLVM=1. If you choose to do this (again up-to you), please update the man page.
Sorry about that, I was trying not to create a bunch of commits while rebasing my local branch lol 😅
We do check the STRIP variable (array really) from the module's dkms.conf and make it an in-dkms variable. Even so we can technically honour the STRIP set by LLVM=1.
I didn't understand this part, for example, without these changes:
- [[ ${strip[$count]} != no ]] && strip -g "$built_module"
+ if [[ ${strip[$count]} != no ]] && [[ ${CC} == "clang" ]]; then
+ llvm-strip -g "$built_module"
+ elif [[ ${strip[$count]} != no ]]; then
+ strip -g "$built_module"
+ fi
Even when using the LLVM=1 the wrong strip command will be picked. That's what I meant by still have to check for clang to use the right strip command.
If you choose to do this (again up-to you), please update the man page.
If I understood the above incorrectly or you have any suggestion I'll be more than happy to implement and update this PR 😄
OK, nvm, I understand now:
case ${STRIP[$index]} in
[nN]*)
strip[$index]="no"
;;
[yY]*)
strip[$index]="yes"
;;
'')
strip[$index]=${strip[0]:-yes}
;;
esac
The STRIP envar and the dkms.conf's STRIP array are two different things (got a little confused there). I get it now.
Well, if you don't like the approach of using the same logic as before (checking if the module should be stripped):
if [[ ${strip[$count]} != no ]] && [[ ${CC} == "clang" ]]; then
llvm-strip -g "$built_module"
elif [[ ${strip[$count]} != no ]]; then
strip -g "$built_module"
fi
I can think of something else, but really I don't find it deemed necessary. The result would be the same, if the STRIP of dkms.conf is set "yes" the modules will be stripped, if set to "no", no stripping, and if empty the modules will be stripped, all paths leads to the right strip command being used and the LLVM=1 STRIP being honored no matter what.
Now, if I understood all wrong and by "honoring the STRIP set by LLVM=1" you meant that if clang is being used, automatically we will use LLVM=1and inherently STRIP=llvm-strip will be set, meaning that the modules should be stripped by llvm-strip no matter what, I don't think it's a good idea, having control of when to strip modules in dkms.conf is better. (basically the way it's right now).
Sorry for the radio silence - I've been offline for the last few weeks. Will have a look later today/tomorrow and if there's only minor nits, will amend and push.
@evelikov Would be cool, if you could review this in the near future. 560 Driver will use the open modules as default and this fixed an outstanding issue with them.
@evelikov friendly ping 😅
Were using this now too on CachyOS as default, mainly due the nvidia-open-dkms issue. Has been used for 1 months and no issues reported.
Thanks for the poke, I was taking a short holiday from dkms. Checking through now, will merge later tonight/early tomorrow....
I'm thinking how we can easily test this. If anyone has patches that would be welcome.
This patch certainly works for me. I think it should be fixed asap as lto-ed kernels are quite unusable with dkms without it