
DKMS module for nvidia driver failed to build after upgrade to 43360.

Open hksdpc255 opened this issue 8 months ago • 8 comments

I upgraded Clear Linux from 43300 to 43360. The NVIDIA driver 570.144 failed to build its kernel module and generated an 8.3 MB failure log.

$ head -n 120 /var/log/nvidia-installer.log
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Tue May  6 08:40:14 2025
installer version: 570.144

PATH: /usr/local/bin:/usr/bin/haswell/avx512_1:/usr/bin/haswell:/usr/bin:/opt/3rd-party/bin:/usr/share/bcc/tools:/usr/share/bcc/tools/old:/opt/cuda-12.8.1_570.124.06/bin:/opt/cuda/bin:/opt/nvidia/bin:/usr/share/bcc/tools:/usr/share/bcc/tools/old:/opt/cuda-12.8.1_570.124.06/bin:/opt/cuda/bin:/opt/nvidia/bin

nvidia-installer command line:
    ./nvidia-installer
    --kernel-name=6.6.89-1486.ltsprev
    --no-precompiled-interface
    --no-nvidia-modprobe
    --no-distro-scripts
    --no-rebuild-initramfs
    --skip-module-load
    --no-nouveau-check
    --no-disable-nouveau
    --no-x-check
    --dkms
    --silent
    --allow-installation-with-running-driver
    --kernel-module-type=open
    --compat32-prefix=/opt/nvidia
    --compat32-libdir=lib32
    --x-prefix=/opt/nvidia
    --x-module-path=/opt/nvidia/lib64/xorg/modules
    --x-library-path=/opt/nvidia/lib64
    --x-sysconfig-path=/etc/X11/xorg.conf.d
    --opengl-prefix=/opt/nvidia
    --opengl-libdir=lib64
    --wine-prefix=/opt/nvidia
    --utility-prefix=/opt/nvidia
    --utility-libdir=lib64
    --xdg-data-dir=/opt/nvidia/share
    --documentation-prefix=/opt/nvidia
    --application-profile-path=/etc/nvidia/nvidia-application-profiles-rc.d
    --module-signing-key-path=/opt/nvidia/share
    --force-libglx-indirect
    --glvnd-egl-config-path=/etc/glvnd/egl_vendor.d
    --egl-external-platform-config-path=/etc/egl/egl_external_platform.d
    --systemd-unit-prefix=/usr/local/lib/systemd/system
    --systemd-sleep-prefix=/usr/local/lib/systemd/system-sleep

Using built-in stream user interface
-> Detected 64 CPUs online; setting concurrency level to 32.
-> Scanning the initramfs with lsinitrd...
-> /usr/bin/lsinitrd requires a file path argument, but none was given.
-> /usr/bin/lsinitrd requires a file path argument, but none was given.
-> Initramfs scan failed.
WARNING: Unable to determine the default library path. The path /opt/nvidia/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: Unable to determine the default X library path. The path /opt/nvidia/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: An NVIDIA kernel module 'nvidia-modeset' appears to be already loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Some of the sanity checks that nvidia-installer performs to detect potential installation problems are not possible while an NVIDIA kernel module is running.
-> Would you like to continue installation and skip the sanity checks? If not, please abort the installation, then close any programs which may be using the NVIDIA GPU(s), and attempt installation again. (Answer: Continue installation)
WARNING: Continuing installation despite the presence of a loaded NVIDIA kernel module.  Some sanity checks will not be performed.  It is strongly recommended that you reboot your computer after installation is complete.  If the installation is not successful after rebooting the computer, you can run `nvidia-uninstall` to attempt to remove the NVIDIA driver.
-> Kernel module load tests will be skipped.
-> Installing NVIDIA driver version 570.144.
-> Not probing for precompiled kernel interfaces.
-> Performing CC sanity check with CC="/usr/bin/cc".
-> Performing CC check.
-> Not probing for precompiled kernel interfaces.
-> Kernel source path: '/lib/modules/6.6.89-1486.ltsprev/build'
-> Kernel output path: '/lib/modules/6.6.89-1486.ltsprev/build'
-> Performing Compiler check.
-> Performing Dom0 check.
-> Performing Xen check.
-> Performing PREEMPT_RT check.
-> Performing vgpu_kvm check.
-> Cleaning kernel module build directory.
   executing: 'cd kernel-open; /usr/bin/make -k -j32  NV_EXCLUDE_KERNEL_MODULES="" SYSSRC="/lib/modules/6.6.89-1486.ltsprev/build" SYSOUT="/lib/modules/6.6.89-1486.ltsprev/build" clean'...
   rm -f -r conftest
   make[1]: Entering directory '/usr/lib/modules/6.6.89-1486.ltsprev/build'
   make[1]: Leaving directory '/usr/lib/modules/6.6.89-1486.ltsprev/build'
-> Failed to estimate output lines: /bin/sh: line 1: /lib/modules/6.6.87-1484.ltsprev/build/.config: No such file or directory
conftests:300 objects:198 modules:5
-> Building kernel modules
   executing: 'cd kernel-open; /usr/bin/make -k -j32  NV_EXCLUDE_KERNEL_MODULES="" SYSSRC="/lib/modules/6.6.89-1486.ltsprev/build" SYSOUT="/lib/modules/6.6.89-1486.ltsprev/build" '...
   make[1]: Entering directory '/usr/lib/modules/6.6.89-1486.ltsprev/build'
   warning: the compiler differs from the one used to build the kernel
     The kernel was built by: gcc (Clear Linux OS for Intel Architecture) 14.2.1 20250410 releases/gcc-14.2.0-1067-g779e002a1d
     You are using:           gcc (Clear Linux OS for Intel Architecture) 15.1.1 20250429 releases/gcc-15.1.0-15-g68a75e3c0d
   
   Warning: Compiler version check failed:
   
   The major and minor number of the compiler used to
   compile the kernel:
   
   gcc (Clear Linux OS for Intel Architecture) 14.2.1 20250410 releases/gcc-14.2.0-1067-g779e002a1d, GNU ld (GNU Binutils) 2.44.0
   
   does not match the compiler used here:
   
   gcc (Clear Linux OS for Intel Architecture) 15.1.1 20250429 releases/gcc-15.1.0-15-g68a75e3c0d
   Copyright (C) 2025 Free Software Foundation, Inc.
   This is free software; see the source for copying conditions.  There is NO
   warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
   
   
   It is recommended to set the CC environment variable
   to the compiler that was used to compile the kernel.
   
   To skip the test and silence this warning message, set
   the IGNORE_CC_MISMATCH environment variable to "1".
   However, mixing compiler versions between the kernel
   and kernel modules can result in subtle bugs that are
   difficult to diagnose.
   
   *** Failed CC version check. ***
   
     SYMLINK /tmp/selfgz89752/NVIDIA-Linux-x86_64-570.144/kernel-open/nvidia/nv-kernel.o
     SYMLINK /tmp/selfgz89752/NVIDIA-Linux-x86_64-570.144/kernel-open/nvidia-modeset/nv-modeset-kernel.o
    CONFTEST: hash__remap_4k_pfn
    CONFTEST: set_pages_uc
    CONFTEST: list_is_first
    CONFTEST: set_memory_uc
    CONFTEST: set_memory_array_uc
    CONFTEST: set_pages_array_uc
    CONFTEST: ioremap_cache
    CONFTEST: ioremap_wc
    CONFTEST: ioremap_driver_hardened
    CONFTEST: ioremap_driver_hardened_wc
    CONFTEST: ioremap_cache_shared
    CONFTEST: pci_get_domain_bus_and_slot

hksdpc255 avatar May 06 '25 09:05 hksdpc255

Here is the tail:

$ tail -n 20 /var/log/nvidia-installer.log
/tmp/selfgz89752/NVIDIA-Linux-x86_64-570.144/kernel-open/common/inc/nv-linux.h: In function 'nv_phys_to_dma':
/tmp/selfgz89752/NVIDIA-Linux-x86_64-570.144/kernel-open/common/inc/nv-linux.h:711:12: error: implicit declaration of function 'phys_to_dma'; did you mean 'nv_phys_to_dma'? [-Wimplicit-function-declaration]
  711 |     return phys_to_dma(dev, pa);
      |            ^~~~~~~~~~~
      |            nv_phys_to_dma
/tmp/selfgz89752/NVIDIA-Linux-x86_64-570.144/kernel-open/common/inc/nv-linux.h: In function 'nv_is_dma_direct':
/tmp/selfgz89752/NVIDIA-Linux-x86_64-570.144/kernel-open/common/inc/nv-linux.h:1217:9: error: implicit declaration of function 'dma_is_direct'; did you mean 'd_is_dir'? [-Wimplicit-function-declaration]
 1217 |     if (dma_is_direct(get_dma_ops(dev)))
      |         ^~~~~~~~~~~~~
      |         d_is_dir
make[3]: *** [scripts/Makefile.build:243: /tmp/selfgz89752/NVIDIA-Linux-x86_64-570.144/kernel-open/nvidia/i2c_nvswitch.o] Error 1
make[3]: Target '/tmp/selfgz89752/NVIDIA-Linux-x86_64-570.144/kernel-open/' not remade because of errors.
make[2]: *** [/usr/lib/modules/6.6.89-1486.ltsprev/build/Makefile:1924: /tmp/selfgz89752/NVIDIA-Linux-x86_64-570.144/kernel-open] Error 2
make[2]: Target 'modules' not remade because of errors.
make[1]: *** [Makefile:234: __sub-make] Error 2
make[1]: Target 'modules' not remade because of errors.
make[1]: Leaving directory '/usr/lib/modules/6.6.89-1486.ltsprev/build'
make: *** [Makefile:115: modules] Error 2
ERROR: The nvidia kernel module was not created.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
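
With a log this large, the compiler diagnostics are easier to find by filtering than by reading the head or tail. A minimal sketch, assuming GCC's usual `file:line:col: error:` diagnostic format; `first_errors` is a hypothetical helper name and not part of nvidia-installer:

```shell
#!/bin/sh
# Print the first N compiler errors from an installer log, prefixed with
# their line numbers in the log.
# Usage: first_errors LOGFILE [N]   (N defaults to 10)
# first_errors is a hypothetical helper, not an nvidia-installer feature.
first_errors() {
    log=$1
    limit=${2:-10}
    # Match GCC diagnostics of the form "file:line:col: error: message".
    grep -n -E ':[0-9]+:[0-9]+: error: ' "$log" | head -n "$limit"
}

# Example, using the log path from this report:
#   first_errors /var/log/nvidia-installer.log 5
```

The earliest errors are usually the most informative, since `make -k` keeps going and later failures tend to repeat the same missing declarations.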

hksdpc255 avatar May 06 '25 09:05 hksdpc255

We can't really do anything about an external vendor's driver being incompatible with the kernel version we ship. You may have to try a different kernel bundle from us, or may need to get a different version of the vendor's driver package that is compatible with the kernel version you've installed.

For reference, today we ship the following kernel versions:

bundle             version
kernel-native      6.14.5
kernel-ltscurrent  6.12.27
kernel-ltsprev     6.6.89

I couldn't find any information about compatible kernel versions for that driver, but if I had to guess, I'd use our kernel-ltscurrent bundle.

We follow the kernel classifications on https://www.kernel.org/ -- as you can see, our kernel-native tracks the latest "stable", kernel-ltscurrent tracks the latest "longterm", and kernel-ltsprev tracks the second-latest "longterm".
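
To see which of these bundles a running kernel belongs to, the release string can be matched on its suffix. This is only a sketch: it assumes the release ends in `.<variant>`, as in the `6.6.89-1486.ltsprev` seen in the log above; the `.native` suffix is an assumption by analogy, and `kernel_bundle` is a hypothetical helper name.

```shell
#!/bin/sh
# Map a kernel release string to the Clear Linux kernel bundle it likely
# came from. Assumption: the release ends in ".<variant>", as in
# 6.6.89-1486.ltsprev; the ".native" suffix is assumed by analogy.
kernel_bundle() {
    release=$1
    case $release in
        *.native)     echo kernel-native ;;
        *.ltscurrent) echo kernel-ltscurrent ;;
        *.ltsprev)    echo kernel-ltsprev ;;
        *)            echo unknown ;;
    esac
}

# Example: kernel_bundle "$(uname -r)"
```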

bwarden avatar May 08 '25 18:05 bwarden

Part of the issue is that the NVIDIA build scripts check whether the kernel was built with the current gcc, which kind of goes bang on any minor gcc update. That lasts until there is a kernel rebuild, which usually happens within a day.
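
The same major.minor comparison can be run by hand before starting a build. A rough sketch, assuming the kernel build directory's `.config` records the compiler banner in `CONFIG_CC_VERSION_TEXT`; the function names are hypothetical, and this is not the NVIDIA script's actual implementation:

```shell
#!/bin/sh
# Extract "major.minor" from a gcc version banner such as
# "gcc (Clear Linux OS for Intel Architecture) 14.2.1 20250410 ...".
# Both function names here are hypothetical helpers.
gcc_major_minor() {
    # Drop the parenthesised vendor text, take the first x.y.z token,
    # then keep only the first two fields.
    printf '%s\n' "$1" \
        | sed -E 's/\([^)]*\)//' \
        | grep -o -E '[0-9]+\.[0-9]+\.[0-9]+' \
        | head -n 1 \
        | cut -d. -f1,2
}

# Succeeds (exit status 0) when both banners share the same major.minor.
same_compiler() {
    [ "$(gcc_major_minor "$1")" = "$(gcc_major_minor "$2")" ]
}

# Example (assumes CONFIG_CC_VERSION_TEXT is present in the kernel .config):
#   kernel_cc=$(grep '^CONFIG_CC_VERSION_TEXT=' "/lib/modules/$(uname -r)/build/.config")
#   same_compiler "$kernel_cc" "$(gcc --version | head -n 1)" || echo "compiler mismatch"
```

The installer log notes that `IGNORE_CC_MISMATCH=1` skips the check, with the warning that mixing compiler versions can cause subtle bugs, so waiting for the kernel rebuild is the safer route.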

On Thu, May 8, 2025 at 11:31 AM Brett T. Warden @.***> wrote:

bwarden left a comment (clearlinux/distribution#3304) https://github.com/clearlinux/distribution/issues/3304#issuecomment-2863924174

fenrus75 avatar May 08 '25 18:05 fenrus75

@bwarden I went back all the way to ltsprev, but I still get the compilation errors because of the mismatch between the shipped kernel and the available GCC:

warning: the compiler differs from the one used to build the kernel
     The kernel was built by: gcc (Clear Linux OS for Intel Architecture) 14.2.1 20250410 releases/gcc-14.2.0-1067-g779e002a1d
     You are using:           gcc (Clear Linux OS for Intel Architecture) 15.1.1 20250429 releases/gcc-15.1.0-15-g68a75e3c0d
   
   Warning: Compiler version check failed:
   
   The major and minor number of the compiler used to
   compile the kernel:
   
   gcc (Clear Linux OS for Intel Architecture) 14.2.1 20250410 releases/gcc-14.2.0-1067-g779e002a1d, GNU ld (GNU Binutils) 2.44.0
   
   does not match the compiler used here:
   
   gcc (Clear Linux OS for Intel Architecture) 15.1.1 20250429 releases/gcc-15.1.0-15-g68a75e3c0d

I can't find a c-extras-gcc14 bundle I can install. Any other options?

cengique avatar May 16 '25 16:05 cengique

The linux-ltsprev rebuild with gcc 15.1 apparently didn't work; I'll try to fix that. I would go with the linux-ltscurrent kernel though, but it'll probably be another day or two before the build with gcc 15.1 is released.

bwarden avatar May 16 '25 16:05 bwarden

I downgraded to 43320, the last published version with gcc 14, and successfully built the kernel module. Then I upgraded to the latest version. DKMS did not trigger a rebuild, and the kernel module is still functional.

hksdpc255 avatar May 19 '25 02:05 hksdpc255

In release 43490, linux-ltsprev and linux-ltscurrent are now built successfully with gcc 15.1, which should resolve the gcc version conflict. I'm not sure whether you'll still have compatibility errors between the NVIDIA driver source and our kernel, though.

bwarden avatar May 20 '25 17:05 bwarden

Thank you! I got the Nvidia driver version 570.153.02 to compile with ltscurrent (org.clearlinux.ltscurrent.6.12.29-1493) on release 43500.

cengique avatar May 22 '25 23:05 cengique