
Installing issue with gdrdrv

Open kevinhuangxf opened this issue 10 months ago • 18 comments

Hi~ Thanks for the great work!

I'm installing the GDRCopy dependencies, but I encountered the following issue.

The README specifies version 2.4-4, while my installation shows version 2.5-1:

(lara) root:/workspace/code/DeepEP/gdrcopy/packages# cp -rvf /var/lib/dkms/gdrdrv/2.5/ /var/lib/dkms/gdrdrv/2.5-1
'/var/lib/dkms/gdrdrv/2.5/' -> '/var/lib/dkms/gdrdrv/2.5-1'
'/var/lib/dkms/gdrdrv/2.5/source' -> '/var/lib/dkms/gdrdrv/2.5-1/source'
'/var/lib/dkms/gdrdrv/2.5/build' -> '/var/lib/dkms/gdrdrv/2.5-1/build'
'/var/lib/dkms/gdrdrv/2.5/build/scripts' -> '/var/lib/dkms/gdrdrv/2.5-1/build/scripts'
'/var/lib/dkms/gdrdrv/2.5/build/scripts/test_gdrdrv_HAVE_PROC_OPS.sh' -> '/var/lib/dkms/gdrdrv/2.5-1/build/scripts/test_gdrdrv_HAVE_PROC_OPS.sh'
'/var/lib/dkms/gdrdrv/2.5/build/scripts/test_gdrdrv_HAVE_VM_FLAGS_SET.sh' -> '/var/lib/dkms/gdrdrv/2.5-1/build/scripts/test_gdrdrv_HAVE_VM_FLAGS_SET.sh'
'/var/lib/dkms/gdrdrv/2.5/build/Makefile' -> '/var/lib/dkms/gdrdrv/2.5-1/build/Makefile'
'/var/lib/dkms/gdrdrv/2.5/build/dkms.conf' -> '/var/lib/dkms/gdrdrv/2.5-1/build/dkms.conf'
'/var/lib/dkms/gdrdrv/2.5/build/gdrdrv.c' -> '/var/lib/dkms/gdrdrv/2.5-1/build/gdrdrv.c'
'/var/lib/dkms/gdrdrv/2.5/build/gdrdrv.h' -> '/var/lib/dkms/gdrdrv/2.5-1/build/gdrdrv.h'
'/var/lib/dkms/gdrdrv/2.5/build/nv-p2p-dummy.c' -> '/var/lib/dkms/gdrdrv/2.5-1/build/nv-p2p-dummy.c'
'/var/lib/dkms/gdrdrv/2.5/build/make.log' -> '/var/lib/dkms/gdrdrv/2.5-1/build/make.log'
(lara) root:/workspace/code/DeepEP/gdrcopy/packages# 
(lara) root:/workspace/code/DeepEP/gdrcopy/packages# 
(lara) root:/workspace/code/DeepEP/gdrcopy/packages# dpkg -i gdrdrv-dkms_2.5-1_amd64.Ubuntu20_04.deb 
(Reading database ... 91997 files and directories currently installed.)
Preparing to unpack gdrdrv-dkms_2.5-1_amd64.Ubuntu20_04.deb ...

------------------------------
Deleting module version: 2.5
completely from the DKMS tree.
------------------------------
Done.
invoke-rc.d: could not determine current runlevel
invoke-rc.d: policy-rc.d denied execution of stop.
Unpacking gdrdrv-dkms:amd64 (2.5-1) over (2.5-1) ...
Setting up gdrdrv-dkms:amd64 (2.5-1) ...
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76.)
debconf: falling back to frontend: Readline
Loading new gdrdrv-2.5 DKMS files...
It is likely that 5.15.0-88-generic belongs to a chroot's host
Building for 5.15.0-88-generic and 5.4.0-208-generic
Building for architecture x86_64
Building initial module for 5.15.0-88-generic
Error! Bad return status for module build on kernel: 5.15.0-88-generic (x86_64)
Consult /var/lib/dkms/gdrdrv/2.5/build/make.log for more information.
dpkg: error processing package gdrdrv-dkms:amd64 (--install):
 installed gdrdrv-dkms:amd64 package post-installation script subprocess returned error exit status 10
Errors were encountered while processing:
 gdrdrv-dkms:amd64

The make.log file is:

DKMS make.log for gdrdrv-2.5 for kernel 5.15.0-88-generic (x86_64)
Mon Feb 24 23:47:28 PST 2025
grep: NVIDIA_DRIVER_MISSING/: No such file or directory
grep: NVIDIA_DRIVER_MISSING/: No such file or directory
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=NVIDIA_DRIVER_MISSING. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
Setting NVIDIA_IS_OPENSOURCE=
Setting HAVE_VM_FLAGS_SET=n
Setting HAVE_PROC_OPS=y
make[1]: Entering directory '/usr/src/linux-headers-5.15.0-88-generic'
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
  You are using:           gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  CC [M]  /var/lib/dkms/gdrdrv/2.5-1/build/nv-p2p-dummy.o
/var/lib/dkms/gdrdrv/2.5-1/build/nv-p2p-dummy.c:48:10: fatal error: nv-p2p.h: No such file or directory
   48 | #include "nv-p2p.h"
      |          ^~~~~~~~~~
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /var/lib/dkms/gdrdrv/2.5-1/build/nv-p2p-dummy.o] Error 1
make[1]: *** [Makefile:1909: /var/lib/dkms/gdrdrv/2.5-1/build] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-5.15.0-88-generic'
make: *** [Makefile:65: build] Error 2

However, I cannot find nv-p2p.h in the build directory:

(lara) root:/var/lib/dkms/gdrdrv/2.5-1/build# ls
Makefile  dkms.conf  gdrdrv.c  gdrdrv.h  make.log  nv-p2p-dummy.c  scripts

Any clue for solving this issue?

Thanks.
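
The make.log above points at a likely root cause: the gdrdrv build printed NVIDIA_SRC_DIR=NVIDIA_DRIVER_MISSING, which suggests it could not locate the NVIDIA driver sources that provide nv-p2p.h. A minimal sketch for spotting that marker in a log follows. The helper name `driver_sources_missing` is hypothetical, and the default log path is simply the one DKMS reported in this post.

```shell
#!/bin/sh
# Sketch: report whether a DKMS make.log shows that the gdrdrv build
# could not locate the NVIDIA driver sources.
# driver_sources_missing is a hypothetical helper name.
driver_sources_missing() {
    log="$1"
    # The marker string is copied verbatim from the make.log in this thread.
    grep -q 'NVIDIA_SRC_DIR=NVIDIA_DRIVER_MISSING' "$log" 2>/dev/null
}

# Default path as printed by DKMS above; adjust the version to your install.
LOG="${1:-/var/lib/dkms/gdrdrv/2.5/build/make.log}"
if driver_sources_missing "$LOG"; then
    echo "NVIDIA driver sources not found; nv-p2p.h will be missing"
else
    echo "no NVIDIA_DRIVER_MISSING marker in $LOG"
fi
```

If the marker is present, copying files inside /var/lib/dkms is unlikely to help, because the header belongs to the driver sources rather than to gdrdrv.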

kevinhuangxf avatar Feb 25 '25 07:02 kevinhuangxf

Looking forward to answers

Jackhally520 avatar Feb 25 '25 08:02 Jackhally520

Looking forward to answers

daneren avatar Feb 25 '25 09:02 daneren

I regret the oversight regarding the upstream update of GDRCopy. I will now specify the version of GDRCopy for clarity.

For example, with Ubuntu 22.04 and CUDA 12.3, the installation steps are as follows:

# Build and installation
wget https://github.com/NVIDIA/gdrcopy/archive/refs/tags/v2.4.4.tar.gz
tar -xzf v2.4.4.tar.gz  # extract the archive before entering the source tree
cd gdrcopy-2.4.4/
make -j$(nproc)
sudo make prefix=/opt/gdrcopy install

# Kernel module installation
cd packages
CUDA=/path/to/cuda ./build-deb-packages.sh
sudo dpkg -i gdrdrv-dkms_2.4.4_amd64.Ubuntu22_04.deb \
             libgdrapi_2.4.4_amd64.Ubuntu22_04.deb \
             gdrcopy-tests_2.4.4_amd64.Ubuntu22_04+cuda12.3.deb \
             gdrcopy_2.4.4_amd64.Ubuntu22_04.deb
sudo ./insmod.sh  # Load kernel modules on bare-metal system

haswelliris avatar Feb 25 '25 09:02 haswelliris

Should "sudo ./insmod.sh" be "sudo ../insmod.sh"? @haswelliris

Baibaifan avatar Feb 25 '25 10:02 Baibaifan

Hi, the instructions ask us to install the packages in the Docker container without rebuilding the modules. However, in my case the host is a CentOS machine while the container runs Ubuntu. So even though I have successfully installed gdrdrv on the host, I do not have pre-built packages (gdrcopy, libgdrapi, and gdrcopy-tests) for the Ubuntu environment. How can I install these packages inside the container?

xibosun avatar Feb 25 '25 10:02 xibosun

It is advisable to use the same distribution across the environment.

If this condition proves difficult to achieve, consider utilizing a privileged container. In this scenario, mount the host directory /usr/src/nvidia-${NVIDIA_DRIVER_VERSION} into the container, as these driver headers are necessary for a successful build. Following this, proceed with the corresponding compilation and installation.
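
The privileged-container suggestion above can be sketched roughly as follows. This is only an illustration under assumptions: `driver_mount_args` is a hypothetical helper, the image and command are placeholders, and the driver version is read with nvidia-smi when it is available on the host.

```shell
#!/bin/sh
# Sketch: build the docker flags described above for a given driver version.
# driver_mount_args is a hypothetical helper, not part of Docker or GDRCopy.
driver_mount_args() {
    src="/usr/src/nvidia-$1"
    # --privileged and -v are standard `docker run` options.
    printf '%s\n' "--privileged -v ${src}:${src}"
}

# Ask the driver for its version when nvidia-smi is available on the host.
if command -v nvidia-smi >/dev/null 2>&1; then
    NVIDIA_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    # <image> and <command> are placeholders to fill in.
    echo "docker run $(driver_mount_args "$NVIDIA_DRIVER_VERSION") <image> <command>"
fi
```

The script only prints the command so it can be reviewed before running. It assumes the host actually has the driver sources under /usr/src, which may not hold for every installation method.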

haswelliris avatar Feb 25 '25 11:02 haswelliris

Thanks a lot!

xibosun avatar Feb 25 '25 11:02 xibosun

I encountered the same situation,

**env:** Ubuntu 22.04, CUDA 12.4

**command:** dpkg -i gdrdrv-dkms_2.4.4_amd64.Ubuntu22_04.deb libgdrapi_2.4.4_amd64.Ubuntu22_04.deb gdrcopy-tests_2.4.4_amd64.Ubuntu22_04+cuda12.4.deb gdrcopy_2.4.4_amd64.Ubuntu22_04.deb

**output:**
(Reading database ... 81887 files and directories currently installed.)
Preparing to unpack gdrdrv-dkms_2.4.4_amd64.Ubuntu22_04.deb ...
Deleting module gdrdrv-2.4.4 completely from the DKMS tree.
Unpacking gdrdrv-dkms:amd64 (2.4.4) over (2.4.4) ...
Preparing to unpack libgdrapi_2.4.4_amd64.Ubuntu22_04.deb ...
Unpacking libgdrapi:amd64 (2.4.4) over (2.4.4) ...
Preparing to unpack gdrcopy-tests_2.4.4_amd64.Ubuntu22_04+cuda12.4.deb ...
Unpacking gdrcopy-tests:amd64 (2.4.4) over (2.4.4) ...
Preparing to unpack gdrcopy_2.4.4_amd64.Ubuntu22_04.deb ...
Unpacking gdrcopy:amd64 (2.4.4) over (2.4.4) ...
Setting up gdrdrv-dkms:amd64 (2.4.4) ...
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78.)
debconf: falling back to frontend: Readline
Loading new gdrdrv-2.4.4 DKMS files...
It is likely that 5.15.0-25-generic belongs to a chroot's host
Building for 5.15.0-134-generic
Building for architecture x86_64
Building initial module for 5.15.0-134-generic
Error! Bad return status for module build on kernel: 5.15.0-134-generic (x86_64)
Consult /var/lib/dkms/gdrdrv/2.4.4/build/make.log for more information.
dpkg: error processing package gdrdrv-dkms:amd64 (--install):
 installed gdrdrv-dkms:amd64 package post-installation script subprocess returned error exit status 10
Setting up libgdrapi:amd64 (2.4.4) ...
Setting up gdrcopy-tests:amd64 (2.4.4) ...
dpkg: dependency problems prevent configuration of gdrcopy:amd64:
 gdrcopy:amd64 depends on gdrdrv-dkms (= 2.4.4); however:
  Package gdrdrv-dkms:amd64 is not configured yet.

dpkg: error processing package gdrcopy:amd64 (--install):
 dependency problems - leaving unconfigured

**log output**: 
root packages $ cat /var/lib/dkms/gdrdrv/2.4.4/build/make.log
DKMS make.log for gdrdrv-2.4.4 for kernel 5.15.0-134-generic (x86_64)
Wed Mar 12 17:24:35 CST 2025
grep: NVIDIA_DRIVER_MISSING/: No such file or directory
grep: NVIDIA_DRIVER_MISSING/: No such file or directory
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=NVIDIA_DRIVER_MISSING. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
Setting NVIDIA_IS_OPENSOURCE=
Setting HAVE_VM_FLAGS_SET=n
make[1]: Entering directory '/usr/src/linux-headers-5.15.0-134-generic'
  CC [M]  /var/lib/dkms/gdrdrv/2.4.4/build/nv-p2p-dummy.o
/var/lib/dkms/gdrdrv/2.4.4/build/nv-p2p-dummy.c:48:10: fatal error: nv-p2p.h: No such file or directory
   48 | #include "nv-p2p.h"
      |          ^~~~~~~~~~
compilation terminated.
make[2]: *** [scripts/Makefile.build:297: /var/lib/dkms/gdrdrv/2.4.4/build/nv-p2p-dummy.o] Error 1
make[1]: *** [Makefile:1910: /var/lib/dkms/gdrdrv/2.4.4/build] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-5.15.0-134-generic'
make: *** [Makefile:59: build] Error 2

**directories in /usr/src**:
root packages $ ls /usr/src/
gdrdrv-2.4.4/                     linux-headers-5.15.0-134/         linux-headers-5.15.0-134-generic/ python3.10/                       tensorrt/   

There is no nvidia-* directory under /usr/src. Does anyone know how to solve this? Thanks.
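
A quick way to confirm whether the header exists anywhere under /usr/src is sketched below. `find_nv_p2p` is a hypothetical helper name. If a copy turns up, its directory is a candidate value for the NVIDIA_SRC_DIR variable that appears in the make.log, though whether the build accepts it is an assumption to verify.

```shell
#!/bin/sh
# Sketch: look for nv-p2p.h under a directory tree (default: /usr/src).
# find_nv_p2p is a hypothetical helper name.
find_nv_p2p() {
    root="${1:-/usr/src}"
    # Print only the first match, if any.
    find "$root" -name 'nv-p2p.h' 2>/dev/null | head -n 1
}

header=$(find_nv_p2p /usr/src)
if [ -n "$header" ]; then
    echo "found: $header"
    echo "candidate NVIDIA_SRC_DIR: $(dirname "$header")"
else
    echo "nv-p2p.h not found under /usr/src; the driver sources are not available here"
fi
```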

Sun1Plus avatar Mar 12 '25 09:03 Sun1Plus

I encountered the same problem. In my case it happened because I ran the command inside a container (a k8s pod). I solved it by installing gdrdrv on the real host rather than in the container.

Hope this helps~
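
For anyone unsure whether their shell is running inside a container, a heuristic check might look like the sketch below. `in_container` is a hypothetical helper, and both checks are heuristics that can miss some runtimes (for example on cgroup v2 hosts), so treat a negative result with caution.

```shell
#!/bin/sh
# Sketch: heuristic check for "am I inside a container?".
# in_container is a hypothetical helper; the two file arguments exist only
# so that the defaults can be overridden.
in_container() {
    cgroup_file="${1:-/proc/1/cgroup}"
    marker_file="${2:-/.dockerenv}"
    # Docker creates /.dockerenv in the container's root filesystem.
    [ -e "$marker_file" ] && return 0
    # On cgroup v1, the cgroup path of PID 1 often names the runtime.
    grep -Eq 'docker|kubepods|containerd' "$cgroup_file" 2>/dev/null
}

if in_container; then
    echo "looks like a container: build and load gdrdrv on the host instead"
else
    echo "no container markers found"
fi
```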

Sun1Plus avatar Mar 13 '25 06:03 Sun1Plus

Hi, could you tell me how to solve this problem, please?

I also use CentOS with an Ubuntu Docker container.

Is it enough to just use "--privileged=true -v /usr/src/nvidia-${NVIDIA_DRIVER_VERSION}:/usr/src/nvidia-${NVIDIA_DRIVER_VERSION}"?

BestDreamy avatar Mar 21 '25 09:03 BestDreamy

Compiling GDRCopy requires the NVIDIA driver. Try compiling it on your compute node (this was the cause in my case).

galeselee avatar Apr 06 '25 06:04 galeselee

$ sudo ./insmod.sh
insmod: ERROR: could not insert module src/gdrdrv/gdrdrv.ko: Operation not permitted
INFO: driver major is
INFO: creating /dev/gdrdrv inode
mknod: missing operand after '0'
Try 'mknod --help' for more information.
chmod: cannot access '/dev/gdrdrv': No such file or directory

How can I solve this?

MarsMeng1994 avatar Apr 23 '25 10:04 MarsMeng1994

@MarsMeng1994, I encountered the same error. The .ko files can be found in "/lib/modules/5.15.0-138-generic/updates/dkms/" after you install gdrdrv.

I changed line 28 of insmod.sh to point gdrdrv.ko at that path, and then it ran fine:

sudo /sbin/insmod /lib/modules/5.4.0-162-generic/updates/dkms/gdrdrv.ko dbg_enabled=0 info_enabled=0 use_persistent_mapping=0
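
Since the kernel release in that path differs from machine to machine, one option is to derive it from `uname -r` instead of hard-coding it. A minimal sketch, where `dkms_module_path` is a hypothetical helper and the directory layout is the one shown in the dpkg output in this thread:

```shell
#!/bin/sh
# Sketch: derive the DKMS install path of gdrdrv.ko for a kernel release.
# dkms_module_path is a hypothetical helper name.
dkms_module_path() {
    kernel="${1:-$(uname -r)}"
    printf '%s\n' "/lib/modules/${kernel}/updates/dkms/gdrdrv.ko"
}

ko=$(dkms_module_path)
if [ -f "$ko" ]; then
    echo "module found: $ko"
    # Load it with the same parameters insmod.sh uses (requires root):
    # sudo /sbin/insmod "$ko" dbg_enabled=0 info_enabled=0 use_persistent_mapping=0
else
    echo "no gdrdrv.ko at $ko; did the DKMS build for $(uname -r) succeed?"
fi
```

The insmod line is left commented out so the script can be run safely as a check first.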

Sun1Plus avatar Apr 23 '25 11:04 Sun1Plus

Thanks. I tried it, but it did not work. Could it be a permission error?

MarsMeng1994 avatar Apr 24 '25 03:04 MarsMeng1994

Could you show the complete output of "dpkg -i gdrdrv-xxx" and "./insmod.sh"?

Sun1Plus avatar Apr 24 '25 04:04 Sun1Plus

$ sudo NVIDIA_SRC_DIR=/sharedata/msm/workspace/DeepEP/third-party/NVIDIA-Linux-x86_64-550.144.03/kernel/nvidia/ dpkg -i gdrdrv-dkms_2.4.4_amd64.Ubuntu22_04.deb

(Reading database ... 61771 files and directories currently installed.)
Preparing to unpack gdrdrv-dkms_2.4.4_amd64.Ubuntu22_04.deb ...
Module gdrdrv-2.4.4 for kernel 5.15.0-124-generic (x86_64).
Before uninstall, this module version was ACTIVE on this kernel.

gdrdrv.ko:
 - Uninstallation
   - Deleting from: /lib/modules/5.15.0-124-generic/updates/dkms/
 - Original module
   - No original module was found for this module on this kernel.
   - Use the dkms install command to reinstall any previous module version.

depmod...
Deleting module gdrdrv-2.4.4 completely from the DKMS tree.
Unpacking gdrdrv-dkms:amd64 (2.4.4) over (2.4.4) ...
Setting up gdrdrv-dkms:amd64 (2.4.4) ...
Loading new gdrdrv-2.4.4 DKMS files...
Building for 5.15.0-124-generic
Building for architecture x86_64
Building initial module for 5.15.0-124-generic
Done.

gdrdrv.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.15.0-124-generic/updates/dkms/

depmod...
invoke-rc.d: could not determine current runlevel
invoke-rc.d: policy-rc.d denied execution of restart.

$ sudo ./insmod.sh

insmod: ERROR: could not insert module /lib/modules/5.15.0-124-generic/updates/dkms/gdrdrv.ko: Operation not permitted
INFO: driver major is
INFO: creating /dev/gdrdrv inode
mknod: missing operand after '0'
Try 'mknod --help' for more information.
chmod: cannot access '/dev/gdrdrv': No such file or directory

$ cat ./insmod.sh

THIS_DIR=$(dirname $0)

grep gdrdrv /proc/devices >/dev/null && sudo /sbin/rmmod gdrdrv

sudo /sbin/insmod /lib/modules/5.15.0-124-generic/updates/dkms/gdrdrv.ko dbg_enabled=0 info_enabled=0 use_persistent_mapping=0

major=$(fgrep gdrdrv /proc/devices | cut -b 1-4)
echo "INFO: driver major is $major"

if [ -e /dev/gdrdrv ]; then
    sudo rm /dev/gdrdrv
fi

echo "INFO: creating /dev/gdrdrv inode"
sudo mknod /dev/gdrdrv c $major 0
sudo chmod a+w+r /dev/gdrdrv
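
The mknod and chmod errors above appear to be follow-on failures: when insmod fails, gdrdrv never shows up in /proc/devices, so $major is empty. A small guard along these lines would stop at the real error. This is a sketch rather than a patch to the upstream script, and `get_major` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read a driver's major number from a /proc/devices-style file
# and refuse to create the device node when the driver is not loaded.
# get_major is a hypothetical helper name.
get_major() {
    name="$1"
    devices="${2:-/proc/devices}"
    # Lines look like "237 gdrdrv": field 1 is the major, field 2 the name.
    awk -v n="$name" '$2 == n { print $1; exit }' "$devices"
}

major=$(get_major gdrdrv)
if [ -z "$major" ]; then
    echo "gdrdrv is not loaded; fix the insmod error before creating /dev/gdrdrv" >&2
else
    echo "INFO: driver major is $major"
    # sudo mknod /dev/gdrdrv c "$major" 0 && sudo chmod a+w+r /dev/gdrdrv
fi
```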

MarsMeng1994 avatar Apr 24 '25 05:04 MarsMeng1994

The following is my output after running "dpkg -i gdrdrv-dkms_2.4.4_amd64.Ubuntu22_04.deb":

Selecting previously unselected package gdrdrv-dkms:amd64.
(Reading database ... 93867 files and directories currently installed.)
Preparing to unpack gdrdrv-dkms_2.4.4_amd64.Ubuntu22_04.deb ...
Unpacking gdrdrv-dkms:amd64 (2.4.4) ...
Setting up gdrdrv-dkms:amd64 (2.4.4) ...
Loading new gdrdrv-2.4.4 DKMS files...
Building for 5.4.0-162-generic
Building for architecture x86_64
Building initial module for 5.4.0-162-generic
This system doesn't support Secure Boot
Secure Boot not enabled on this system.
Done.

gdrdrv.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.4.0-162-generic/updates/dkms/

depmod....

DKMS: install completed.
Processing triggers for systemd (245.4-4ubuntu3.22) ...

And your output is:

depmod...
invoke-rc.d: could not determine current runlevel
invoke-rc.d: policy-rc.d denied execution of restart.

It seems gdrdrv was not installed successfully in your environment, but I have not encountered this myself and do not know how to solve it.

Sun1Plus avatar Apr 24 '25 06:04 Sun1Plus

I asked our company's operations engineer, who said the problem was that privileged mode was not turned on. He enabled it and the error was gone. Professionals do professional things.
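
For others hitting "Operation not permitted" from insmod: loading a kernel module requires the CAP_SYS_MODULE capability (bit 16 of the capability mask), and a process without it gets this error even when it runs as root. The sketch below checks the effective capability set of the current process. `has_cap_sys_module` is a hypothetical helper, and the check only covers capabilities, so other restrictions could still block insmod.

```shell
#!/bin/sh
# Sketch: test whether a capability mask includes CAP_SYS_MODULE (bit 16),
# which the kernel requires for loading modules.
# has_cap_sys_module is a hypothetical helper name.
has_cap_sys_module() {
    mask="$1"   # hex string as printed in the CapEff line of /proc/<pid>/status
    [ $(( (0x$mask >> 16) & 1 )) -eq 1 ]
}

# CapEff holds the effective capability set of the reading process.
cap_eff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status 2>/dev/null)
if [ -n "$cap_eff" ] && has_cap_sys_module "$cap_eff"; then
    echo "CAP_SYS_MODULE present: capabilities allow insmod"
else
    echo "CAP_SYS_MODULE not detected: insmod is likely to fail with 'Operation not permitted'"
fi
```

Run it as root (or through sudo) in the same environment where insmod.sh is executed, since an unprivileged user has an empty effective set regardless of the container settings.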

MarsMeng1994 avatar Apr 24 '25 06:04 MarsMeng1994

After https://github.com/deepseek-ai/DeepEP/pull/201, DeepEP can be built without depending on gdrcopy.

sphish avatar Jun 25 '25 02:06 sphish