
Cannot pass through RTX 3090 into pod; Failed to initialize NVML: could not load NVML library.

Open davidho27941 opened this issue 3 years ago • 17 comments

1. Issue or feature description

Cannot pass an RTX 3090 GPU through to a pod with k8s-device-plugin (both the plain Kubernetes manifest and the Helm deployment fail).

2. Steps to reproduce the issue

My kubeadm version: 1.21.1
My kubectl version: 1.21.1
My kubelet version: 1.21.1
My CRI-O version: 1.21:1.21.1

I was trying to create a cluster using the CRI-O container runtime and the Flannel CNI.

My command to initialize the cluster: sudo kubeadm init --cri-socket /var/run/crio/crio.sock --pod-network-cidr 10.244.0.0/16

Adding Flannel: kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

Adding the NVIDIA k8s-device-plugin: kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml

The logs reported by the nvidia-device-plugin-daemonset-llthp pod are shown below:

2021/08/30 06:04:38 Loading NVML
2021/08/30 06:04:38 Failed to initialize NVML: could not load NVML library.
2021/08/30 06:04:38 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/08/30 06:04:38 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/08/30 06:04:38 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/08/30 06:04:38 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes

When I try to create a pod using the following YAML:

apiVersion: v1
kind: Pod
metadata:
  name: torch
  labels:
    app: torch
spec:
  containers:
  - name: torch
    image: nvcr.io/nvidia/pytorch:21.03-py3
    #command: [ "/bin/bash", "-c", "--" ]
    #args: [ "while true; do sleep 30; done;" ]
    ports:
      - containerPort: 8888
        protocol: TCP
    resources:
      requests:
        nvidia.com/gpu: 1
        memory: "64Mi"
        cpu: "250m"
      limits:
        nvidia.com/gpu: 1
        memory: "128Mi"
        cpu: "500m"

Kubernetes fails to schedule the pod because no GPU resource is available:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  15s (x3 over 92s)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
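
This suggests the node is not advertising any nvidia.com/gpu capacity. A quick check along these lines (a sketch; the node name srv1 is taken from the logs later in this report) should show whether the resource appears under Capacity:

kubectl describe node srv1 | grep -A 8 Capacity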

However, Docker works without error when I run:

 docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0

Output:

2021/08/30 10:38:09 Loading NVML
2021/08/30 10:38:09 Starting FS watcher.
2021/08/30 10:38:09 Starting OS watcher.
2021/08/30 10:38:09 Retreiving plugins.
2021/08/30 10:38:09 Starting GRPC server for 'nvidia.com/gpu'
2021/08/30 10:38:09 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/08/30 10:38:09 Registered device plugin for 'nvidia.com/gpu' with Kubelet

It seems Docker can pass the GPU through successfully, but Kubernetes cannot.

Can anybody help me figure out the problem?

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • [x] The output of nvidia-smi -a on your host
==============NVSMI LOG==============

Timestamp                                 : Mon Aug 30 18:22:17 2021
Driver Version                            : 460.73.01
CUDA Version                              : 11.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : GeForce RTX 3090
    Product Brand                         : GeForce
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-948211b6-df7a-5768-ca7b-a84e23d9404d
    Minor Number                          : 0
    VBIOS Version                         : 94.02.26.08.1C
    MultiGPU Board                        : No
    Board ID                              : 0x100
    GPU Part Number                       : N/A
    Inforom Version
        Image Version                     : G001.0000.03.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x220410DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x403B1458
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 1000 KB/s
        Rx Throughput                     : 1000 KB/s
    Fan Speed                             : 41 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 24265 MiB
        Used                              : 1256 MiB
        Free                              : 23009 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 14 MiB
        Free                              : 242 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 1 %
        Memory                            : 10 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 48 C
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 34.64 W
        Power Limit                       : 350.00 W
        Default Power Limit               : 350.00 W
        Enforced Power Limit              : 350.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 350.00 W
    Clocks
        Graphics                          : 270 MHz
        SM                                : 270 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 9751 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2692
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 73 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 3028
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 160 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 5521
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 624 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 5654
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 84 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 8351
            Type                          : G
            Name                          : /usr/share/skypeforlinux/skypeforlinux --type=gpu-process --field-trial-handle=2437345894369599647,6238031376657225521,131072 --enable-features=WebComponentsV0Enabled --disable-features=CookiesWithoutSameSiteMustBeSecure,SameSiteByDefaultCookies,SpareRendererForSitePerProcess --enable-crash-reporter=97d5b09d-f9b0-4336-bc9a-fe11870fe1b3,no_channel --global-crash-keys=97d5b09d-f9b0-4336-bc9a-fe11870fe1b3,no_channel,_companyName=Skype,_productName=skypeforlinux,_version=8.73.0.92 --gpu-preferences=OAAAAAAAAAAgAAAQAAAAAAAAAAAAAAAAAABgAAAAAAAYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAA== --shared-files
            Used GPU Memory               : 14 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 8560
            Type                          : G
            Name                          : /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=10043073040938675921,16429150098372267894,131072 --enable-crashpad --crashpad-handler-pid=8526 --enable-crash-reporter=a844a16f-8f0f-4770-87e1-a8389ca3c415, --gpu-preferences=UAAAAAAAAAAgAAAQAAAAAAAAAAAAAAAAAABgAAAAAAAwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQAAABgAAAAAAAAAGAAAAAAAAAAIAAAAAAAAAAgAAAAAAAAACAAAAAAAAAA= --shared-files
            Used GPU Memory               : 91 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 8582
            Type                          : G
            Name                          : /usr/lib/firefox/firefox
            Used GPU Memory               : 178 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 9139
            Type                          : G
            Name                          : /usr/lib/firefox/firefox
            Used GPU Memory               : 4 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 9931
            Type                          : G
            Name                          : gnome-control-center
            Used GPU Memory               : 4 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 11503
            Type                          : G
            Name                          : /usr/lib/firefox/firefox
            Used GPU Memory               : 4 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 64276
            Type                          : G
            Name                          : /usr/lib/firefox/firefox
            Used GPU Memory               : 4 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 78463
            Type                          : G
            Name                          : /usr/lib/firefox/firefox
            Used GPU Memory               : 4 MiB
  • [x] Your docker configuration file (e.g: /etc/docker/daemon.json)
{
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m"
    },
    "storage-driver": "overlay2",
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
  • [x] The k8s-device-plugin container logs
2021/08/30 06:04:38 Loading NVML
2021/08/30 06:04:38 Failed to initialize NVML: could not load NVML library.
2021/08/30 06:04:38 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/08/30 06:04:38 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/08/30 06:04:38 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/08/30 06:04:38 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
  • [x] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
 八  30 14:12:23 srv1 kubelet[108111]: I0830 14:12:23.643580  108111 eviction_manager.go:346] "Eviction manager: able to reduce resource pressure without evicting pods." resourceName="ephemeral-storage"
 八  30 14:12:23 srv1 kubelet[108111]: I0830 14:12:23.457677  108111 eviction_manager.go:425] "Eviction manager: unexpected error when attempting to reduce resource pressure" resourceName="ephemeral-storage" err="wanted to free 9223372036854775807 bytes, but freed 14575560277 bytes space with errors in image deletion: [rpc error: code = U
 八  30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.404808  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="ad8c213c76c5990969673d7a22ed6bce9d13e6cdd613fefd2db967a03e1cd816" size=14575560277
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404791  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 864db3a48c0a2753840a7f994873c2c5af696d6765aeb229b49e455ea5e98c4c: image is in use by a container" image="296a6d5035e2d6919249e02709a488d680ddca91357602bd65e605eac967b89
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404762  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 864db3a48c0a2753840a7f994873c2c5af696d6765aeb229b49e455ea5e98c4c: image is in use by a container" image="296a6d5035e2d6919249e02709a488d680ddca91357602bd65e60
 八  30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.404494  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="296a6d5035e2d6919249e02709a488d680ddca91357602bd65e605eac967b899" size=42585056
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404479  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by aa72c61c4181efcc0f55c70f42078481cc0af69654343aa98edd6bfac63290ba: image is in use by a container" image="8522d622299ca431311ac69992419c956fbaca6fa8289c76810c9399d17c69d
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404467  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by aa72c61c4181efcc0f55c70f42078481cc0af69654343aa98edd6bfac63290ba: image is in use by a container" image="8522d622299ca431311ac69992419c956fbaca6fa8289c76810c9
 八  30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.404230  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="8522d622299ca431311ac69992419c956fbaca6fa8289c76810c9399d17c69de" size=68899837
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404212  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by b53c6818067e6b95f5e4689d991f86524bb4e47baec455a0211168b321e1af1b: image is in use by a container" image="37b8c3899b153afc2c7e65e1939330654276560b8b5f6dffdfd466bd8b4f7ef
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.404187  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by b53c6818067e6b95f5e4689d991f86524bb4e47baec455a0211168b321e1af1b: image is in use by a container" image="37b8c3899b153afc2c7e65e1939330654276560b8b5f6dffdfd46
 八  30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403939  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="37b8c3899b153afc2c7e65e1939330654276560b8b5f6dffdfd466bd8b4f7ef8" size=195847465
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403932  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 83c0ea8f464dc205726d29d407f564b5115e9b80bd65bac2f087463d80ff95ed: image is in use by a container" image="2c25d0f89db7a9dba5ed71b692b65e86b0ad9fcab1a9f94e946c05db18776ab
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403920  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 83c0ea8f464dc205726d29d407f564b5115e9b80bd65bac2f087463d80ff95ed: image is in use by a container" image="2c25d0f89db7a9dba5ed71b692b65e86b0ad9fcab1a9f94e946c0
 八  30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403680  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="2c25d0f89db7a9dba5ed71b692b65e86b0ad9fcab1a9f94e946c05db18776ab3" size=121095258
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403673  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by ca4d555dce70b78abd85986745371d98c2028590ae058e2320ce457f5fec0b30: image is in use by a container" image="0369cf4303ffdb467dc219990960a9baa8512a54b0ad9283eaf55bd6c0adb93
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403663  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by ca4d555dce70b78abd85986745371d98c2028590ae058e2320ce457f5fec0b30: image is in use by a container" image="0369cf4303ffdb467dc219990960a9baa8512a54b0ad9283eaf55
 八  30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403428  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="0369cf4303ffdb467dc219990960a9baa8512a54b0ad9283eaf55bd6c0adb934" size=254662613
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403422  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by d4a7103f1e4829474bab231668d0377b97fc222e2a4b4332a669e912b863175a: image is in use by a container" image="993d3ec13feb2e7b7e9bd6ac4831fb0cdae7329a8e8f1e285d9f2790004b2fe
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403412  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by d4a7103f1e4829474bab231668d0377b97fc222e2a4b4332a669e912b863175a: image is in use by a container" image="993d3ec13feb2e7b7e9bd6ac4831fb0cdae7329a8e8f1e285d9f2
 八  30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.403187  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="993d3ec13feb2e7b7e9bd6ac4831fb0cdae7329a8e8f1e285d9f2790004b2fe3" size=51893338
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403180  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 40de580961ae274afef6eb2737f313bc8637ac21fc42fa53863a97523c07c831: image is in use by a container" image="cef7457710b1ace64357066aea33117083dfec9a023cade594cc16c7a81d936
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.403171  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 40de580961ae274afef6eb2737f313bc8637ac21fc42fa53863a97523c07c831: image is in use by a container" image="cef7457710b1ace64357066aea33117083dfec9a023cade594cc1
 八  30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.402907  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="cef7457710b1ace64357066aea33117083dfec9a023cade594cc16c7a81d936b" size=126883060
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402897  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 63d4a6aaa8f530cb3e33f02af9262d2ffd20f076b5803bc1ea1f03fc29f9ebf3: image is in use by a container" image="ef4bce0a7569b4fa83a559717c608c076a2c9d30361eb059ea4e1b7a55424d6
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402886  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 63d4a6aaa8f530cb3e33f02af9262d2ffd20f076b5803bc1ea1f03fc29f9ebf3: image is in use by a container" image="ef4bce0a7569b4fa83a559717c608c076a2c9d30361eb059ea4e1
 八  30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.402498  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="ef4bce0a7569b4fa83a559717c608c076a2c9d30361eb059ea4e1b7a55424d68" size=105130216
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402486  108111 kuberuntime_image.go:122] "Failed to remove image" err="rpc error: code = Unknown desc = Image used by 9270341c09e80de42955681f04bb0baaac9f931e7e4eb6aa400a7419337e107b: image is in use by a container" image="ed210e3e4a5bae1237f1bb44d72a05a2f1e5c6bfe7a7e73da179e2534269c45
 八  30 14:12:20 srv1 kubelet[108111]: E0830 14:12:20.402467  108111 remote_image.go:136] "RemoveImage from image service failed" err="rpc error: code = Unknown desc = Image used by 9270341c09e80de42955681f04bb0baaac9f931e7e4eb6aa400a7419337e107b: image is in use by a container" image="ed210e3e4a5bae1237f1bb44d72a05a2f1e5c6bfe7a7e73da179e
 八  30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.402130  108111 image_gc_manager.go:375] "Removing image to free bytes" imageID="ed210e3e4a5bae1237f1bb44d72a05a2f1e5c6bfe7a7e73da179e2534269c459" size=689969
 八  30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.400313  108111 image_gc_manager.go:321] "Attempting to delete unused images"
 八  30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.398657  108111 container_gc.go:85] "Attempting to delete unused containers"
 八  30 14:12:20 srv1 kubelet[108111]: I0830 14:12:20.398622  108111 eviction_manager.go:339] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
 八  30 14:12:10 srv1 kubelet[108111]: I0830 14:12:10.205926  108111 eviction_manager.go:391] "Eviction manager: unable to evict any pods from the node"

Additional information that might help better understand your environment and reproduce the bug:

  • [x] Docker version from docker version
 Client: Docker Engine - Community
 Version:           20.10.0
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        7287ab3
 Built:             Tue Dec  8 18:59:53 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.0
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       eeddea2
  Built:            Tue Dec  8 18:57:44 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.9
  GitCommit:        e25210fe30a0a703442421b0f60afac609f950a3
 nvidia:
  Version:          1.0.1
  GitCommit:        v1.0.1-0-g4144b63
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
  • [ ] Docker command, image and tag used
  • [x] Kernel version from uname -a
Linux srv1 5.4.0-56-generic #62~18.04.1-Ubuntu SMP Tue Nov 24 10:07:50 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • [ ] Any relevant kernel output lines from dmesg
  • [x] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
||/ Name                                                                            Version                                      Architecture                                 Description
+++-===============================================================================-============================================-============================================-===================================================================================================================================================================
un  libgldispatch0-nvidia                                                           <none>                                       <none>                                       (no description available)
ii  libnvidia-cfg1-460:amd64                                                        460.73.01-0ubuntu1                           amd64                                        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                                                              <none>                                       <none>                                       (no description available)
un  libnvidia-common                                                                <none>                                       <none>                                       (no description available)
ii  libnvidia-common-460                                                            460.73.01-0ubuntu1                           all                                          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-460:amd64                                                     460.73.01-0ubuntu1                           amd64                                        NVIDIA libcompute package
ii  libnvidia-container-tools                                                       1.4.0-1                                      amd64                                        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                                                      1.4.0-1                                      amd64                                        NVIDIA container runtime library
un  libnvidia-decode                                                                <none>                                       <none>                                       (no description available)
ii  libnvidia-decode-460:amd64                                                      460.73.01-0ubuntu1                           amd64                                        NVIDIA Video Decoding runtime libraries
un  libnvidia-encode                                                                <none>                                       <none>                                       (no description available)
ii  libnvidia-encode-460:amd64                                                      460.73.01-0ubuntu1                           amd64                                        NVENC Video Encoding runtime library
un  libnvidia-extra                                                                 <none>                                       <none>                                       (no description available)
ii  libnvidia-extra-460:amd64                                                       460.73.01-0ubuntu1                           amd64                                        Extra libraries for the NVIDIA driver
un  libnvidia-fbc1                                                                  <none>                                       <none>                                       (no description available)
ii  libnvidia-fbc1-460:amd64                                                        460.73.01-0ubuntu1                           amd64                                        NVIDIA OpenGL-based Framebuffer Capture runtime library
un  libnvidia-gl                                                                    <none>                                       <none>                                       (no description available)
ii  libnvidia-gl-460:amd64                                                          460.73.01-0ubuntu1                           amd64                                        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un  libnvidia-ifr1                                                                  <none>                                       <none>                                       (no description available)
ii  libnvidia-ifr1-460:amd64                                                        460.73.01-0ubuntu1                           amd64                                        NVIDIA OpenGL-based Inband Frame Readback runtime library
un  libnvidia-ml1                                                                   <none>                                       <none>                                       (no description available)
un  nvidia-304                                                                      <none>                                       <none>                                       (no description available)
un  nvidia-340                                                                      <none>                                       <none>                                       (no description available)
un  nvidia-384                                                                      <none>                                       <none>                                       (no description available)
un  nvidia-390                                                                      <none>                                       <none>                                       (no description available)
un  nvidia-common                                                                   <none>                                       <none>                                       (no description available)
ii  nvidia-compute-utils-460                                                        460.73.01-0ubuntu1                           amd64                                        NVIDIA compute utilities
ii  nvidia-container-runtime                                                        3.5.0-1                                      amd64                                        NVIDIA container runtime
un  nvidia-container-runtime-hook                                                   <none>                                       <none>                                       (no description available)
ii  nvidia-container-toolkit                                                        1.5.1-1                                      amd64                                        NVIDIA container runtime hook
ii  nvidia-cuda-dev                                                                 9.1.85-3ubuntu1                              amd64                                        NVIDIA CUDA development files
ii  nvidia-cuda-doc                                                                 9.1.85-3ubuntu1                              all                                          NVIDIA CUDA and OpenCL documentation
ii  nvidia-cuda-gdb                                                                 9.1.85-3ubuntu1                              amd64                                        NVIDIA CUDA Debugger (GDB)
ii  nvidia-cuda-toolkit                                                             9.1.85-3ubuntu1                              amd64                                        NVIDIA CUDA development toolkit
ii  nvidia-dkms-460                                                                 460.73.01-0ubuntu1                           amd64                                        NVIDIA DKMS package
un  nvidia-dkms-kernel                                                              <none>                                       <none>                                       (no description available)
un  nvidia-driver                                                                   <none>                                       <none>                                       (no description available)
ii  nvidia-driver-460                                                               460.73.01-0ubuntu1                           amd64                                        NVIDIA driver metapackage
un  nvidia-driver-binary                                                            <none>                                       <none>                                       (no description available)
un  nvidia-kernel-common                                                            <none>                                       <none>                                       (no description available)
ii  nvidia-kernel-common-460                                                        460.73.01-0ubuntu1                           amd64                                        Shared files used with the kernel module
un  nvidia-kernel-source                                                            <none>                                       <none>                                       (no description available)
ii  nvidia-kernel-source-460                                                        460.73.01-0ubuntu1                           amd64                                        NVIDIA kernel source package
un  nvidia-legacy-304xx-vdpau-driver                                                <none>                                       <none>                                       (no description available)
un  nvidia-legacy-340xx-vdpau-driver                                                <none>                                       <none>                                       (no description available)
un  nvidia-libopencl1                                                               <none>                                       <none>                                       (no description available)
un  nvidia-libopencl1-dev                                                           <none>                                       <none>                                       (no description available)
ii  nvidia-modprobe                                                                 465.19.01-0ubuntu1                           amd64                                        Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-dev:amd64                                                         9.1.85-3ubuntu1                              amd64                                        NVIDIA OpenCL development files
un  nvidia-opencl-icd                                                               <none>                                       <none>                                       (no description available)
un  nvidia-persistenced                                                             <none>                                       <none>                                       (no description available)
ii  nvidia-prime                                                                    0.8.16~0.18.04.1                             all                                          Tools to enable NVIDIA's Prime
ii  nvidia-profiler                                                                 9.1.85-3ubuntu1                              amd64                                        NVIDIA Profiler for CUDA and OpenCL
ii  nvidia-settings                                                                 465.19.01-0ubuntu1                           amd64                                        Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binary                                                          <none>                                       <none>                                       (no description available)
un  nvidia-smi                                                                      <none>                                       <none>                                       (no description available)
un  nvidia-utils                                                                    <none>                                       <none>                                       (no description available)
ii  nvidia-utils-460                                                                460.73.01-0ubuntu1                           amd64                                        NVIDIA driver support binaries
un  nvidia-vdpau-driver                                                             <none>                                       <none>                                       (no description available)
ii  nvidia-visual-profiler                                                          9.1.85-3ubuntu1                              amd64                                        NVIDIA Visual Profiler for CUDA and OpenCL
ii  xserver-xorg-video-nvidia-460                                                   460.73.01-0ubuntu1                           amd64                                        NVIDIA binary Xorg driver
  • [x] NVIDIA container library version from nvidia-container-cli -V
version: 1.4.0
build date: 2021-04-24T14:25+00:00
build revision: 704a698b7a0ceec07a48e56c37365c741718c2df
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

davidho27941 avatar Aug 30 '21 10:08 davidho27941

@davidho27941 I see from your description that you are installing version 1.0.0-beta4 of the device plugin:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml

The versioning of the NVIDIA device plugin is somewhat confusing: despite the name, 1.0.0-beta4 is an older release, and v0.9.0 is the latest: https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.9.0

Could you see whether using this (or one of the more recent releases) addresses your issue?
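
For example, the v0.9.0 manifest can be deployed with something like this (the exact manifest path is assumed from that release's README):

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml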

elezar avatar Aug 30 '21 10:08 elezar

@davidho27941 I see from your description that you are installing version 1.0.0-beta4 of the device plugin:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml

The versioning of the NVIDIA device plugin is somewhat confusing: despite the name, 1.0.0-beta4 is an older release, and v0.9.0 is the latest: https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.9.0

Could you see whether using this (or one of the more recent releases) addresses your issue?

Hi @elezar,

Actually, I also failed with that version.

The v0.9.0 release also fails to load the NVML library.

I just ran a check; the output is shown below:

2021/08/30 10:44:12 Loading NVML
2021/08/30 10:44:12 Failed to initialize NVML: could not load NVML library.
2021/08/30 10:44:12 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/08/30 10:44:12 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/08/30 10:44:12 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/08/30 10:44:12 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes

The output of kubectl describe pod -n kube-system nvidia-device-plugin-daemonset-rwng2:

Name:                 nvidia-device-plugin-daemonset-rwng2
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 srv1/192.168.50.248
Start Time:           Mon, 30 Aug 2021 18:44:02 +0800
Labels:               controller-revision-hash=9d47c6878
                      name=nvidia-device-plugin-ds
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Running
IP:                   10.244.0.8
IPs:
  IP:           10.244.0.8
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  cri-o://7a4820d5ba7d657245b1a8300519bcfda0a1ccd73d33d848f7762ba5e19a4b47
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.9.0
    Image ID:      nvcr.io/nvidia/k8s-device-plugin@sha256:964847cc3fd85ead286be1d74d961f53d638cd4875af51166178b17bba90192f
    Port:          <none>
    Host Port:     <none>
    Args:
      --fail-on-init-error=false
    State:          Running
      Started:      Mon, 30 Aug 2021 18:44:12 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pv8tx (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  kube-api-access-pv8tx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  44s   default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-daemonset-rwng2 to srv1
  Normal  Pulling    43s   kubelet            Pulling image "nvcr.io/nvidia/k8s-device-plugin:v0.9.0"
  Normal  Pulled     35s   kubelet            Successfully pulled image "nvcr.io/nvidia/k8s-device-plugin:v0.9.0" in 8.591371132s
  Normal  Created    34s   kubelet            Created container nvidia-device-plugin-ctr
  Normal  Started    34s   kubelet            Started container nvidia-device-plugin-ctr

Best regards, David

davidho27941 avatar Aug 30 '21 10:08 davidho27941

You also mentioned:

I was trying to create a cluster using the CRI-O container runtime and the Flannel CNI.

Does this mean that K8s is using CRI-O to launch containers? Has CRI-O been configured to use the NVIDIA Container Runtime, or does it have the NVIDIA Container Toolkit / hook configured?
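
For reference, pointing CRI-O at the NVIDIA runtime usually amounts to a drop-in configuration along these lines, followed by a restart of crio (this is only a sketch; the file name and section layout are assumptions based on the CRI-O documentation):

# /etc/crio/crio.conf.d/99-nvidia.conf
[crio.runtime]
default_runtime = "nvidia"

[crio.runtime.runtimes.nvidia]
runtime_path = "/usr/bin/nvidia-container-runtime"
runtime_type = "oci"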

elezar avatar Aug 30 '21 10:08 elezar

You also mentioned:

I was trying to create a cluster using the CRI-O container runtime and the Flannel CNI.

Does this mean that K8s is using CRI-O to launch containers? Has CRI-O been configured to use the NVIDIA Container Runtime, or does it have the NVIDIA Container Toolkit / hook configured?

Actually, I am not sure about this, but I also failed when running with containerd.

The following is my containerd configuration:

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0

[grpc]
  address = "/run/containerd/containerd.sock"
  tcp_address = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[ttrpc]
  address = ""
  uid = 0
  gid = 0

[debug]
  address = ""
  uid = 0
  gid = 0
  level = ""

[metrics]
  address = ""
  grpc_histogram = false

[cgroup]
  path = ""

[timeouts]
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[plugins]
  [plugins."io.containerd.gc.v1.scheduler"]
    pause_threshold = 0.02
    deletion_threshold = 0
    mutation_threshold = 100
    schedule_delay = "0s"
    startup_delay = "100ms"
  [plugins."io.containerd.grpc.v1.cri"]
    disable_tcp_service = true
    stream_server_address = "127.0.0.1"
    stream_server_port = "0"
    stream_idle_timeout = "4h0m0s"
    enable_selinux = false
    selinux_category_range = 1024
    sandbox_image = "k8s.gcr.io/pause:3.2"
    stats_collect_period = 10
    systemd_cgroup = false
    enable_tls_streaming = false
    max_container_log_line_size = 16384
    disable_cgroup = false
    disable_apparmor = false
    restrict_oom_score_adj = false
    max_concurrent_downloads = 3
    disable_proc_mount = false
    unset_seccomp_profile = ""
    tolerate_missing_hugetlb_controller = true
    disable_hugetlb_controller = true
    ignore_image_defined_volumes = false
    [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "overlayfs"
      default_runtime_name = "nvidia"
      no_pivot = false
      disable_snapshot_annotations = false
      discard_unpacked_layers = false
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v1.linux"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false
          base_runtime_spec = ""
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            SystemdCgroup = true
            BinaryName="/usr/bin/nvidia-container-runtime"
    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      max_conf_num = 1
      conf_template = ""
    [plugins."io.containerd.grpc.v1.cri".registry]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://registry-1.docker.io"]
    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = ""
    [plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
      tls_cert_file = ""
      tls_key_file = ""
  [plugins."io.containerd.internal.v1.opt"]
    path = "/opt/containerd"
  [plugins."io.containerd.internal.v1.restart"]
    interval = "10s"
  [plugins."io.containerd.metadata.v1.bolt"]
    content_sharing_policy = "shared"
  [plugins."io.containerd.monitor.v1.cgroups"]
    no_prometheus = false
  [plugins."io.containerd.runtime.v1.linux"]
    shim = "containerd-shim"
    runtime = "runc"
    runtime_root = ""
    no_shim = false
    shim_debug = false
  [plugins."io.containerd.runtime.v2.task"]
    platforms = ["linux/amd64"]
  [plugins."io.containerd.service.v1.diff-service"]
    default = ["walking"]
  [plugins."io.containerd.snapshotter.v1.devmapper"]
    root_path = ""
    pool_name = ""
    base_image_size = ""
    async_remove = false
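
After editing this file, containerd needs a restart for the change to take effect; a quick way to check which default runtime the CRI reports would be something like the following (the field name is an assumption):

sudo systemctl restart containerd
sudo crictl info | grep -i defaultruntimename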

Best regards, David

davidho27941 avatar Aug 30 '21 11:08 davidho27941

Since the image works with docker, it would appear that your NVIDIA Container Toolkit installation is at least sane. In order to debug this further, could you uncomment the #debug = lines in /etc/nvidia-container-runtime/config.toml? Then run the nvidia-smi command in a container (ubuntu should do) using ctr and attach the contents of /var/log/nvidia-container-*.log to the issue. If you're able to clear those logs and then also include the versions produced when running the container using docker, that would provide a point for comparison.
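
For example, the relevant config.toml lines look roughly like this once uncommented (these are the default paths, so treat this as a sketch):

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"

and a test container can then be run through the NVIDIA runtime with ctr along these lines (the flags and image tag are only illustrative assumptions, adjust as needed):

sudo ctr image pull docker.io/nvidia/cuda:11.1-base
sudo ctr run --rm -t --runc-binary=/usr/bin/nvidia-container-runtime --env NVIDIA_VISIBLE_DEVICES=all docker.io/nvidia/cuda:11.1-base nvml-test nvidia-smi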

elezar avatar Aug 30 '21 11:08 elezar

Since the image works with docker, it would appear that your NVIDIA Container Toolkit installation is at least sane. In order to debug this further, could you uncomment the #debug = lines in /etc/nvidia-container-runtime/config.toml? Then run the nvidia-smi command in a container (ubuntu should do) using ctr and attach the contents of /var/log/nvidia-container-*.log to the issue. If you're able to clear those logs and then also include the versions produced when running the container using docker, that would provide a point for comparison.

Hi,

The following is the log output produced when I run docker run nvidia/cuda:11.1-base nvidia-smi.

Partial output of /var/log/nvidia-container-toolkit.log:

-- WARNING, the following logs are for debugging purposes only --

I0830 11:16:34.890421 49163 nvc.c:372] initializing library context (version=1.4.0, build=704a698b7a0ceec07a48e56c37365c741718c2df)
I0830 11:16:34.890480 49163 nvc.c:346] using root /
I0830 11:16:34.890488 49163 nvc.c:347] using ldcache /etc/ld.so.cache
I0830 11:16:34.890496 49163 nvc.c:348] using unprivileged user 65534:65534
I0830 11:16:34.890515 49163 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0830 11:16:34.890627 49163 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment
I0830 11:16:34.892652 49169 nvc.c:274] loading kernel module nvidia
I0830 11:16:34.892812 49169 nvc.c:278] running mknod for /dev/nvidiactl
I0830 11:16:34.892841 49169 nvc.c:282] running mknod for /dev/nvidia0
I0830 11:16:34.892858 49169 nvc.c:286] running mknod for all nvcaps in /dev/nvidia-caps
I0830 11:16:34.898656 49169 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0830 11:16:34.898750 49169 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0830 11:16:34.900715 49169 nvc.c:292] loading kernel module nvidia_uvm
I0830 11:16:34.900784 49169 nvc.c:296] running mknod for /dev/nvidia-uvm
I0830 11:16:34.900846 49169 nvc.c:301] loading kernel module nvidia_modeset
I0830 11:16:34.900908 49169 nvc.c:305] running mknod for /dev/nvidia-modeset
I0830 11:16:34.901104 49171 driver.c:101] starting driver service
I0830 11:16:34.903318 49163 nvc_container.c:388] configuring container with 'compute utility supervised'
I0830 11:16:34.903488 49163 nvc_container.c:236] selecting /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/local/cuda-11.1/compat/libcuda.so.455.45.01
I0830 11:16:34.903523 49163 nvc_container.c:236] selecting /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/local/cuda-11.1/compat/libnvidia-ptxjitcompiler.so.455.45.01
I0830 11:16:34.903657 49163 nvc_container.c:408] setting pid to 49105
I0830 11:16:34.903665 49163 nvc_container.c:409] setting rootfs to /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged
I0830 11:16:34.903670 49163 nvc_container.c:410] setting owner to 0:0
I0830 11:16:34.903675 49163 nvc_container.c:411] setting bins directory to /usr/bin
I0830 11:16:34.903680 49163 nvc_container.c:412] setting libs directory to /usr/lib/x86_64-linux-gnu
I0830 11:16:34.903685 49163 nvc_container.c:413] setting libs32 directory to /usr/lib/i386-linux-gnu
I0830 11:16:34.903690 49163 nvc_container.c:414] setting cudart directory to /usr/local/cuda
I0830 11:16:34.903695 49163 nvc_container.c:415] setting ldconfig to @/sbin/ldconfig.real (host relative)
I0830 11:16:34.903700 49163 nvc_container.c:416] setting mount namespace to /proc/49105/ns/mnt
I0830 11:16:34.903705 49163 nvc_container.c:418] setting devices cgroup to /sys/fs/cgroup/devices/system.slice/docker-19816aa85232e1fc7d31970489ccced5c68acbfc9f97d625ffc17387bb2e77fd.scope
I0830 11:16:34.903712 49163 nvc_info.c:676] requesting driver information with ''
I0830 11:16:34.904962 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.460.73.01
I0830 11:16:34.905038 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.460.73.01
I0830 11:16:34.905070 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.460.73.01
I0830 11:16:34.905103 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.73.01
I0830 11:16:34.905145 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.460.73.01
I0830 11:16:34.905186 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.73.01
I0830 11:16:34.905215 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.460.73.01
I0830 11:16:34.905249 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.73.01
I0830 11:16:34.905290 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.460.73.01
I0830 11:16:34.905354 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.460.73.01
I0830 11:16:34.905383 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.73.01
I0830 11:16:34.905413 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.460.73.01
I0830 11:16:34.905442 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.460.73.01
I0830 11:16:34.905483 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.460.73.01
I0830 11:16:34.905525 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.460.73.01
I0830 11:16:34.905555 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.73.01
I0830 11:16:34.905586 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.73.01
I0830 11:16:34.905625 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.460.73.01
I0830 11:16:34.905653 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.73.01
I0830 11:16:34.905694 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.460.73.01
I0830 11:16:34.906212 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.460.73.01
I0830 11:16:34.906391 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.460.73.01
I0830 11:16:34.906423 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.460.73.01
I0830 11:16:34.906454 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.460.73.01
I0830 11:16:34.906486 49163 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.460.73.01
W0830 11:16:34.906560 49163 nvc_info.c:350] missing library libnvidia-nscq.so
W0830 11:16:34.906567 49163 nvc_info.c:350] missing library libnvidia-fatbinaryloader.so
W0830 11:16:34.906573 49163 nvc_info.c:350] missing library libvdpau_nvidia.so
W0830 11:16:34.906579 49163 nvc_info.c:354] missing compat32 library libnvidia-ml.so
W0830 11:16:34.906585 49163 nvc_info.c:354] missing compat32 library libnvidia-cfg.so
W0830 11:16:34.906592 49163 nvc_info.c:354] missing compat32 library libnvidia-nscq.so
W0830 11:16:34.906598 49163 nvc_info.c:354] missing compat32 library libcuda.so
W0830 11:16:34.906604 49163 nvc_info.c:354] missing compat32 library libnvidia-opencl.so
W0830 11:16:34.906610 49163 nvc_info.c:354] missing compat32 library libnvidia-ptxjitcompiler.so
W0830 11:16:34.906616 49163 nvc_info.c:354] missing compat32 library libnvidia-fatbinaryloader.so
W0830 11:16:34.906622 49163 nvc_info.c:354] missing compat32 library libnvidia-allocator.so
W0830 11:16:34.906628 49163 nvc_info.c:354] missing compat32 library libnvidia-compiler.so
W0830 11:16:34.906634 49163 nvc_info.c:354] missing compat32 library libnvidia-ngx.so
W0830 11:16:34.906640 49163 nvc_info.c:354] missing compat32 library libvdpau_nvidia.so
W0830 11:16:34.906646 49163 nvc_info.c:354] missing compat32 library libnvidia-encode.so
W0830 11:16:34.906652 49163 nvc_info.c:354] missing compat32 library libnvidia-opticalflow.so
W0830 11:16:34.906658 49163 nvc_info.c:354] missing compat32 library libnvcuvid.so
W0830 11:16:34.906664 49163 nvc_info.c:354] missing compat32 library libnvidia-eglcore.so
W0830 11:16:34.906670 49163 nvc_info.c:354] missing compat32 library libnvidia-glcore.so
W0830 11:16:34.906676 49163 nvc_info.c:354] missing compat32 library libnvidia-tls.so
W0830 11:16:34.906682 49163 nvc_info.c:354] missing compat32 library libnvidia-glsi.so
W0830 11:16:34.906688 49163 nvc_info.c:354] missing compat32 library libnvidia-fbc.so
W0830 11:16:34.906694 49163 nvc_info.c:354] missing compat32 library libnvidia-ifr.so
W0830 11:16:34.906700 49163 nvc_info.c:354] missing compat32 library libnvidia-rtcore.so
W0830 11:16:34.906706 49163 nvc_info.c:354] missing compat32 library libnvoptix.so
W0830 11:16:34.906712 49163 nvc_info.c:354] missing compat32 library libGLX_nvidia.so
W0830 11:16:34.906718 49163 nvc_info.c:354] missing compat32 library libEGL_nvidia.so
W0830 11:16:34.906729 49163 nvc_info.c:354] missing compat32 library libGLESv2_nvidia.so
W0830 11:16:34.906735 49163 nvc_info.c:354] missing compat32 library libGLESv1_CM_nvidia.so
W0830 11:16:34.906741 49163 nvc_info.c:354] missing compat32 library libnvidia-glvkspirv.so
W0830 11:16:34.906747 49163 nvc_info.c:354] missing compat32 library libnvidia-cbl.so
I0830 11:16:34.907014 49163 nvc_info.c:276] selecting /usr/bin/nvidia-smi
I0830 11:16:34.907032 49163 nvc_info.c:276] selecting /usr/bin/nvidia-debugdump
I0830 11:16:34.907049 49163 nvc_info.c:276] selecting /usr/bin/nvidia-persistenced
I0830 11:16:34.907074 49163 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-control
I0830 11:16:34.907090 49163 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-server
W0830 11:16:34.907173 49163 nvc_info.c:376] missing binary nv-fabricmanager
I0830 11:16:34.907195 49163 nvc_info.c:438] listing device /dev/nvidiactl
I0830 11:16:34.907201 49163 nvc_info.c:438] listing device /dev/nvidia-uvm
I0830 11:16:34.907207 49163 nvc_info.c:438] listing device /dev/nvidia-uvm-tools
I0830 11:16:34.907213 49163 nvc_info.c:438] listing device /dev/nvidia-modeset
I0830 11:16:34.907236 49163 nvc_info.c:317] listing ipc /run/nvidia-persistenced/socket
W0830 11:16:34.907256 49163 nvc_info.c:321] missing ipc /var/run/nvidia-fabricmanager/socket
W0830 11:16:34.907269 49163 nvc_info.c:321] missing ipc /tmp/nvidia-mps
I0830 11:16:34.907276 49163 nvc_info.c:733] requesting device information with ''
I0830 11:16:34.913002 49163 nvc_info.c:623] listing device /dev/nvidia0 (GPU-948211b6-df7a-5768-ca7b-a84e23d9404d at 00000000:01:00.0)
I0830 11:16:34.913062 49163 nvc_mount.c:344] mounting tmpfs at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/proc/driver/nvidia
I0830 11:16:34.913573 49163 nvc_mount.c:112] mounting /usr/bin/nvidia-smi at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/bin/nvidia-smi
I0830 11:16:34.913629 49163 nvc_mount.c:112] mounting /usr/bin/nvidia-debugdump at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/bin/nvidia-debugdump
I0830 11:16:34.913678 49163 nvc_mount.c:112] mounting /usr/bin/nvidia-persistenced at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/bin/nvidia-persistenced
I0830 11:16:34.913723 49163 nvc_mount.c:112] mounting /usr/bin/nvidia-cuda-mps-control at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/bin/nvidia-cuda-mps-control
I0830 11:16:34.913769 49163 nvc_mount.c:112] mounting /usr/bin/nvidia-cuda-mps-server at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/bin/nvidia-cuda-mps-server
I0830 11:16:34.913912 49163 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.73.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.73.01
I0830 11:16:34.913967 49163 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.73.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.73.01
I0830 11:16:34.914014 49163 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libcuda.so.460.73.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libcuda.so.460.73.01
I0830 11:16:34.914060 49163 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.73.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.73.01
I0830 11:16:34.914109 49163 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.73.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.73.01
I0830 11:16:34.914165 49163 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.73.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.73.01
I0830 11:16:34.914212 49163 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.73.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.73.01
I0830 11:16:34.914241 49163 nvc_mount.c:524] creating symlink /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
I0830 11:16:34.914325 49163 nvc_mount.c:112] mounting /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/local/cuda-11.1/compat/libcuda.so.455.45.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libcuda.so.455.45.01
I0830 11:16:34.914378 49163 nvc_mount.c:112] mounting /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/local/cuda-11.1/compat/libnvidia-ptxjitcompiler.so.455.45.01 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.455.45.01
I0830 11:16:34.914499 49163 nvc_mount.c:239] mounting /run/nvidia-persistenced/socket at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/run/nvidia-persistenced/socket
I0830 11:16:34.914547 49163 nvc_mount.c:208] mounting /dev/nvidiactl at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/dev/nvidiactl
I0830 11:16:34.914582 49163 nvc_mount.c:499] whitelisting device node 195:255
I0830 11:16:34.914624 49163 nvc_mount.c:208] mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/dev/nvidia-uvm
I0830 11:16:34.914649 49163 nvc_mount.c:499] whitelisting device node 508:0
I0830 11:16:34.914680 49163 nvc_mount.c:208] mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/dev/nvidia-uvm-tools
I0830 11:16:34.914704 49163 nvc_mount.c:499] whitelisting device node 508:1
I0830 11:16:34.914751 49163 nvc_mount.c:208] mounting /dev/nvidia0 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/dev/nvidia0
I0830 11:16:34.914823 49163 nvc_mount.c:412] mounting /proc/driver/nvidia/gpus/0000:01:00.0 at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged/proc/driver/nvidia/gpus/0000:01:00.0
I0830 11:16:34.914850 49163 nvc_mount.c:499] whitelisting device node 195:0
I0830 11:16:34.914869 49163 nvc_ldcache.c:360] executing /sbin/ldconfig.real from host at /var/lib/docker/overlay2/664bb5bcdb871aea189854c432aab88bc4e536e685839e647ff5b3f0a09aa188/merged
I0830 11:16:34.963847 49163 nvc.c:423] shutting down library context
I0830 11:16:34.964461 49171 driver.c:163] terminating driver service
I0830 11:16:34.964805 49163 driver.c:203] driver service terminated successfully

Partial output of /var/log/nvidia-container-runtime.log:

2021/08/30 19:14:07 No modification required
2021/08/30 19:14:07 Forwarding command to runtime
2021/08/30 19:14:07 Bundle directory path is empty, using working directory.
2021/08/30 19:14:07 Using bundle directory: /
2021/08/30 19:14:07 Using OCI specification file path: /config.json
2021/08/30 19:14:07 Looking for runtime binary 'docker-runc'
2021/08/30 19:14:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/30 19:14:07 Looking for runtime binary 'runc'
2021/08/30 19:14:07 Found runtime binary '/usr/bin/runc'
2021/08/30 19:14:07 Running /usr/bin/nvidia-container-runtime

2021/08/30 19:14:07 No modification required
2021/08/30 19:14:07 Forwarding command to runtime
2021/08/30 19:16:07 Using bundle directory: /run/containerd/io.containerd.runtime.v2.task/moby/19816aa85232e1fc7d31970489ccced5c68acbfc9f97d625ffc17387bb2e77fd
2021/08/30 19:16:07 Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/19816aa85232e1fc7d31970489ccced5c68acbfc9f97d625ffc17387bb2e77fd/config.json
2021/08/30 19:16:07 Looking for runtime binary 'docker-runc'
2021/08/30 19:16:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/30 19:16:07 Looking for runtime binary 'runc'
2021/08/30 19:16:07 Found runtime binary '/usr/bin/runc'
2021/08/30 19:16:07 Running /usr/bin/nvidia-container-runtime

2021/08/30 19:16:07 'create' command detected; modification required
2021/08/30 19:16:07 prestart hook path: /usr/bin/nvidia-container-runtime-hook

2021/08/30 19:16:07 Forwarding command to runtime
2021/08/30 19:16:07 Bundle directory path is empty, using working directory.
2021/08/30 19:16:07 Using bundle directory: /run/containerd/io.containerd.runtime.v2.task/moby/19816aa85232e1fc7d31970489ccced5c68acbfc9f97d625ffc17387bb2e77fd
2021/08/30 19:16:07 Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/19816aa85232e1fc7d31970489ccced5c68acbfc9f97d625ffc17387bb2e77fd/config.json
2021/08/30 19:16:07 Looking for runtime binary 'docker-runc'
2021/08/30 19:16:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/30 19:16:07 Looking for runtime binary 'runc'
2021/08/30 19:16:07 Found runtime binary '/usr/bin/runc'
2021/08/30 19:16:07 Running /usr/bin/nvidia-container-runtime

2021/08/30 19:16:07 No modification required
2021/08/30 19:16:07 Forwarding command to runtime
2021/08/30 19:16:07 Bundle directory path is empty, using working directory.
2021/08/30 19:16:07 Using bundle directory: /run/containerd/io.containerd.runtime.v2.task/moby/19816aa85232e1fc7d31970489ccced5c68acbfc9f97d625ffc17387bb2e77fd
2021/08/30 19:16:07 Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/19816aa85232e1fc7d31970489ccced5c68acbfc9f97d625ffc17387bb2e77fd/config.json
2021/08/30 19:16:07 Looking for runtime binary 'docker-runc'
2021/08/30 19:16:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/30 19:16:07 Looking for runtime binary 'runc'
2021/08/30 19:16:07 Found runtime binary '/usr/bin/runc'
2021/08/30 19:16:07 Running /usr/bin/nvidia-container-runtime

2021/08/30 19:16:07 No modification required
2021/08/30 19:16:07 Forwarding command to runtime

Best regards, David

davidho27941 avatar Aug 30 '21 11:08 davidho27941

@elezar

Hi,

Maybe my earlier description was a bit unclear to you.

The current status is:

I start a docker container using docker run --security-opt=no-new-privileges --cap-drop=ALL --restart always --network=none -dit -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0 to expose the device-plugin socket for the Kubernetes service (following the steps described in https://github.com/NVIDIA/k8s-device-plugin#with-docker).

This container loads the NVML library successfully, and an nvidia.com/gpu resource gets registered with Kubernetes.

2021/08/31 05:06:32 Loading NVML
2021/08/31 05:06:32 Starting FS watcher.
2021/08/31 05:06:32 Starting OS watcher.
2021/08/31 05:06:32 Retreiving plugins.
2021/08/31 05:06:32 Starting GRPC server for 'nvidia.com/gpu'
2021/08/31 05:06:32 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/08/31 05:06:32 Registered device plugin for 'nvidia.com/gpu' with Kubelet

But the pod created for the device plugin still cannot load the NVML library.

The output of kubectl logs nvidia-device-plugin-daemonset-4ddpg -n kube-system:

2021/08/31 06:46:42 Loading NVML
2021/08/31 06:46:42 Failed to initialize NVML: could not load NVML library.
2021/08/31 06:46:42 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/08/31 06:46:42 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/08/31 06:46:42 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/08/31 06:46:42 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes

Now, I can create a pod with the following config without getting the 0/1 nodes are available: 1 Insufficient nvidia.com/gpu error message.

apiVersion: v1
kind: Pod
metadata:
  name: torch
  labels:
    app: torch
spec:
  containers:
  - name: torch
    image: nvcr.io/nvidia/pytorch:21.03-py3
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
    ports:
      - containerPort: 8888
        protocol: TCP
    resources:
      requests:
        nvidia.com/gpu: 1
        memory: "64Mi"
        cpu: "250m"
        ephemeral-storage: "5G"
      limits:
        nvidia.com/gpu: 1
        memory: "128Mi"
        cpu: "500m"
        ephemeral-storage: "10G"
    volumeMounts:
      - mountPath: "/data"
        name: test-volume
  volumes: 
    - name: test-volume
      hostPath: 
        path: "/home/david/jupyter_hub"
        type: Directory

But inside the pod I cannot run nvidia-smi to query the GPU status, and torch.cuda.is_available() also returns False, telling me it cannot find a GPU to run on.
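
For reference, the allocation can be double-checked with plain kubectl commands; the pod name torch comes from the spec above, and <gpu-node-name> is a placeholder:

# Check whether the container actually received a GPU from the device plugin:
kubectl exec torch -- env | grep -i nvidia
kubectl exec torch -- nvidia-smi

# Check what the node advertises and what has been allocated:
kubectl describe node <gpu-node-name> | grep -A7 "Allocated resources"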

Do you have any idea about this?

Many thanks, David

davidho27941 avatar Aug 31 '21 07:08 davidho27941

@davidho27941 thanks for the additional information. You mentioned in your description that k8s is configured to launch containers using crio:

I was trying to create a cluster using crio container runtime interface and flannel CNI.

This means that crio needs to be configured to use the nvidia-container-runtime or have the nvidia-container-toolkit installed as a prestart hook. You also mentioned that the container failed to launch with containerd.
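
As a rough sketch only (the drop-in path and option names are assumed from common CRI-O plus nvidia-container-runtime setups, not verified on your node), the CRI-O side of that configuration could look like:

# Hedged sketch: register an "nvidia" OCI runtime with CRI-O, make it the default, and restart CRI-O.
# Paths assume the standard nvidia-container-runtime package layout.
sudo tee /etc/crio/crio.conf.d/99-nvidia.conf <<'EOF'
[crio.runtime]
default_runtime = "nvidia"
[crio.runtime.runtimes.nvidia]
runtime_path = "/usr/bin/nvidia-container-runtime"
runtime_type = "oci"
EOF
sudo systemctl restart crio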

Could you repeat the command:

docker run --security-opt=no-new-privileges --cap-drop=ALL --restart always --network=none -dit -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0

using ctr instead of docker:

ctr run --rm -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0

And include the latest lines of /var/log/nvidia-container-*.log for the failed container.
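
Note that ctr's command line differs a little from docker's; depending on the containerd version, an invocation along these lines may be needed instead (the flags and the fully qualified image reference are assumptions, not something tested in this thread):

# Hedged sketch: pull the image with a fully qualified reference, then run it with a bind mount
# and the nvidia runtime binary.
sudo ctr image pull docker.io/nvidia/k8s-device-plugin:v0.9.0
sudo ctr run --rm \
  --runc-binary /usr/bin/nvidia-container-runtime \
  --mount type=bind,src=/var/lib/kubelet/device-plugins,dst=/var/lib/kubelet/device-plugins,options=rbind:rw \
  docker.io/nvidia/k8s-device-plugin:v0.9.0 device-plugin-test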

elezar avatar Aug 31 '21 07:08 elezar

@davidho27941 thanks for the additional information. You mentioned in your description that k8s is configured to launch containers using crio:

I was trying to create a cluster using crio container runtime interface and flannel CNI.

This means that crio needs to be configured to use the nvidia-container-runtime or have the nvidia-container-toolkit installed as a prestart hook. You also mentioned that the container failed to launch with containerd.

Could you repeat the command:

docker run --security-opt=no-new-privileges --cap-drop=ALL --restart always --network=none -dit -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0

using ctr instead of docker:

ctr run --rm -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0

And include the latest lines of /var/log/nvidia-container-*.log for the failed container.

Hi @elezar ,

The current configuration is running with containerd. Based on your previous comment, I initialized my cluster with containerd using the config shown in https://github.com/NVIDIA/k8s-device-plugin/issues/263#issuecomment-908247909

Many thanks, David

davidho27941 avatar Aug 31 '21 07:08 davidho27941

@davidho27941 thanks for the additional information. You mentioned in your description that k8s is configured to launch containers using crio:

I was trying to create a cluster using crio container runtime interface and flannel CNI.

This means that crio needs to be configured to use the nvidia-container-runtime or have the nvidia-container-toolkit installed as a prestart hook. You also mentioned that the container failed to launch with containerd.

Could you repeat the command:

docker run --security-opt=no-new-privileges --cap-drop=ALL --restart always --network=none -dit -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0

using ctr instead of docker:

ctr run --rm -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:v0.9.0

And include the latest lines of /var/log/nvidia-container-*.log for the failed container.

Hi

I ran the command and got the following outputs.

The output of /var/log/nvidia-container-runtime.log:

2021/08/31 15:56:07 No modification required
2021/08/31 15:56:07 Forwarding command to runtime
2021/08/31 15:56:07 Bundle directory path is empty, using working directory.
2021/08/31 15:56:07 Using bundle directory: /
2021/08/31 15:56:07 Using OCI specification file path: /config.json
2021/08/31 15:56:07 Looking for runtime binary 'docker-runc'
2021/08/31 15:56:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/31 15:56:07 Looking for runtime binary 'runc'
2021/08/31 15:56:07 Found runtime binary '/usr/bin/runc'
2021/08/31 15:56:07 Running /usr/bin/nvidia-container-runtime

2021/08/31 15:56:07 No modification required
2021/08/31 15:56:07 Forwarding command to runtime
2021/08/31 15:56:07 Bundle directory path is empty, using working directory.
2021/08/31 15:56:07 Using bundle directory: /
2021/08/31 15:56:07 Using OCI specification file path: /config.json
2021/08/31 15:56:07 Looking for runtime binary 'docker-runc'
2021/08/31 15:56:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/31 15:56:07 Looking for runtime binary 'runc'
2021/08/31 15:56:07 Found runtime binary '/usr/bin/runc'
2021/08/31 15:56:07 Running /usr/bin/nvidia-container-runtime

2021/08/31 15:56:07 No modification required
2021/08/31 15:56:07 Forwarding command to runtime
2021/08/31 15:56:07 Bundle directory path is empty, using working directory.
2021/08/31 15:56:07 Using bundle directory: /
2021/08/31 15:56:07 Using OCI specification file path: /config.json
2021/08/31 15:56:07 Looking for runtime binary 'docker-runc'
2021/08/31 15:56:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/31 15:56:07 Looking for runtime binary 'runc'
2021/08/31 15:56:07 Found runtime binary '/usr/bin/runc'
2021/08/31 15:56:07 Running /usr/bin/nvidia-container-runtime

2021/08/31 15:56:07 No modification required
2021/08/31 15:56:07 Forwarding command to runtime
2021/08/31 15:56:07 Using bundle directory: /run/containerd/io.containerd.runtime.v1.linux/moby/ec776b001c4d50405a2611fbae9524865b2b134adbb75b22a19d57f1859c2ec6
2021/08/31 15:56:07 Using OCI specification file path: /run/containerd/io.containerd.runtime.v1.linux/moby/ec776b001c4d50405a2611fbae9524865b2b134adbb75b22a19d57f1859c2ec6/config.json
2021/08/31 15:56:07 Looking for runtime binary 'docker-runc'
2021/08/31 15:56:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/31 15:56:07 Looking for runtime binary 'runc'
2021/08/31 15:56:07 Found runtime binary '/usr/bin/runc'
2021/08/31 15:56:07 Running /usr/bin/nvidia-container-runtime

2021/08/31 15:56:07 'create' command detected; modification required
2021/08/31 15:56:07 prestart hook path: /usr/bin/nvidia-container-runtime-hook

2021/08/31 15:56:07 Forwarding command to runtime
2021/08/31 15:56:07 Bundle directory path is empty, using working directory.
2021/08/31 15:56:07 Using bundle directory: /run/containerd/io.containerd.runtime.v1.linux/moby/ec776b001c4d50405a2611fbae9524865b2b134adbb75b22a19d57f1859c2ec6
2021/08/31 15:56:07 Using OCI specification file path: /run/containerd/io.containerd.runtime.v1.linux/moby/ec776b001c4d50405a2611fbae9524865b2b134adbb75b22a19d57f1859c2ec6/config.json
2021/08/31 15:56:07 Looking for runtime binary 'docker-runc'
2021/08/31 15:56:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/08/31 15:56:07 Looking for runtime binary 'runc'
2021/08/31 15:56:07 Found runtime binary '/usr/bin/runc'
2021/08/31 15:56:07 Running /usr/bin/nvidia-container-runtime

2021/08/31 15:56:07 No modification required
2021/08/31 15:56:07 Forwarding command to runtime

The output of /var/log/nvidia-container-toolkit.log :


-- WARNING, the following logs are for debugging purposes only --

I0831 07:56:43.428069 74570 nvc.c:372] initializing library context (version=1.4.0, build=704a698b7a0ceec07a48e56c37365c741718c2df)
I0831 07:56:43.428116 74570 nvc.c:346] using root /
I0831 07:56:43.428125 74570 nvc.c:347] using ldcache /etc/ld.so.cache
I0831 07:56:43.428132 74570 nvc.c:348] using unprivileged user 65534:65534
I0831 07:56:43.428150 74570 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0831 07:56:43.428248 74570 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment
I0831 07:56:43.430218 74574 nvc.c:274] loading kernel module nvidia
I0831 07:56:43.430416 74574 nvc.c:278] running mknod for /dev/nvidiactl
I0831 07:56:43.430451 74574 nvc.c:282] running mknod for /dev/nvidia0
I0831 07:56:43.430474 74574 nvc.c:286] running mknod for all nvcaps in /dev/nvidia-caps
I0831 07:56:43.437363 74574 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0831 07:56:43.437461 74574 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0831 07:56:43.439428 74574 nvc.c:292] loading kernel module nvidia_uvm
I0831 07:56:43.439484 74574 nvc.c:296] running mknod for /dev/nvidia-uvm
I0831 07:56:43.439546 74574 nvc.c:301] loading kernel module nvidia_modeset
I0831 07:56:43.439598 74574 nvc.c:305] running mknod for /dev/nvidia-modeset
I0831 07:56:43.439788 74575 driver.c:101] starting driver service
I0831 07:56:43.442007 74570 nvc_container.c:388] configuring container with 'utility supervised'
I0831 07:56:43.442226 74570 nvc_container.c:408] setting pid to 74521
I0831 07:56:43.442236 74570 nvc_container.c:409] setting rootfs to /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged
I0831 07:56:43.442242 74570 nvc_container.c:410] setting owner to 0:0
I0831 07:56:43.442248 74570 nvc_container.c:411] setting bins directory to /usr/bin
I0831 07:56:43.442254 74570 nvc_container.c:412] setting libs directory to /usr/lib/x86_64-linux-gnu
I0831 07:56:43.442260 74570 nvc_container.c:413] setting libs32 directory to /usr/lib/i386-linux-gnu
I0831 07:56:43.442265 74570 nvc_container.c:414] setting cudart directory to /usr/local/cuda
I0831 07:56:43.442271 74570 nvc_container.c:415] setting ldconfig to @/sbin/ldconfig.real (host relative)
I0831 07:56:43.442277 74570 nvc_container.c:416] setting mount namespace to /proc/74521/ns/mnt
I0831 07:56:43.442283 74570 nvc_container.c:418] setting devices cgroup to /sys/fs/cgroup/devices/system.slice/docker-ec776b001c4d50405a2611fbae9524865b2b134adbb75b22a19d57f1859c2ec6.scope
I0831 07:56:43.442290 74570 nvc_info.c:676] requesting driver information with ''
I0831 07:56:43.443292 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.460.73.01
I0831 07:56:43.443339 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.460.73.01
I0831 07:56:43.443364 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.460.73.01
I0831 07:56:43.443390 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.73.01
I0831 07:56:43.443423 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.460.73.01
I0831 07:56:43.443457 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.73.01
I0831 07:56:43.443481 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.460.73.01
I0831 07:56:43.443504 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.73.01
I0831 07:56:43.443541 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.460.73.01
I0831 07:56:43.443576 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.460.73.01
I0831 07:56:43.443615 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.73.01
I0831 07:56:43.443638 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.460.73.01
I0831 07:56:43.443662 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.460.73.01
I0831 07:56:43.443696 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.460.73.01
I0831 07:56:43.443727 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.460.73.01
I0831 07:56:43.443749 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.73.01
I0831 07:56:43.443772 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.73.01
I0831 07:56:43.443802 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.460.73.01
I0831 07:56:43.443823 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.73.01
I0831 07:56:43.443855 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.460.73.01
I0831 07:56:43.444118 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.460.73.01
I0831 07:56:43.444252 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.460.73.01
I0831 07:56:43.444276 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.460.73.01
I0831 07:56:43.444299 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.460.73.01
I0831 07:56:43.444324 74570 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.460.73.01
W0831 07:56:43.444383 74570 nvc_info.c:350] missing library libnvidia-nscq.so
W0831 07:56:43.444388 74570 nvc_info.c:350] missing library libnvidia-fatbinaryloader.so
W0831 07:56:43.444393 74570 nvc_info.c:350] missing library libvdpau_nvidia.so
W0831 07:56:43.444398 74570 nvc_info.c:354] missing compat32 library libnvidia-ml.so
W0831 07:56:43.444403 74570 nvc_info.c:354] missing compat32 library libnvidia-cfg.so
W0831 07:56:43.444407 74570 nvc_info.c:354] missing compat32 library libnvidia-nscq.so
W0831 07:56:43.444412 74570 nvc_info.c:354] missing compat32 library libcuda.so
W0831 07:56:43.444417 74570 nvc_info.c:354] missing compat32 library libnvidia-opencl.so
W0831 07:56:43.444421 74570 nvc_info.c:354] missing compat32 library libnvidia-ptxjitcompiler.so
W0831 07:56:43.444426 74570 nvc_info.c:354] missing compat32 library libnvidia-fatbinaryloader.so
W0831 07:56:43.444431 74570 nvc_info.c:354] missing compat32 library libnvidia-allocator.so
W0831 07:56:43.444435 74570 nvc_info.c:354] missing compat32 library libnvidia-compiler.so
W0831 07:56:43.444440 74570 nvc_info.c:354] missing compat32 library libnvidia-ngx.so
W0831 07:56:43.444445 74570 nvc_info.c:354] missing compat32 library libvdpau_nvidia.so
W0831 07:56:43.444449 74570 nvc_info.c:354] missing compat32 library libnvidia-encode.so
W0831 07:56:43.444454 74570 nvc_info.c:354] missing compat32 library libnvidia-opticalflow.so
W0831 07:56:43.444459 74570 nvc_info.c:354] missing compat32 library libnvcuvid.so
W0831 07:56:43.444463 74570 nvc_info.c:354] missing compat32 library libnvidia-eglcore.so
W0831 07:56:43.444468 74570 nvc_info.c:354] missing compat32 library libnvidia-glcore.so
W0831 07:56:43.444473 74570 nvc_info.c:354] missing compat32 library libnvidia-tls.so
W0831 07:56:43.444477 74570 nvc_info.c:354] missing compat32 library libnvidia-glsi.so
W0831 07:56:43.444482 74570 nvc_info.c:354] missing compat32 library libnvidia-fbc.so
W0831 07:56:43.444487 74570 nvc_info.c:354] missing compat32 library libnvidia-ifr.so
W0831 07:56:43.444491 74570 nvc_info.c:354] missing compat32 library libnvidia-rtcore.so
W0831 07:56:43.444496 74570 nvc_info.c:354] missing compat32 library libnvoptix.so
W0831 07:56:43.444501 74570 nvc_info.c:354] missing compat32 library libGLX_nvidia.so
W0831 07:56:43.444505 74570 nvc_info.c:354] missing compat32 library libEGL_nvidia.so
W0831 07:56:43.444510 74570 nvc_info.c:354] missing compat32 library libGLESv2_nvidia.so
W0831 07:56:43.444518 74570 nvc_info.c:354] missing compat32 library libGLESv1_CM_nvidia.so
W0831 07:56:43.444523 74570 nvc_info.c:354] missing compat32 library libnvidia-glvkspirv.so
W0831 07:56:43.444528 74570 nvc_info.c:354] missing compat32 library libnvidia-cbl.so
I0831 07:56:43.444737 74570 nvc_info.c:276] selecting /usr/bin/nvidia-smi
I0831 07:56:43.444750 74570 nvc_info.c:276] selecting /usr/bin/nvidia-debugdump
I0831 07:56:43.444764 74570 nvc_info.c:276] selecting /usr/bin/nvidia-persistenced
I0831 07:56:43.444783 74570 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-control
I0831 07:56:43.444796 74570 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-server
W0831 07:56:43.444862 74570 nvc_info.c:376] missing binary nv-fabricmanager
I0831 07:56:43.444880 74570 nvc_info.c:438] listing device /dev/nvidiactl
I0831 07:56:43.444885 74570 nvc_info.c:438] listing device /dev/nvidia-uvm
I0831 07:56:43.444889 74570 nvc_info.c:438] listing device /dev/nvidia-uvm-tools
I0831 07:56:43.444894 74570 nvc_info.c:438] listing device /dev/nvidia-modeset
I0831 07:56:43.444913 74570 nvc_info.c:317] listing ipc /run/nvidia-persistenced/socket
W0831 07:56:43.444929 74570 nvc_info.c:321] missing ipc /var/run/nvidia-fabricmanager/socket
W0831 07:56:43.444940 74570 nvc_info.c:321] missing ipc /tmp/nvidia-mps
I0831 07:56:43.444946 74570 nvc_info.c:733] requesting device information with ''
I0831 07:56:43.450627 74570 nvc_info.c:623] listing device /dev/nvidia0 (GPU-948211b6-df7a-5768-ca7b-a84e23d9404d at 00000000:01:00.0)
I0831 07:56:43.450668 74570 nvc_mount.c:344] mounting tmpfs at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/proc/driver/nvidia
I0831 07:56:43.451065 74570 nvc_mount.c:112] mounting /usr/bin/nvidia-smi at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/usr/bin/nvidia-smi
I0831 07:56:43.451106 74570 nvc_mount.c:112] mounting /usr/bin/nvidia-debugdump at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/usr/bin/nvidia-debugdump
I0831 07:56:43.451141 74570 nvc_mount.c:112] mounting /usr/bin/nvidia-persistenced at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/usr/bin/nvidia-persistenced
I0831 07:56:43.451248 74570 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.73.01 at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.73.01
I0831 07:56:43.451289 74570 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.73.01 at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.73.01
I0831 07:56:43.451380 74570 nvc_mount.c:239] mounting /run/nvidia-persistenced/socket at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/run/nvidia-persistenced/socket
I0831 07:56:43.451422 74570 nvc_mount.c:208] mounting /dev/nvidiactl at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/dev/nvidiactl
I0831 07:56:43.451446 74570 nvc_mount.c:499] whitelisting device node 195:255
I0831 07:56:43.451485 74570 nvc_mount.c:208] mounting /dev/nvidia0 at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/dev/nvidia0
I0831 07:56:43.451536 74570 nvc_mount.c:412] mounting /proc/driver/nvidia/gpus/0000:01:00.0 at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged/proc/driver/nvidia/gpus/0000:01:00.0
I0831 07:56:43.451556 74570 nvc_mount.c:499] whitelisting device node 195:0
I0831 07:56:43.451569 74570 nvc_ldcache.c:360] executing /sbin/ldconfig.real from host at /home/david/docker/overlay2/b96fdef24518f261691b4c6883d494ed8c09efe61fb42e9195f2abbc78900122/merged
I0831 07:56:43.478246 74570 nvc.c:423] shutting down library context
I0831 07:56:43.478873 74575 driver.c:163] terminating driver service
I0831 07:56:43.479198 74570 driver.c:203] driver service terminated successfully

davidho27941 avatar Aug 31 '21 08:08 davidho27941

@davidho27941 I see from your description that you are installing version 1.0.0-beta4 of the device plugin:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml

The versioning of the NVIDIA Device plugin is inconsistent in that v0.9.0 is the latest release: https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.9.0

Could you see whether using this (or one of the more recent releases) addresses your issue?

Fixed, I updated the image version to 1.0.0-beta4 and that solved the problem. Thx.

Mr-Linus avatar Dec 19 '21 07:12 Mr-Linus

@Mr-Linus note that 1.0.0-beta4 is not supported and v0.10.0 is the latest release. If you are experiencing problems with this release we should try to determine why this is.
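
For reference, a minimal sketch of deploying a supported release via Helm (repository URL, chart name, and version are assumed from the project's Helm instructions at the time and may need adjusting):

# Hedged sketch: install or upgrade the device plugin with Helm.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --version 0.10.0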

elezar avatar Jan 10 '22 10:01 elezar

@Mr-Linus note that 1.0.0-beta4 is not supported and v0.10.0 is the latest release. If you are experiencing problems with this release we should try to determine why this is.

👌🏻 Switched to v0.10.0 and it works fine.

Mr-Linus avatar Jan 16 '22 11:01 Mr-Linus

Is there a way to run nvidia-container-runtime with io.containerd.runc.v2 instead of v1? I am getting the same error as the OP and have tried different versions of the k8s-device-plugin. The GPU on the host node works fine; nvidia-smi outputs its info.

luckyycode avatar May 17 '22 01:05 luckyycode

Is there a way to run nvidia-container-runtime with io.containerd.runc.v2 instead of v1? I am getting the same error as the OP and have tried different versions of the k8s-device-plugin. The GPU on the host node works fine; nvidia-smi outputs its info.

@luckyycode this seems like an issue unrelated to this thread.

Note that from the config in https://github.com/NVIDIA/k8s-device-plugin/issues/263#issuecomment-908247909 we see:

     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            SystemdCgroup = true
            BinaryName="/usr/bin/nvidia-container-runtime"

indicating the use of the v2 shim.
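
Separately, if the device-plugin pod itself is the one failing to load NVML, it is usually also worth making nvidia the default CRI runtime; a hedged sketch of the extra setting (the option name is assumed from containerd's CRI config v2 and is not quoted from that comment):

# Hedged sketch: in /etc/containerd/config.toml, alongside the runtimes.nvidia section quoted above, add
#   [plugins."io.containerd.grpc.v1.cri".containerd]
#     default_runtime_name = "nvidia"
# and then restart containerd:
sudo systemctl restart containerd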

It may be more useful to create a new ticket describing the behaviour that you see, including any relevant k8s or containerd information and logs.

elezar avatar May 17 '22 05:05 elezar

@davidho27941 were you able to resolve your original issue?

elezar avatar May 17 '22 05:05 elezar

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 28 '24 04:02 github-actions[bot]