
nvidia-smi killed after a while

sandrich opened this issue on Oct 14, 2021 · 12 comments

I run a rapidsai container with a Jupyter notebook. When I freshly start the container, everything is fine and I can run GPU workloads inside the notebook:

Thu Oct 14 09:58:37 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      On   | 00000000:13:00.0 Off |                   On |
| N/A   37C    P0    65W / 250W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Then, at some random point, the notebook kernel gets killed. When I then check nvidia-smi, it gets killed partway through its output:

nvidia-smi
Thu Oct 14 09:59:49 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
Killed

I am not sure how to debug this further or where it comes from.

Environment: OpenShift 4.7
GPU: NVIDIA A100 in MIG mode, using the MIG Manager
Operator: 1.7.1

ClusterPolicy

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  migManager:
    nodeSelector:
      nvidia.com/gpu.deploy.mig-manager: 'true'
    enabled: true
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/cloud-native
    env:
      - name: WITH_REBOOT
        value: 'true'
    securityContext: {}
    version: 'sha256:495ed3b42e0541590c537ab1b33bda772aad530d3ef6a4f9384d3741a59e2bf8'
    image: k8s-mig-manager
    tolerations: []
    priorityClassName: system-node-critical
  operator:
    defaultRuntime: crio
    initContainer:
      image: cuda
      imagePullSecrets: []
      repository: nexus.bisinfo.org:8088/nvidia
      version: 'sha256:ba39801ba34370d6444689a860790787ca89e38794a11952d89a379d2e9c87b5'
    deployGFD: true
  gfd:
    nodeSelector:
      nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia
    env:
      - name: GFD_SLEEP_INTERVAL
        value: 60s
      - name: FAIL_ON_INIT_ERROR
        value: 'true'
    securityContext: {}
    version: 'sha256:bfc39d23568458dfd50c0c5323b6d42bdcd038c420fb2a2becd513a3ed3be27f'
    image: gpu-feature-discovery
    tolerations: []
    priorityClassName: system-node-critical
  dcgmExporter:
    nodeSelector:
      nvidia.com/gpu.deploy.dcgm-exporter: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/k8s
    securityContext: {}
    version: 'sha256:8af02463a8b60b21202d0bf69bc1ee0bb12f684fa367f903d138df6cacc2d0ac'
    image: dcgm-exporter
    tolerations: []
    priorityClassName: system-node-critical
  driver:
    licensingConfig:
      configMapName: 'licensing-config'
    nodeSelector:
      nvidia.com/gpu.deploy.driver: 'true'
    enabled: true
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia
    securityContext: {}
    repoConfig:
      configMapName: repo-config
      destinationDir: "/etc/yum.repos.d"
    version: 'sha256:09ba3eca64a80fab010a9fcd647a2675260272a8c3eb515dfed6dc38a2d31ead'
    image: driver
    tolerations: []
    priorityClassName: system-node-critical
  devicePlugin:
    nodeSelector:
      nvidia.com/gpu.deploy.device-plugin: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia
    env:
      - name: PASS_DEVICE_SPECS
        value: 'true'
      - name: FAIL_ON_INIT_ERROR
        value: 'true'
      - name: DEVICE_LIST_STRATEGY
        value: envvar
      - name: DEVICE_ID_STRATEGY
        value: uuid
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
    securityContext: {}
    version: 'sha256:85def0197f388e5e336b1ab0dbec350816c40108a58af946baa1315f4c96ee05'
    image: k8s-device-plugin
    tolerations: []
    args: []
    priorityClassName: system-node-critical
  mig:
    strategy: single
  validator:
    nodeSelector:
      nvidia.com/gpu.deploy.operator-validator: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/cloud-native
    env:
      - name: WITH_WORKLOAD
        value: 'true'
    securityContext: {}
    version: 'sha256:2bb62b9ca89bf9ae26399eeeeaf920d7752e617fa070c1120bf800253f624a10'
    image: gpu-operator-validator
    tolerations: []
    priorityClassName: system-node-critical
  toolkit:
    nodeSelector:
      nvidia.com/gpu.deploy.container-toolkit: 'true'
    enabled: true
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/k8s
    securityContext: {}
    version: 1.5.0-ubi8
    image: container-toolkit
    tolerations: []
    priorityClassName: system-node-critical

Any idea how to debug where this issue comes from? Also, since we need CUDA 11.2 support, I suppose we cannot go with a newer toolkit image?

sandrich avatar Oct 14 '21 10:10 sandrich

Hi @sandrich. Thanks for reporting this. With regard to the toolkit version: it is independent of the CUDA version, which is determined by the driver installed on the system (in the case of the GPU Operator, most likely by the driver container).

@klueska I recall that due to the following runc bug we saw that long running containers would lose access to devices. Do you recall what our workaround was?

Update: the runc bug was triggered because the CPUManager issued an update of the container's CPU set every 10 seconds, irrespective of whether any changes were required. Our workaround was to patch the CPUManager to only issue an update when something had actually changed. The changes have been merged into upstream Kubernetes 1.22, but I am uncertain of the backport status.

elezar avatar Oct 14 '21 12:10 elezar

The heavy-duty workaround is to update to a version of Kubernetes that contains this patch: https://github.com/kubernetes/kubernetes/pull/101771

The lighter-weight workaround would be to make sure that your pod requests a set of exclusive CPUs as described here (even just one exclusive CPU would be sufficient): https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/

klueska avatar Oct 14 '21 12:10 klueska

@klueska so that means adding a resources section requesting at least 1 full core, like so?

resources:
  requests:
    cpu: 1

The following resources were set in the test deployment

resources:
  limits:
    cpu: "1"
    memory: 1000Mi
    nvidia.com/gpu: "1"
  requests:
    cpu: "1"
    memory: 1000Mi
    nvidia.com/gpu: "1"

sandrich avatar Oct 14 '21 12:10 sandrich

Yes, that is what I was suggesting. So you are seeing this error even with the setting above for CPU/memory? Is this the only container in the pod (no init containers or anything)?

klueska avatar Oct 14 '21 21:10 klueska

> Yes, that is what I was suggesting. So you are seeing this error even with the setting above for CPU/memory? Is this the only container in the pod (no init containers or anything)?

Exactly. The node has cpuManagerPolicy set to static

cat /etc/kubernetes/kubelet.conf | grep cpu
  "cpuManagerPolicy": "static",
  "cpuManagerReconcilePeriod": "5s",

And here the pod details

oc describe pod rapidsai-998589866-dkltb
Name:         rapidsai-998589866-dkltb
Namespace:    med-gpu-python-dev
Priority:     0
Node:         adchio1011.ocp-dev.opz.bisinfo.org/10.20.12.21
Start Time:   Fri, 15 Oct 2021 14:48:40 +0200
Labels:       app=rapidsai
              deployment=rapidsai
              pod-template-hash=998589866
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "100.70.4.26"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "100.70.4.26"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: restricted
Status:       Running
IP:           100.70.4.26
IPs:
  IP:           100.70.4.26
Controlled By:  ReplicaSet/rapidsai-998589866
Containers:
  rapidsai:
    Container ID:  cri-o://bbf668d97da94e3a8de9b8df79a6c65ce7fa0c61026e060ce56afbcfc08b862d
    Image:         quay.bisinfo.org/by003457/r2106_cuda112_base_cent8-py37:latest
    Image ID:      quay.bisinfo.org/by003457/r2106_cuda112_base_cent8-py37@sha256:10cc2b92ae96a6f402c0b9ad6901c00cd9b3d37b5040fd2ba8e6fc8b279bb06c
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/conda/envs/rapids/bin/jupyter-lab
      --allow-root
      --notebook-dir=/var/jupyter/notebook
      --ip=0.0.0.0
      --no-browser
      --NotebookApp.token=''
      --NotebookApp.allow_origin="*"
    State:          Running
      Started:      Fri, 15 Oct 2021 14:48:44 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:             1
      memory:          1000Mi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          1000Mi
      nvidia.com/gpu:  1
    Environment:
      HOME:  /tmp
    Mounts:
      /var/jupyter/notebook from jupyter-notebook (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6g9vj (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  jupyter-notebook:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  notebook
    ReadOnly:   false
  default-token-6g9vj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6g9vj
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s

sandrich avatar Oct 15 '21 13:10 sandrich

OK. Yeah, everything looks good from the perspective of the pod specs, etc.

I’m guessing you must be running into the runc bug then: https://github.com/opencontainers/runc/issues/2366#issue-609480075

And the only way to avoid that is to update to a version of runc that has a fix for this or update to a kubelet with this patch: https://github.com/kubernetes/kubernetes/pull/101771

I was thinking before that ensuring you were a guaranteed pod was enough to bypass this bug, but looking into it more, it’s not.
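
A quick way to confirm whether a container has actually hit that runc bug (a sketch, assuming cgroup v1 and that the container's own cgroup subtree is what is mounted at /sys/fs/cgroup, which is the runc default):

# Run inside the affected container once the failure starts.
# The devices-cgroup allowlist should still contain the NVIDIA character
# devices (major 195 for /dev/nvidia*); if those entries have disappeared,
# the periodic cpuset update has wiped the device permissions
# (opencontainers/runc#2366).
cat /sys/fs/cgroup/devices/devices.list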

klueska avatar Oct 15 '21 13:10 klueska

> OK. Yeah, everything looks good from the perspective of the pod specs, etc.
>
> I’m guessing you must be running into the runc bug then: opencontainers/runc#2366 (comment)
>
> And the only way to avoid that is to update to a version of runc that has a fix for this or update to a kubelet with this patch: kubernetes/kubernetes#101771
>
> I was thinking before that ensuring you were a guaranteed pod was enough to bypass this bug, but looking into it more, it’s not.

Hi, doesn't OpenShift use CRI-O rather than runc?

sandrich avatar Oct 19 '21 12:10 sandrich

We also see the following in the node's kernel logs:

[14136.622417] cuda-EvtHandlr invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), order=0, oom_score_adj=-997
[14136.622588] CPU: 1 PID: 711806 Comm: cuda-EvtHandlr Tainted: P           OE    --------- -  - 4.18.0-305.19.1.el8_4.x86_64 #1
[14136.622781] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.17369862.B64.2012240522 12/24/2020
[14136.622987] Call Trace:
[14136.623038]  dump_stack+0x5c/0x80
[14136.623103]  dump_header+0x4a/0x1db
[14136.623168]  oom_kill_process.cold.32+0xb/0x10
[14136.623252]  out_of_memory+0x1ab/0x4a0
[14136.623322]  mem_cgroup_out_of_memory+0xe8/0x100
[14136.623406]  try_charge+0x65a/0x690
[14136.623470]  mem_cgroup_charge+0xca/0x220
[14136.623543]  __add_to_page_cache_locked+0x368/0x3d0
[14136.623632]  ? scan_shadow_nodes+0x30/0x30
[14136.623706]  add_to_page_cache_lru+0x4a/0xc0
[14136.623784]  iomap_readpages_actor+0x103/0x230
[14136.623865]  iomap_apply+0xfb/0x330
[14136.623930]  ? iomap_ioend_try_merge+0xe0/0xe0
[14136.624010]  ? __blk_mq_run_hw_queue+0x51/0xd0
[14136.624092]  ? iomap_ioend_try_merge+0xe0/0xe0
[14136.624172]  iomap_readpages+0xa8/0x1e0
[14136.624242]  ? iomap_ioend_try_merge+0xe0/0xe0
[14136.624322]  read_pages+0x6b/0x190
[14136.624385]  __do_page_cache_readahead+0x1c1/0x1e0
[14136.624470]  filemap_fault+0x783/0xa20
[14136.624538]  ? __mod_memcg_lruvec_state+0x21/0x100
[14136.624625]  ? page_add_file_rmap+0xef/0x130
[14136.624702]  ? alloc_set_pte+0x21c/0x440
[14136.624779]  ? _cond_resched+0x15/0x30
[14136.624885]  __xfs_filemap_fault+0x6d/0x200 [xfs]
[14136.624971]  __do_fault+0x36/0xd0
[14136.625033]  __handle_mm_fault+0xa7a/0xca0
[14136.625108]  handle_mm_fault+0xc2/0x1d0
[14136.625178]  __do_page_fault+0x1ed/0x4c0
[14136.625249]  do_page_fault+0x37/0x130
[14136.625316]  ? page_fault+0x8/0x30
[14136.625379]  page_fault+0x1e/0x30
[14136.625440] RIP: 0033:0x7fbd5b2b00e0
[14136.625508] Code: Unable to access opcode bytes at RIP 0x7fbd5b2b00b6.

I wonder whether 16 GB of memory is not enough for the node serving the A100 card. It is a VM on VMware with direct passthrough; we are not using vGPU.
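
One way to narrow that down (a sketch, assuming cgroup v1 and that crictl and jq are available on the node): the mem_cgroup_out_of_memory frame in the trace above suggests the container's own memory limit was hit, rather than the node running out of RAM.

# On the node: compare the container's memory limit against its peak usage.
# <container-id> is the cri-o ID shown by `oc describe pod` above.
PID=$(crictl inspect <container-id> | jq -r '.info.pid')
MEMCG=$(grep ':memory:' /proc/${PID}/cgroup | cut -d: -f3)
cat /sys/fs/cgroup/memory${MEMCG}/memory.limit_in_bytes      # the 1000Mi limit, in bytes
cat /sys/fs/cgroup/memory${MEMCG}/memory.max_usage_in_bytes  # peak usage; at the limit means a cgroup OOM kill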

sandrich avatar Oct 19 '21 13:10 sandrich

@sandrich did you try it with increased memory assigned to the VM?

shivamerla avatar Oct 25 '21 19:10 shivamerla

@shivamerla I did, but that did not change anything. What did make a difference was adding more memory to the container.
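
For reference, that change amounts to a larger memory request/limit on the container, e.g. something like the following (the 4Gi figure is illustrative, not the exact value used):

resources:
  limits:
    cpu: "1"
    memory: 4Gi            # illustrative; size to the notebook workload
    nvidia.com/gpu: "1"
  requests:
    cpu: "1"
    memory: 4Gi
    nvidia.com/gpu: "1"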

sandrich avatar Oct 25 '21 20:10 sandrich

@sandrich can you check whether the following settings are enabled on your VM:

pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB=128

shivamerla avatar Nov 24 '21 01:11 shivamerla

Yes, this one is set:

(screenshot of the VM settings attached)

sandrich avatar Nov 24 '21 12:11 sandrich