gpu-operator
nvidia-smi killed after a while
I am running a rapidsai container with a Jupyter notebook. When I freshly start the container, everything is fine and I can run GPU workloads inside the notebook.
Thu Oct 14 09:58:37 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB On | 00000000:13:00.0 Off | On |
| N/A 37C P0 65W / 250W | N/A | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 7 0 0 | 1MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Then, at some random point, the notebook kernel gets killed. When I then run nvidia-smi, it crashes:
nvidia-smi
Thu Oct 14 09:59:49 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
Killed
I am not sure how to further debug this issue or where it comes from.
Environment: OpenShift 4.7
GPU: NVIDIA A100, MIG mode using the MIG manager
Operator: 1.7.1
ClusterPolicy
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  migManager:
    nodeSelector:
      nvidia.com/gpu.deploy.mig-manager: 'true'
    enabled: true
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/cloud-native
    env:
      - name: WITH_REBOOT
        value: 'true'
    securityContext: {}
    version: 'sha256:495ed3b42e0541590c537ab1b33bda772aad530d3ef6a4f9384d3741a59e2bf8'
    image: k8s-mig-manager
    tolerations: []
    priorityClassName: system-node-critical
  operator:
    defaultRuntime: crio
    initContainer:
      image: cuda
      imagePullSecrets: []
      repository: nexus.bisinfo.org:8088/nvidia
      version: 'sha256:ba39801ba34370d6444689a860790787ca89e38794a11952d89a379d2e9c87b5'
    deployGFD: true
  gfd:
    nodeSelector:
      nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia
    env:
      - name: GFD_SLEEP_INTERVAL
        value: 60s
      - name: FAIL_ON_INIT_ERROR
        value: 'true'
    securityContext: {}
    version: 'sha256:bfc39d23568458dfd50c0c5323b6d42bdcd038c420fb2a2becd513a3ed3be27f'
    image: gpu-feature-discovery
    tolerations: []
    priorityClassName: system-node-critical
  dcgmExporter:
    nodeSelector:
      nvidia.com/gpu.deploy.dcgm-exporter: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/k8s
    securityContext: {}
    version: 'sha256:8af02463a8b60b21202d0bf69bc1ee0bb12f684fa367f903d138df6cacc2d0ac'
    image: dcgm-exporter
    tolerations: []
    priorityClassName: system-node-critical
  driver:
    licensingConfig:
      configMapName: 'licensing-config'
    nodeSelector:
      nvidia.com/gpu.deploy.driver: 'true'
    enabled: true
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia
    securityContext: {}
    repoConfig:
      configMapName: repo-config
      destinationDir: "/etc/yum.repos.d"
    version: 'sha256:09ba3eca64a80fab010a9fcd647a2675260272a8c3eb515dfed6dc38a2d31ead'
    image: driver
    tolerations: []
    priorityClassName: system-node-critical
  devicePlugin:
    nodeSelector:
      nvidia.com/gpu.deploy.device-plugin: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia
    env:
      - name: PASS_DEVICE_SPECS
        value: 'true'
      - name: FAIL_ON_INIT_ERROR
        value: 'true'
      - name: DEVICE_LIST_STRATEGY
        value: envvar
      - name: DEVICE_ID_STRATEGY
        value: uuid
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
    securityContext: {}
    version: 'sha256:85def0197f388e5e336b1ab0dbec350816c40108a58af946baa1315f4c96ee05'
    image: k8s-device-plugin
    tolerations: []
    args: []
    priorityClassName: system-node-critical
  mig:
    strategy: single
  validator:
    nodeSelector:
      nvidia.com/gpu.deploy.operator-validator: 'true'
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/cloud-native
    env:
      - name: WITH_WORKLOAD
        value: 'true'
    securityContext: {}
    version: 'sha256:2bb62b9ca89bf9ae26399eeeeaf920d7752e617fa070c1120bf800253f624a10'
    image: gpu-operator-validator
    tolerations: []
    priorityClassName: system-node-critical
  toolkit:
    nodeSelector:
      nvidia.com/gpu.deploy.container-toolkit: 'true'
    enabled: true
    imagePullSecrets: []
    resources: {}
    affinity: {}
    podSecurityContext: {}
    repository: nexus.bisinfo.org:8088/nvidia/k8s
    securityContext: {}
    version: 1.5.0-ubi8
    image: container-toolkit
    tolerations: []
    priorityClassName: system-node-critical
Any idea how to debug where this issue comes from? Also, we need CUDA 11.2 support, so I suppose we cannot go with a newer toolkit image?
Hi @sandrich. Thanks for reporting this. With regard to the toolkit version: it is independent of the CUDA version, which is determined by the driver installed on the system (in the case of the GPU Operator, most likely by the driver container).
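For example, the "CUDA Version: 11.2" shown in the nvidia-smi header above is reported by the 460.73.01 driver itself, not by the toolkit image. If you want to double-check which driver a pod actually sees, a standard query like the following should work (a sketch):

nvidia-smi --query-gpu=name,driver_version --format=csv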
@klueska I recall that, due to the runc bug referenced below, we saw that long-running containers would lose access to devices. Do you recall what our workaround was?
Update: The runc bug was triggered by CPUManager issuing an update command for the container's CPU set every 10s, irrespective of whether any changes were required. Our workaround was to patch CPUManager to only issue an update if something had changed. The changes have been merged into upstream 1.22, but I am uncertain of the backport status.
The heavy-duty workaround is to update to a version of Kubernetes that contains this patch: https://github.com/kubernetes/kubernetes/pull/101771
The lighter-weight workaround would be to make sure that your pod requests a set of exclusive CPUs as described here (even just one exclusive CPU would be sufficient): https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
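For example, a container resources section along these lines should do it (a sketch with illustrative values; for the static CPU manager to pin CPUs, the pod needs Guaranteed QoS, i.e. an integer CPU count with requests equal to limits):

resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "1"
    memory: 2Gi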
@klueska so that means adding a requests section with at least 1 full core, like so?
resources:
  requests:
    cpu: 1
The following resources were set in the test deployment:
resources:
  limits:
    cpu: "1"
    memory: 1000Mi
    nvidia.com/gpu: "1"
  requests:
    cpu: "1"
    memory: 1000Mi
    nvidia.com/gpu: "1"
Yes, that is what I was suggesting. So you are seeing this error even with the setting above for CPU/memory? Is this the only container in the pod (no init containers or anything)?
Exactly. The node has cpuManagerPolicy set to static
cat /etc/kubernetes/kubelet.conf | grep cpu
"cpuManagerPolicy": "static",
"cpuManagerReconcilePeriod": "5s",
And here are the pod details:
oc describe pod rapidsai-998589866-dkltb
Name: rapidsai-998589866-dkltb
Namespace: med-gpu-python-dev
Priority: 0
Node: adchio1011.ocp-dev.opz.bisinfo.org/10.20.12.21
Start Time: Fri, 15 Oct 2021 14:48:40 +0200
Labels: app=rapidsai
deployment=rapidsai
pod-template-hash=998589866
Annotations: k8s.v1.cni.cncf.io/network-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"100.70.4.26"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"100.70.4.26"
],
"default": true,
"dns": {}
}]
openshift.io/scc: restricted
Status: Running
IP: 100.70.4.26
IPs:
IP: 100.70.4.26
Controlled By: ReplicaSet/rapidsai-998589866
Containers:
rapidsai:
Container ID: cri-o://bbf668d97da94e3a8de9b8df79a6c65ce7fa0c61026e060ce56afbcfc08b862d
Image: quay.bisinfo.org/by003457/r2106_cuda112_base_cent8-py37:latest
Image ID: quay.bisinfo.org/by003457/r2106_cuda112_base_cent8-py37@sha256:10cc2b92ae96a6f402c0b9ad6901c00cd9b3d37b5040fd2ba8e6fc8b279bb06c
Port: <none>
Host Port: <none>
Command:
/opt/conda/envs/rapids/bin/jupyter-lab
--allow-root
--notebook-dir=/var/jupyter/notebook
--ip=0.0.0.0
--no-browser
--NotebookApp.token=''
--NotebookApp.allow_origin="*"
State: Running
Started: Fri, 15 Oct 2021 14:48:44 +0200
Ready: True
Restart Count: 0
Limits:
cpu: 1
memory: 1000Mi
nvidia.com/gpu: 1
Requests:
cpu: 1
memory: 1000Mi
nvidia.com/gpu: 1
Environment:
HOME: /tmp
Mounts:
/var/jupyter/notebook from jupyter-notebook (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-6g9vj (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
jupyter-notebook:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: notebook
ReadOnly: false
default-token-6g9vj:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-6g9vj
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
OK. Yeah, everything looks good from the perspective of the pod specs, etc.
I’m guessing you must be running into the runc bug then: https://github.com/opencontainers/runc/issues/2366#issue-609480075
And the only way to avoid that is to update to a version of runc that has a fix for this or update to a kubelet with this patch: https://github.com/kubernetes/kubernetes/pull/101771
I was thinking before that ensuring you were a guaranteed pod was enough to bypass this bug, but looking into it more, it’s not.
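Two checks that may help confirm it (a sketch, assuming cgroup v1 on the node and the default kubelet data directory): look at the static CPU manager's per-container assignments on the node, and see whether the container's device cgroup allow-list still contains the NVIDIA character devices (major number 195) once nvidia-smi starts failing.

# on the node: CPU assignments made by the static CPU manager
cat /var/lib/kubelet/cpu_manager_state

# inside the affected container: allowed devices (NVIDIA devices use major 195)
grep '195:' /sys/fs/cgroup/devices/devices.list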
Hi, doesn't OpenShift use CRI-O rather than runc?
Also, we see the following in the node logs:
[14136.622417] cuda-EvtHandlr invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), order=0, oom_score_adj=-997
[14136.622588] CPU: 1 PID: 711806 Comm: cuda-EvtHandlr Tainted: P OE --------- - - 4.18.0-305.19.1.el8_4.x86_64 #1
[14136.622781] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.17369862.B64.2012240522 12/24/2020
[14136.622987] Call Trace:
[14136.623038] dump_stack+0x5c/0x80
[14136.623103] dump_header+0x4a/0x1db
[14136.623168] oom_kill_process.cold.32+0xb/0x10
[14136.623252] out_of_memory+0x1ab/0x4a0
[14136.623322] mem_cgroup_out_of_memory+0xe8/0x100
[14136.623406] try_charge+0x65a/0x690
[14136.623470] mem_cgroup_charge+0xca/0x220
[14136.623543] __add_to_page_cache_locked+0x368/0x3d0
[14136.623632] ? scan_shadow_nodes+0x30/0x30
[14136.623706] add_to_page_cache_lru+0x4a/0xc0
[14136.623784] iomap_readpages_actor+0x103/0x230
[14136.623865] iomap_apply+0xfb/0x330
[14136.623930] ? iomap_ioend_try_merge+0xe0/0xe0
[14136.624010] ? __blk_mq_run_hw_queue+0x51/0xd0
[14136.624092] ? iomap_ioend_try_merge+0xe0/0xe0
[14136.624172] iomap_readpages+0xa8/0x1e0
[14136.624242] ? iomap_ioend_try_merge+0xe0/0xe0
[14136.624322] read_pages+0x6b/0x190
[14136.624385] __do_page_cache_readahead+0x1c1/0x1e0
[14136.624470] filemap_fault+0x783/0xa20
[14136.624538] ? __mod_memcg_lruvec_state+0x21/0x100
[14136.624625] ? page_add_file_rmap+0xef/0x130
[14136.624702] ? alloc_set_pte+0x21c/0x440
[14136.624779] ? _cond_resched+0x15/0x30
[14136.624885] __xfs_filemap_fault+0x6d/0x200 [xfs]
[14136.624971] __do_fault+0x36/0xd0
[14136.625033] __handle_mm_fault+0xa7a/0xca0
[14136.625108] handle_mm_fault+0xc2/0x1d0
[14136.625178] __do_page_fault+0x1ed/0x4c0
[14136.625249] do_page_fault+0x37/0x130
[14136.625316] ? page_fault+0x8/0x30
[14136.625379] page_fault+0x1e/0x30
[14136.625440] RIP: 0033:0x7fbd5b2b00e0
[14136.625508] Code: Unable to access opcode bytes at RIP 0x7fbd5b2b00b6.
I wonder if 16 GB of memory is not enough for the node that is serving the A100 card. It is a VM on VMware with direct passthrough; we are not using vGPU.
@sandrich did you try it with increased memory allocated to the VM?
@shivamerla I did, which did not change anything. What did help was adding more memory to the container.
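That would also be consistent with the mem_cgroup_out_of_memory frames in the trace above, which point at a cgroup (container) memory limit being hit rather than the node running out of memory. Concretely, the change amounts to raising the container's memory limit, roughly like this (the 4Gi value is illustrative):

resources:
  limits:
    memory: 4Gi
  requests:
    memory: 4Gi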
@sandrich can you check whether the settings below are enabled on your VM:
pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB=128
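For reference, these are VM advanced configuration parameters; in the VM's .vmx file (or via the advanced configuration parameters in vSphere) they would look roughly like this (a sketch, using the 128 GB MMIO size suggested above):

pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "128"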
Yes this one is set