anaconda2196

18 comments by anaconda2196

Looks like this is a similar issue to https://github.com/NVIDIA/k8s-device-plugin/issues/192. @klueska

With mig-strategy=single, checked with both versions - v0.7.0 | v0.9.0. After upgrading drivers:

```
nvidia-smi
Thu Jul 15 16:40:54 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2...
```
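For reference, a quick way to confirm whether MIG mode actually took effect after a driver upgrade - a command-line sketch, assuming a MIG-capable driver such as the 460.x series:

```
# Query current and pending MIG mode; a pending value that differs from
# the current one usually means a GPU reset or reboot is still required.
nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv
```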

Hi @elezar I have enabled persistence mode on my A100 GPU node, but no luck - same error.

```
nvidia-smi -q

==============NVSMI LOG==============

Timestamp : Fri Jul 16 07:13:16 2021...
```
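For completeness, a sketch of enabling and verifying persistence mode; note that it only keeps the driver loaded between clients and does not by itself create any MIG devices:

```
# Enable persistence mode on all GPUs (requires root).
sudo nvidia-smi -pm 1

# Verify the setting per GPU.
nvidia-smi --query-gpu=index,persistence_mode --format=csv
```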

Hi @klueska @elezar, weird - inside the pod, if I run nvidia-smi, it shows no MIG devices.

```
nvidia-device-plugin-qffmb   1/1   Running   0   16s
Abhisheks-MacBook-Pro:~ abhishekacharya$ kubectl -n kube-system logs nvidia-device-plugin-qffmb...
```
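A minimal check from inside the pod, assuming the image bundles nvidia-smi; on a correctly configured node each MIG instance shows up as its own entry:

```
# List every GPU and MIG device visible to this container; with MIG
# enabled, each instance appears on its own line, e.g.
#   MIG 1g.5gb Device 0: (UUID: MIG-GPU-.../1/0)
nvidia-smi -L
```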

```
kubectl -n kube-system describe pod nvidia-device-plugin-qffmb
Name:                 nvidia-device-plugin-qffmb
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 xxxx--NODE-NAME--xxx
Start Time:           Fri, 16 Jul 2021 10:50:53 -0700
Labels:               app.kubernetes.io/instance=nvidia-device-plugin
                      app.kubernetes.io/name=nvidia-device-plugin...
```
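It can also help to confirm what the plugin actually registered with the kubelet; a sketch, with MY-GPU-NODE as a placeholder node name:

```
# Show the GPU resources the node advertises; with --mig-strategy=single,
# MIG instances are still exposed under the nvidia.com/gpu resource name.
kubectl describe node MY-GPU-NODE | grep 'nvidia.com'
```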

**values.yaml**

```
legacyDaemonsetAPI: false
compatWithCPUManager: false
migStrategy: none
failOnInitError: true
deviceListStrategy: envvar
deviceIDStrategy: uuid
nvidiaDriverRoot: "/"
nameOverride: ""
fullnameOverride: ""
selectorLabelsOverride: {}
namespace: kube-system
imagePullSecrets: []
image:
  repository: nvcr.io/nvidia/k8s-device-plugin
  pullPolicy:...
```
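For reference, these values are usually overridden at install time rather than by editing values.yaml; a sketch following the chart's README (repo URL and chart name per the k8s-device-plugin project, version shown only as an example):

```
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# Override migStrategy on the command line instead of in values.yaml.
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --version 0.9.0 \
  --set migStrategy=single
```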

```
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - MY-GPU-NODE
  containers:
  - args:
    - --mig-strategy=single
    - --pass-device-specs=true
    - --fail-on-init-error=true
    - --device-list-strategy=envvar
    - --device-id-strategy=uuid
    -...
```

```
root@nvidia-device-plugin-mr26n:/# echo $NVIDIA_VISIBLE_DEVICES
all
```
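Note that `all` inside the plugin pod itself is expected - the plugin has to see every device in order to enumerate them. What matters is the value injected into a workload pod; a sketch for checking that (pod name and CUDA image tag are hypothetical):

```
# Launch a throwaway pod that requests one GPU and prints the variable
# the plugin injects; with --device-id-strategy=uuid this should be a
# device UUID (or a MIG-GPU-... identifier), not "all".
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-env-test
spec:
  restartPolicy: Never
  containers:
  - name: test
    image: nvidia/cuda:11.2.2-base-ubuntu20.04
    command: ["sh", "-c", "echo $NVIDIA_VISIBLE_DEVICES"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Once the pod has completed, read the result.
kubectl logs gpu-env-test
```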

Hi @elezar @klueska Finally, I found the solution - https://docs.nvidia.com/datacenter/tesla/pdf/NVIDIA_MIG_User_Guide.pdf: "Toggling MIG mode requires the CAP_SYS_ADMIN capability. Other MIG management, such as creating and destroying instances, requires superuser by default,..."
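In line with that quote, the MIG setup steps need root; a command-line sketch (profile ID 19 corresponds to 1g.5gb on A100 per the MIG user guide, but verify with -lgip for your driver version):

```
# Enable MIG mode on GPU 0 (requires CAP_SYS_ADMIN / root); a GPU reset
# or reboot may be needed before the mode change takes effect.
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles available on this driver.
sudo nvidia-smi mig -lgip

# Example: carve the A100 into 7x 1g.5gb instances; -C also creates the
# matching compute instances.
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
```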

Yes, there is a little bit of confusion. I removed my A100 GPU machine from my setup and installed a fresh OS (CentOS 7.9) on it. Then I followed this...