gpu-operator
Parsing error when nvidia-container-toolkit loads the runtime configuration on an RKE2 cluster
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Flatcar 3033.3.8
- Kernel Version: 5.10.157
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd / v1.6.8-k3s1
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): RKE2 / v1.24.8+rke2r1
- GPU Operator Version: 22.9.1
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
RKE2 uses a template file (config.toml.tmpl) to generate the containerd configuration. In my situation, this file contains some template variables. When the nvidia-container-toolkit reads this configuration, it stops with a parsing error: "Error: unable to load config: (9, 21): unexpected token type in inline table: keys cannot contain { character"
3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
Deploy the gpu-operator using the Helm chart. The following configuration is used for the container toolkit:
env:
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
installDir: /run/nvidia/toolkit
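For completeness, this is roughly how those values are applied at install time (a minimal sketch; the release name, namespace, and chart repository below are assumptions, and the block above is assumed to sit under the chart's toolkit key in a values file):

# values.yaml wraps the block above under the toolkit key:
# toolkit:
#   env: [...]
#   installDir: /run/nvidia/toolkit
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  -f values.yaml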
If the config.toml.tmpl file contains templated variables, the container-toolkit stops with the above error message. If I replace the variables with plain TOML values in the config.toml.tmpl, the container-toolkit succeeds. Below is an extract of a tmpl file that triggers the error:
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "{{ .NodeConfig.Containerd.Opt }}"

[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = {{ .NodeConfig.SELinux }}
{{- if .DisableCgroup}}
  disable_cgroup = true
{{end}}
{{- if .IsRunningInUserNS }}
  disable_apparmor = true
  restrict_oom_score_adj = true
{{end}}
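For comparison, a rendered version of that extract, i.e. the kind of plain TOML the toolkit parses without complaint, would look roughly like this (the path and boolean values are illustrative assumptions, not taken from my nodes):

version = 2

[plugins."io.containerd.internal.v1.opt"]
  # illustrative value; RKE2 fills this in from its node configuration
  path = "/var/lib/rancher/rke2/agent/containerd"

[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  # illustrative value; depends on the node's SELinux setting
  enable_selinux = false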
4. Information to attach (optional if deemed irrelevant)
- [x] kubernetes pods status:
kubectl get pods -n OPERATOR_NAMESPACE
kubectl get pods -n gpu-operator
NAME                                       READY   STATUS             RESTARTS      AGE
nvidia-driver-daemonset-krrp5              1/1     Running            0             87m
nvidia-container-toolkit-daemonset-dn6kw   0/1     CrashLoopBackOff   9 (62m ago)   87m
- [ ] kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
- [ ] If a pod/ds is in an error state or pending state
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- [x] If a pod/ds is in an error state or pending state
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
kubectl logs nvidia-container-toolkit-daemonset-dn6kw -n gpu-operator
time="2024-02-21T08:41:55Z" level=info msg="Loading config: /runtime/config-dir/config.toml.tmpl"
time="2024-02-21T08:41:55Z" level=fatal msg="Error: unable to load config: (9, 21): unexpected token type in inline table: keys cannot contain { character"
time="2024-02-21T08:41:55Z" level=info msg="Shutting Down"
- [ ] Output from running
nvidia-smi
from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- [ ] containerd logs
journalctl -u containerd > containerd.log
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]
You're passing in a templatised config toml (/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl) as the custom containerd config. That will not work, as it is not a file with valid TOML syntax. You will have to evaluate that template file and pass the rendered config toml as input.
Thanks for the reply. I don't know the exact behaviour of RKE2 regarding the generation of the configuration, but since it uses the custom config, I guess the point is to have a single file instead of maintaining different files depending on the node configuration. Then, if I have to evaluate the template into a final config in order to use the nvidia-container-toolkit, I'll lose this customization capability, won't I? By the way, I wonder if this discussion should have been opened in the container-toolkit repo instead of the operator.
You're passing in a templatised config toml (/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl) as the custom containerd config. That will not work, as it is not a file with valid TOML syntax. You will have to evaluate that template file and pass the rendered config toml as input.
But in this guide: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-deployment-scenarios, they are passing a .toml.tmpl too. Which one is correct? And by the way, I am following that guide and setting the CONTAINERD variables the same as the author of this issue, but these variables are only updated in the ClusterPolicy, while the variables in the toolkit daemonset remain the same. Is this expected?
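For reference, this is roughly how I compared the two (a sketch; the ClusterPolicy and daemonset names below are the defaults from my install and may differ in yours):

# env as set on the ClusterPolicy toolkit spec
kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.toolkit.env}'
# env as actually present on the toolkit daemonset
kubectl get ds nvidia-container-toolkit-daemonset -n gpu-operator \
  -o jsonpath='{.spec.template.spec.containers[0].env}'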