gpu-operator
Parsing error when nvidia-container-toolkit loads the runtime configuration on an RKE2 cluster
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Flatcar 3033.3.8
- Kernel Version: 5.10.157
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd / v1.6.8-k3s1
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): RKE2 / v1.24.8+rke2r1
- GPU Operator Version: 22.9.1
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
RKE2 uses a template file (config.toml.tmpl) to generate the containerd configuration. In my situation, this file contains some template variables. When the nvidia-container-toolkit reads this configuration, it stops with a parsing error: "Error: unable to load config: (9, 21): unexpected token type in inline table: keys cannot contain { character"
3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
Deploy the gpu-operator using the Helm chart. The following configuration is used for the container toolkit:
env:
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
installDir: /run/nvidia/toolkit
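For completeness, this is roughly how those values are applied at install time (a minimal sketch; the release name, namespace, and chart repository below are assumptions, and the block above is assumed to sit under the chart's toolkit key in a values file):

# values.yaml wraps the block above under the toolkit key:
# toolkit:
#   env: [...]
#   installDir: /run/nvidia/toolkit
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  -f values.yaml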
If the config.toml.tmpl file contains templated variables, the container-toolkit stops with the above error message. If I replace the variables with plain TOML values in the config.toml.tmpl, the container-toolkit succeeds. Below is an extract of a tmpl file that triggers the error:
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "{{ .NodeConfig.Containerd.Opt }}"

[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = {{ .NodeConfig.SELinux }}
{{- if .DisableCgroup}}
  disable_cgroup = true
{{end}}
{{- if .IsRunningInUserNS }}
  disable_apparmor = true
  restrict_oom_score_adj = true
{{end}}
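For comparison, a rendered version of that extract, i.e. the kind of plain TOML the toolkit parses without complaint, would look roughly like this (the path and boolean values are illustrative assumptions, not taken from my nodes):

version = 2

[plugins."io.containerd.internal.v1.opt"]
  # illustrative value; RKE2 fills this in from its node configuration
  path = "/var/lib/rancher/rke2/agent/containerd"

[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  # illustrative value; depends on the node's SELinux setting
  enable_selinux = false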
4. Information to attach (optional if deemed irrelevant)
- [x] kubernetes pods status:
kubectl get pods -n OPERATOR_NAMESPACE
kubectl get pods -n gpu-operator
NAME                                       READY   STATUS             RESTARTS      AGE
nvidia-driver-daemonset-krrp5              1/1     Running            0             87m
nvidia-container-toolkit-daemonset-dn6kw   0/1     CrashLoopBackOff   9 (62m ago)   87m
- [ ] kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
- [ ] If a pod/ds is in an error state or pending state
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- [x] If a pod/ds is in an error state or pending state
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
kubectl logs nvidia-container-toolkit-daemonset-dn6kw -n gpu-operator
time="2024-02-21T08:41:55Z" level=info msg="Loading config: /runtime/config-dir/config.toml.tmpl"
time="2024-02-21T08:41:55Z" level=fatal msg="Error: unable to load config: (9, 21): unexpected token type in inline table: keys cannot contain { character"
time="2024-02-21T08:41:55Z" level=info msg="Shutting Down"
- [ ] Output from running
nvidia-smi
from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- [ ] containerd logs
journalctl -u containerd > containerd.log
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]
You're passing in a templatised config toml (/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl) as the custom containerd config. That will not work, as it is not a file with valid TOML syntax. You will have to evaluate that template file and pass the rendered config toml as input.
Thanks for the reply. I don't know the exact behaviour of RKE2 regarding the generation of the configuration, but since it uses the custom config, I guess the point is to have a single file instead of maintaining different files depending on the node configuration. Then, if I have to evaluate the template into a final config in order to use the nvidia-container-toolkit, I'll lose this customization capability, won't I? By the way, I wonder if this discussion should have been opened in the container-toolkit repo instead of the operator.
You're passing in a templatised config toml (/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl) as the custom containerd config. That will not work, as it is not a file with valid TOML syntax. You will have to evaluate that template file and pass the rendered config toml as input.
But in this guide: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-deployment-scenarios, they are passing a .toml.tmpl too. Which one is correct? And by the way, I am following that guide and setting the CONTAINERD variables the same as the author of this issue, but these variables are only updated in the ClusterPolicy, while the variables in the toolkit daemonset remain the same. Is this expected?
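For reference, this is roughly how I compared the two (a sketch; the ClusterPolicy and daemonset names below are the defaults from my install and may differ in yours):

# env as set on the ClusterPolicy toolkit spec
kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.toolkit.env}'
# env as actually present on the toolkit daemonset
kubectl get ds nvidia-container-toolkit-daemonset -n gpu-operator \
  -o jsonpath='{.spec.template.spec.containers[0].env}'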