Is there a configuration for using nvidia gpu in kubespray?
The NVIDIA container runtime does not seem to be working. It looks like kubespray has a configuration item for NVIDIA GPUs, but is it possible to set something via an Ansible variable?
https://github.com/kubernetes-sigs/kubespray/blob/master/roles/kubernetes-apps/container_engine_accelerator/nvidia_gpu/tasks/main.yml
Hi @misupopo, could you explain the issue in more detail?
The NVIDIA container runtime does not seem to be working.
What is the actual error message you are facing?
It looks like kubespray has a configuration item for NVIDIA GPUs, but is it possible to set something via an Ansible variable?
What does "something" mean? What kind of configuration item do you need to specify?
Hi, we tested this week, and apparently the nvidia_gpu tasks are outdated. They don't work.
Kubespray doesn't change the runtime plugin for Docker, containerd, or any other runtime. GPU nodes must use the nvidia runtime plugin, which comes with the nvidia-container-toolkit package (for Docker, the nvidia-docker2 package). The toolkit depends on nvidia-driver.
What kubespray needs to do on the GPU nodes:
1 - Install nvidia-driver
2 - Install nvidia-container-toolkit (for Docker, nvidia-docker2)
3 - Deploy the nvidia-k8s-device-plugin
A rough sketch of these steps, done by hand, is shown below.
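For reference, a minimal sketch of those steps done manually on a GPU node (the distribution, driver series, and repository setup below are assumptions, not what Kubespray would do; the NVIDIA install guides are authoritative):

# 1 - Install the NVIDIA driver (Ubuntu shown; the driver series is only an example)
sudo apt-get update && sudo apt-get install -y nvidia-driver-535

# 2 - Install nvidia-container-toolkit and register the nvidia runtime with containerd
#     (assumes the NVIDIA container toolkit package repository is already configured)
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

# 3 - Deploy the NVIDIA device plugin, e.g. via its Helm chart (see the chart example further down)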
However, NVIDIA has an operator (the GPU Operator) that installs the packages, labels the nodes, and does more. The above solution is simple and low-cost.
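For comparison, the operator route is usually a single Helm install (the chart and repo are as published by NVIDIA; the release name and namespace below are arbitrary choices):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace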
We are currently working on this and will submit a PR for it.
cc @Dentrax @developer-guy @necatican
I'm not sure Kubespray is the best place to do that, especially once you get into dependencies on the NVIDIA drivers and which driver and setup should be used depending on whether you run bare metal, vGPU, or MIG. What I have done is install the components separately from Kubespray as much as possible.
Installing the container toolkit is just a yum repo and a few packages: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#step-2-install-nvidia-container-toolkit
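Roughly, following that guide (the repository URL and steps reflect the guide at the time and may have changed; treat this as a sketch for RHEL/CentOS-family nodes):

# Add the NVIDIA container toolkit yum repo, per the linked install guide
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.repo \
  | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

# Install the toolkit itself
sudo yum install -y nvidia-container-toolkit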
The part that needs to integrate with Kubespray is this in your inventory, for containerd config:
is_gpu_node: # insert True/False or some node-dependent logic
containerd_runtimes_nvidia:
  - name: nvidia
    type: "io.containerd.runc.v1"
    engine: ""
    root: ""
    options:
      systemdCgroup: "true"
      BinaryName: "\"/usr/bin/nvidia-container-runtime\""
containerd_additional_runtimes: "{{ containerd_runtimes_nvidia if is_gpu_node else [] }}"
# TODO FIXME simplify this after https://github.com/kubernetes-sigs/kubespray/pull/9026
# For nvidia device plugin, see https://github.com/NVIDIA/k8s-device-plugin#configure-containerd
containerd_default_runtime: "{{ 'nvidia' if is_gpu_node else 'runc' }}"
Then you just run kubespray, and we add the nvidia-device-plugin Helm chart (with GFD enabled) on top.
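As a sketch, that last step could look like this (gfd.enabled is a value of the nvidia-device-plugin chart; the release name and namespace are arbitrary):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --set gfd.enabled=true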
I would suggest any app installations on top of Kubespray should be done based on https://github.com/kubernetes-sigs/kubespray/pull/8347 via helm charts instead of static YAML manifests.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
/close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.