Is there a configuration for using nvidia gpu in kubespray?

Open misupopo opened this issue 2 years ago • 2 comments

The nvidia container runtime does not look like it is working. It looks like kubespray has a configuration item for nvidia gpu, but is it possible to set something in an Ansible variable?

https://github.com/kubernetes-sigs/kubespray/blob/master/roles/kubernetes-apps/container_engine_accelerator/nvidia_gpu/tasks/main.yml

misupopo avatar Aug 19 '22 09:08 misupopo

Hi @misupopo, could you explain the issue in more detail?

The nvidia container runtime does not look like it is working.

What is the actual error message you are facing?

It looks like kubespray has a configuration item for nvidia gpu, but is it possible to set something in an Ansible variable?

What does something mean? What kind of configuration item do you need to specify?

oomichi avatar Sep 06 '22 02:09 oomichi

Hi, we tested this week, and the nvidia_gpu tasks are apparently outdated. They don't work.

Kubespray doesn't change the runtime plugin for docker, containerd, or any other runtime. GPU nodes must use the nvidia runtime plugin, which comes with the nvidia-container-toolkit package (for docker, the nvidia-docker2 package). The toolkit depends on nvidia-driver.

What kubespray needs to do:

  1. Install nvidia-driver
  2. Install nvidia-container-toolkit (for docker, nvidia-docker2)
  3. Deploy nvidia-k8s-device-plugin in the GPU nodes.

NVIDIA also has an operator that installs the packages, labels the nodes, and does more, but the solution above is simple and low-cost.

We are currently working on this and will submit a PR for it.

cc @Dentrax @developer-guy @necatican

eminaktas avatar Sep 14 '22 15:09 eminaktas

I'm not sure Kubespray is the best place to do that, especially once you get into dependencies on the NVIDIA drivers, and which drivers and setup should be used depending on whether you run bare metal, vGPU, or MIG. What I have done is install components separately from Kubespray as much as possible.

Installing the container toolkit is just a yum repo and some packages: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#step-2-install-nvidia-container-toolkit
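
A minimal sketch of that step as an Ansible play run before Kubespray, assuming the NVIDIA package repository has already been configured on the hosts as described in the guide above and that the driver is already installed. The `gpu_nodes` group name and the play itself are hypothetical and not part of Kubespray; `nvidia-container-toolkit` is the package name mentioned earlier in this thread.

```yaml
# Hypothetical pre-Kubespray play. Assumptions:
#   - GPU hosts are in an inventory group named "gpu_nodes" (not a Kubespray group)
#   - the NVIDIA package repository is already set up on those hosts
#   - the NVIDIA driver is already installed
- name: Prepare GPU nodes before running Kubespray
  hosts: gpu_nodes
  become: true
  tasks:
    - name: Install the NVIDIA container toolkit from the configured repo
      ansible.builtin.package:
        name: nvidia-container-toolkit
        state: present
```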

The part that needs to integrate with Kubespray is the containerd config, which goes in your inventory:

is_gpu_node:  # insert True/False or some node-dependent logic 
containerd_runtimes_nvidia:
  - name: nvidia
    type: "io.containerd.runc.v1"
    engine: ""
    root: ""
    options:
      systemdCgroup: "true"
      BinaryName: "\"/usr/bin/nvidia-container-runtime\""

containerd_additional_runtimes: "{{ containerd_runtimes_nvidia if is_gpu_node else [] }}"

# TODO FIXME simplify this after https://github.com/kubernetes-sigs/kubespray/pull/9026
# For nvidia device plugin, see https://github.com/NVIDIA/k8s-device-plugin#configure-containerd
containerd_default_runtime: "{{ 'nvidia' if is_gpu_node else 'runc' }}"
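
One way to fill in `is_gpu_node` is with group variables, sketched here under the assumption that GPU hosts are collected in an inventory group called `gpu_nodes` (a hypothetical name, as are the file paths in the comments). Ansible gives a specific group's variables precedence over those of `all`, so GPU hosts pick up `true` and every other host falls back to `false`.

```yaml
# File 1 (hypothetical path): inventory/mycluster/group_vars/all/gpu.yml
# Default for every host in the inventory.
is_gpu_node: false
---
# File 2 (hypothetical path): inventory/mycluster/group_vars/gpu_nodes.yml
# Overrides the default for hosts in the assumed "gpu_nodes" group.
is_gpu_node: true
```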

Then you just run Kubespray, and afterwards we add the nvidia-device-plugin helm chart (with GFD enabled) on top.
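
Once the device plugin is running, a pod that requests the `nvidia.com/gpu` resource can serve as a quick check that the runtime and the plugin are wired together. This is only a sketch: the pod and container names are made up, and the sample image tag is an assumption based on NVIDIA's device plugin documentation, so substitute any CUDA image available to you.

```yaml
# Hypothetical smoke-test pod; names and image tag are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-vectoradd
      # Assumed sample image; replace with any CUDA workload image.
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          # Extended resource advertised by the NVIDIA device plugin.
          nvidia.com/gpu: 1
```

If the pod is scheduled and runs to completion, a GPU was allocated to it. If it stays Pending, check whether the node lists `nvidia.com/gpu` among its allocatable resources.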

I would suggest that any app installations on top of Kubespray be done via helm charts, based on https://github.com/kubernetes-sigs/kubespray/pull/8347, instead of static YAML manifests.

rptaylor avatar Sep 26 '22 23:09 rptaylor

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 26 '22 00:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 25 '23 01:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Feb 24 '23 01:02 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 24 '23 01:02 k8s-ci-robot