gpu-operator

Fedora CoreOS official support in all components

Open dfateyev opened this issue 10 months ago • 4 comments

As stated in the official documentation, there is currently no support for recent Fedora CoreOS-based workers in Kubernetes. No official GPU driver images are published, and there are no official recommendations on how to deploy the GPU operator to Kubernetes with Fedora CoreOS hosts.

We currently run Kubernetes in OpenStack (it features Fedora CoreOS and containerd). To use the GPU operator functionality, we have to rely on various hacks and workarounds, along with a custom GPU driver image: we run the GPU driver image and the Toolkit on the nodes separately, outside of Kubernetes, and then deploy the GPU operator in Kubernetes with the already present features disabled. This deployment approach is pretty cumbersome.

We are interested in official Fedora CoreOS support both in the operator and in the GPU driver.
In the ideal scenario, we would like to install the GPU operator and have it deploy all the components in Kubernetes with containerd out of the box. We understand that we might need a custom GPU driver image, but without at least initial native CoreOS support it is hard to prepare one.

There have been several requests for better support for Fedora CoreOS driver images, e.g. #34 and #8, and we would like to extend this request to better support in all GPU operator components.
We understand that "support in all components out of the box" is a pretty broad subject, but we could at least start with something and gradually improve and test the functionality.

1. Quick Debug Information

  • OS/Version: Fedora CoreOS 39
  • Kernel Version: 6.5.11-300.fc39.x86_64
  • Container Runtime Type/Version: containerd
  • K8s Flavor/Version: Kubernetes (Magnum in OpenStack)
  • GPU Operator Version: gpu-operator-v23.9.2
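
For anyone who wants to report the same fields from their own node, here is a minimal sketch of how they could be collected. It was not part of the original report. The `report` and `collect_debug_info` helpers are hypothetical names introduced for illustration, and it assumes `containerd`, `kubectl` and `helm` are on the PATH where the script is run (a missing tool is reported as "not available").

```shell
#!/usr/bin/env bash
# Sketch: print the "Quick Debug Information" fields for the current host.
# Helper names are illustrative; tool availability is an assumption.
set -u

# Print "<label>: <command output>", or "not available" if the tool
# is missing or the command fails.
report() {
  local label="$1"; shift
  local value
  if command -v "$1" >/dev/null 2>&1 && value="$("$@" 2>/dev/null)"; then
    printf '%s: %s\n' "$label" "$value"
  else
    printf '%s: not available\n' "$label"
  fi
}

collect_debug_info() {
  # Source /etc/os-release in a subshell so its variables do not leak.
  if [ -r /etc/os-release ]; then
    ( . /etc/os-release; printf 'OS/Version: %s\n' "${PRETTY_NAME:-unknown}" )
  else
    printf 'OS/Version: not available\n'
  fi
  report "Kernel Version" uname -r
  report "Container Runtime Type/Version" containerd --version
  report "K8s Flavor/Version" kubectl version --client
  # Lists Helm releases in the namespace used in this report.
  report "GPU Operator Version" helm list -n gpu-operator
}

collect_debug_info
```

The kernel and OS lines come from the node itself, while the `kubectl` and `helm` lines are usually collected from a machine with cluster access.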

2. Issue or feature description

We have prepared a custom (unofficial) GPU driver image to use the operator functionality: the Fedora one from the repo does not work out of the box, but it can start with workarounds. Even so, nvidia-operator-validator cannot finish the deployment validation.

3. Steps to reproduce the issue

  • Install the operator on an OpenStack Magnum cluster with: helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.usePrecompiled=true --set driver.version="550.54.15" --set driver.repository="docker.io/dfateyev", where "dfateyev/driver" is a custom GPU driver image for this Kubernetes cluster;
  • The nvidia-smi cannot address GPU device files: we need to prepare them explicitly like this;
  • The nvidia-operator-validator fails to start properly (see the attached logs below).
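
The install step above, followed by a look at the validator, can be sketched as a small script. The helm arguments are taken from the step as written. The `run`, `install_operator` and `inspect_validator` helpers and the `DRY_RUN` switch are hypothetical additions for illustration, and the `app=nvidia-operator-validator` label selector is an assumption that should be checked against the pods actually present in the cluster.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps. By default (DRY_RUN=1) the commands are
# only printed for review; set DRY_RUN=0 to execute them.
set -euo pipefail

NAMESPACE="gpu-operator"
DRIVER_VERSION="550.54.15"
DRIVER_REPOSITORY="docker.io/dfateyev"  # custom, unofficial image repository

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

install_operator() {
  run helm install --wait --generate-name \
    -n "$NAMESPACE" --create-namespace \
    nvidia/gpu-operator \
    --set driver.usePrecompiled=true \
    --set driver.version="$DRIVER_VERSION" \
    --set driver.repository="$DRIVER_REPOSITORY"
}

inspect_validator() {
  # Overall pod status in the operator namespace.
  run kubectl get pods -n "$NAMESPACE"
  # Assumption: the validator pods carry this label.
  run kubectl logs -n "$NAMESPACE" -l app=nvidia-operator-validator \
    --all-containers --tail=100
}

install_operator
inspect_validator
```

Running it with `DRY_RUN=0` against a cluster executes the same commands, and the log output of the last one is the part relevant to the validator failure.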

4. Information to attach

Attached logs: issue-696-logs.zip

dfateyev, Apr 12 '24 13:04