intel-device-plugins-for-kubernetes

Missing GPU plugin documentation items

Open • eero-t opened this issue 2 years ago • 2 comments

The GPU plugin documentation is missing the following items:

  • Description / table of the resources provided by the GPU plugin ('i915', 'i915_monitor', 'tiles'?) and how they work
  • Discussion of user / group ID issues for device access, or just a link to: https://kubernetes.io/blog/2021/11/09/non-root-containers-and-devices/
  • Link to example deployment YAMLs that use GPU resources, e.g. to the intelgpu-job.yaml referred to in the README.md Testing and Demos section: https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/demo/intelgpu-job.yaml
  • Maybe also a "Quick start" section that lists just the basics, as the current doc is pretty long...
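
As an illustration of what such a linked example could cover, here is a minimal sketch that builds a Job manifest requesting a GPU through the `gpu.intel.com/i915` resource while running as a non-root user. It uses only the Python standard library and prints JSON, which `kubectl apply -f` also accepts. The helper name `gpu_job_manifest`, the job name, the `busybox` image and the UID/GID values are hypothetical placeholders, not taken from the repository's demo YAMLs. Whether a non-root container can actually open the device nodes depends on the container runtime setting described in the Kubernetes blog post linked above.

```python
import json


def gpu_job_manifest(name: str, image: str, gpus: int = 1,
                     uid: int = 1000, gid: int = 1000) -> dict:
    """Build a batch/v1 Job that requests Intel GPU resources.

    Hypothetical helper: the name, image and IDs are placeholders.
    """
    container = {
        "name": name,
        "image": image,
        # List the DRI device nodes the plugin made available to the container.
        "command": ["sh", "-c", "ls -l /dev/dri"],
        "resources": {
            # Extended resources are requested via limits. One unit is a GPU,
            # or one share of a GPU when the plugin runs with shared-dev-num > 1.
            "limits": {"gpu.intel.com/i915": gpus},
        },
        "securityContext": {
            # Run as non-root; device access then relies on the runtime
            # deriving device ownership from these IDs.
            "runAsUser": uid,
            "runAsGroup": gid,
        },
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [container],
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(gpu_job_manifest("gpu-test", "busybox"), indent=2))
```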

eero-t • May 02 '23 13:05

I'd add another item to your list: simplify the GPU plugin deployment options.

I would vote for having only two deployment examples:

  • NFD + GPU-plugin with shared-dev-num=1 and monitoring
    • Basic use case; should work for most users
  • NFD + GPU-plugin with shared-dev-num>1, resource management, monitoring and extended resources
    • GAS (GPU Aware Scheduling) use case, for those who need it
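
On the GPU plugin side, the two variants above differ mainly in the plugin's command-line arguments. A minimal sketch, assuming the plugin's `-shared-dev-num`, `-enable-monitoring` and `-resource-manager` flags (check the GPU plugin README before relying on these names); NFD and the extended resources are deployed separately and are not modelled here. The `plugin_args` helper is hypothetical.

```python
from __future__ import annotations


def plugin_args(use_gas: bool, shared_dev_num: int = 1) -> list[str]:
    """Return GPU plugin arguments for the two proposed deployment variants.

    Hypothetical helper; the flag names are assumptions taken from the
    GPU plugin README and should be verified against the current release.
    """
    if shared_dev_num < 1:
        raise ValueError("shared-dev-num must be at least 1")
    if not use_gas and shared_dev_num != 1:
        # Basic variant: each GPU is handed to exactly one container.
        raise ValueError("the basic variant uses shared-dev-num=1")
    if use_gas and shared_dev_num < 2:
        # GAS variant: GPUs are shared, so more than one slot per GPU.
        raise ValueError("the GAS variant uses shared-dev-num > 1")
    args = [f"-shared-dev-num={shared_dev_num}", "-enable-monitoring"]
    if use_gas:
        # Assumed: with resource management the plugin follows the GPU
        # choice that GAS records for the pod.
        args.append("-resource-manager")
    return args


print(plugin_args(use_gas=False))
print(plugin_args(use_gas=True, shared_dev_num=10))
```

Rejecting mismatched combinations keeps the sketch limited to the two proposed variants; anything else (such as shared-dev-num > 1 without GAS) would belong in the separate advanced document.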

Then I'd move notes about the configuration options (using shared-dev-num without GAS, etc.) into a separate file (advanced-deployment.md or similar). As you say, the current README is pretty long.

tkatila • May 04 '23 11:05

IMHO "using shared-dev-num without GAS" should be documented as being for "dedicated clusters/nodes running a single GPU workload, where the share count equals the number of instances of that workload that fit into a single GPU (with the required QoS)".
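
The sizing rule in the comment above can be written as a small calculation: for each resource the workload needs, divide the GPU's capacity by what one instance needs (measured at the required QoS), then take the smallest result as the share count. A sketch with hypothetical resource names and numbers; the `share_count` helper is not part of the plugin.

```python
from __future__ import annotations


def share_count(gpu_capacity: dict[str, int], per_instance: dict[str, int]) -> int:
    """How many instances of one workload fit on a single GPU.

    Every resource the workload needs is checked and the tightest one wins.
    Units are up to the caller (e.g. MiB of memory, percent of compute),
    measured at the QoS level the workload requires.
    """
    if not per_instance:
        raise ValueError("per-instance needs must not be empty")
    fits = []
    for resource, need in per_instance.items():
        if need <= 0:
            raise ValueError(f"need for {resource!r} must be positive")
        # Whole instances only, hence integer division.
        fits.append(gpu_capacity[resource] // need)
    count = min(fits)
    if count < 1:
        raise ValueError("workload does not fit on a single GPU")
    return count


# Memory allows 5 instances (16384 // 3000), compute only 3 (100 // 30).
print(share_count({"memory_mib": 16384, "compute_pct": 100},
                  {"memory_mib": 3000, "compute_pct": 30}))  # prints 3
```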

AFAIK other uses for it are non-production ones, so they do not need to be mentioned.

eero-t • May 05 '23 13:05