intel-device-plugins-for-kubernetes icon indicating copy to clipboard operation
intel-device-plugins-for-kubernetes copied to clipboard

Inconsistencies between different installation methods

Open eero-t opened this issue 3 years ago • 3 comments

I noticed some confusing inconsistencies when checking things under deployments folder and different installation methods...

GPU plugin is installed on different nodes depending on used method:

$ git grep feature.node.kubernetes.io | grep gpu
gpu_plugin/overlays/nfd_labeled_nodes/add-nodeselector-intel-gpu.yaml:        feature.node.kubernetes.io/pci-0300_8086.present: "true"
nfd/overlays/node-feature-rules/node-feature-rules-openshift.yaml:        "intel.feature.node.kubernetes.io/gpu": "true"
nfd/overlays/node-feature-rules/node-feature-rules.yaml:        "intel.feature.node.kubernetes.io/gpu": "true"
operator/samples/deviceplugin_v1_gpudeviceplugin.yaml:    intel.feature.node.kubernetes.io/gpu: "true"

IMHO at least the last two first options (relying on NFD), should be the same, not differ:

  • With operator, GPU plugins are deployed to all nodes with Intel GPUs
  • Using gpu_plugin/overlays/nfd_labeled_nodes, they go only to nodes with Intel GPUs supporting display output
  • Following GPU plugin README, plugin is deployed on all nodes regardless of whether they have GPU

Whereas inconsistencies in namespace usage between installation methods could mean that user ends accidentally with multiple instances of the same thing running in different namespaces:

$ cd deployments; git grep namespace: | grep -v -e { -e "'" -e % -e _test
dsa_plugin/overlays/dsa_initcontainer/dsa-config.yaml:  namespace: inteldeviceplugins-system
dsa_plugin/overlays/namespace_kube-system/add-namespace-kube-system.yaml:  namespace: kube-system
fpga_admissionwebhook/base/manager_webhook_patch.yaml:  namespace: system
fpga_admissionwebhook/certmanager/certificate.yaml:  namespace: system
fpga_admissionwebhook/certmanager/certificate.yaml:  namespace: system
fpga_admissionwebhook/default/kustomization.yaml:namespace: intelfpgawebhook-system
fpga_admissionwebhook/manager/manager.yaml:  namespace: system
fpga_admissionwebhook/rbac/role_binding.yaml:  namespace: system
fpga_admissionwebhook/webhook/kustomizeconfig.yaml:namespace:
fpga_admissionwebhook/webhook/manifests.yaml:      namespace: system
fpga_admissionwebhook/webhook/service.yaml:  namespace: system
fpga_plugin/base/intel-fpga-plugin-daemonset.yaml:  namespace: system
fpga_plugin/base/role_binding.yaml:  namespace: system
fpga_plugin/overlays/af/kustomization.yaml:namespace: intelfpgaplugin-system
fpga_plugin/overlays/region/kustomization.yaml:namespace: intelfpgaplugin-system
fpga_plugin/overlays/region/mode-region.yaml:  namespace: system
gpu_plugin/overlays/fractional_resources/resource-cluster-role-binding.yaml:  namespace: default
gpu_plugin/overlays/namespace_kube-system/add-namespace-kube-system.yaml:  namespace: kube-system
iaa_plugin/overlays/iaa_initcontainer/iaa-config.yaml:  namespace: inteldeviceplugins-system
nfd/overlays/node-feature-discovery/node-feature-discovery-openshift.yaml:  namespace: openshift-nfd
operator/certmanager/certificate.yaml:  namespace: system
operator/certmanager/certificate.yaml:  namespace: system
operator/crd/bases/deviceplugin.intel.com_dlbdeviceplugins.yaml:                  namespace:
operator/crd/bases/deviceplugin.intel.com_dsadeviceplugins.yaml:                  namespace:
operator/crd/bases/deviceplugin.intel.com_fpgadeviceplugins.yaml:                  namespace:
operator/crd/bases/deviceplugin.intel.com_gpudeviceplugins.yaml:                  namespace:
operator/crd/bases/deviceplugin.intel.com_iaadeviceplugins.yaml:                  namespace:
operator/crd/bases/deviceplugin.intel.com_qatdeviceplugins.yaml:                  namespace:
operator/crd/bases/deviceplugin.intel.com_sgxdeviceplugins.yaml:                  namespace:
operator/crd/kustomizeconfig.yaml:namespace:
operator/crd/patches/webhook_in_fpgadeviceplugins.yaml:        namespace: system
operator/crd/patches/webhook_in_gpudeviceplugins.yaml:        namespace: system
operator/crd/patches/webhook_in_qatdeviceplugins.yaml:        namespace: system
operator/default/kustomization.yaml:namespace: inteldeviceplugins-system
operator/default/manager_auth_proxy_patch.yaml:  namespace: system
operator/default/manager_webhook_patch.yaml:  namespace: system
operator/device/dlb/dlb.yaml:  namespace: inteldeviceplugins-system
operator/device/dsa/dsa.yaml:  namespace: inteldeviceplugins-system
operator/device/fpga/fpga.yaml:  namespace: inteldeviceplugins-system
operator/device/gpu/gpu.yaml:  namespace: inteldeviceplugins-system
operator/device/qat/qat.yaml:  namespace: inteldeviceplugins-system
operator/device/sgx/sgx.yaml:  namespace: inteldeviceplugins-system
operator/manager/manager.yaml:  namespace: system
operator/manifests/bases/intel-device-plugins-operator.clusterserviceversion.yaml:  namespace: placeholder
operator/rbac/auth_proxy_role_binding.yaml:  namespace: system
operator/rbac/auth_proxy_service.yaml:  namespace: system
operator/rbac/leader_election_role_binding.yaml:  namespace: system
operator/rbac/role_binding.yaml:  namespace: system
operator/webhook/kustomizeconfig.yaml:namespace:
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/manifests.yaml:      namespace: system
operator/webhook/service.yaml:  namespace: system
sgx_admissionwebhook/base/manager_webhook_patch.yaml:  namespace: system
sgx_admissionwebhook/certmanager/certificate.yaml:  namespace: system
sgx_admissionwebhook/certmanager/certificate.yaml:  namespace: system
sgx_admissionwebhook/manager/manager.yaml:  namespace: system
sgx_admissionwebhook/overlays/default-with-certmanager/kustomization.yaml:namespace: intelsgxwebhook-system
sgx_admissionwebhook/webhook/kustomizeconfig.yaml:namespace:
sgx_admissionwebhook/webhook/manifests.yaml:      namespace: system
sgx_admissionwebhook/webhook/service.yaml:  namespace: system
sgx_plugin/overlays/epc-register/kustomization.yaml:namespace: kube-system
sgx_plugin/overlays/epc-register/service-account.yaml:  namespace: kube-system
sgx_plugin/overlays/epc-register/service-account.yaml:  namespace: kube-system
vpu_plugin/overlays/namespace_kube-system/add-namespace-kube-system.yaml:  namespace: kube-system

For example, when using operator, GPU plugin goes to inteldeviceplugins-system ns, when using GPU plugin README apply method, it goes to default ns, and when using gpu_plugin/overlays/namespace_kube-system overlay, it goes to kube-system ns. Some of the other plugins, use different namespaces, but that's not consistent either.

GPU plugin also uses different service account & roles depending on installation method (this is from repo root):

$ git grep gpu-manager
Makefile:       $(CONTROLLER_GEN) rbac:roleName=gpu-manager-role paths="./cmd/gpu_plugin/..." output:dir=deployments/operator/rbac
charts/operator/templates/operator.yaml:  name: inteldeviceplugins-gpu-manager-role
deployments/operator/rbac/gpu_manager_role.yaml:  name: gpu-manager-role
pkg/controllers/gpu/controller.go:      serviceAccountName = "gpu-manager-sa"
pkg/controllers/gpu/controller.go:                              Name:      "gpu-manager-sa",
pkg/controllers/gpu/controller.go:                              Name:      "gpu-manager-rolebinding",
pkg/controllers/gpu/controller.go:                                      Name:      "gpu-manager-sa",
pkg/controllers/gpu/controller.go:                              Name:     "inteldeviceplugins-gpu-manager-role",

$ git grep resource-reader
cmd/gpu_plugin/README.md:serviceaccount/resource-reader-sa created
cmd/gpu_plugin/README.md:clusterrole.rbac.authorization.k8s.io/resource-reader created
cmd/gpu_plugin/README.md:clusterrolebinding.rbac.authorization.k8s.io/resource-reader-rb created
deployments/gpu_plugin/overlays/fractional_resources/add-serviceaccount.yaml:      serviceAccountName: resource-reader-sa
deployments/gpu_plugin/overlays/fractional_resources/kustomization.yaml:  - resource-reader-sa.yaml
deployments/gpu_plugin/overlays/fractional_resources/resource-cluster-role-binding.yaml:  name: resource-reader-rb
deployments/gpu_plugin/overlays/fractional_resources/resource-cluster-role-binding.yaml:  name: resource-reader-sa
deployments/gpu_plugin/overlays/fractional_resources/resource-cluster-role-binding.yaml:  name: resource-reader
deployments/gpu_plugin/overlays/fractional_resources/resource-cluster-role.yaml:  name: resource-reader
deployments/gpu_plugin/overlays/fractional_resources/resource-reader-sa.yaml:  name: resource-reader-sa

eero-t avatar Jun 22 '22 13:06 eero-t

Tuomas fixed the deployment rule inconsistency in https://github.com/intel/intel-device-plugins-for-kubernetes/pull/1169 and Ukri is fixing overlay object names inconsistency in https://github.com/intel/intel-device-plugins-for-kubernetes/pull/1172.

As to remaining namespace inconsistency...

According to @ukri, it's better that manually installed GPU plugin, and operator operating automatically on k8s GPU plugin components, are in separate namespaces. But at least the namespace overlay and install doc example should use the same namespace.

Namespace can be given as kubectl/kustomize command line option. Maybe it's best if namespace overlay is deprecated (+ eventually removed), and install doc uses e.g. "intel-gpu" as the namespace in kubectl example?

(I suggested "intel-gpu" namespace to indicate that pods in it are Intel related components unlike generic stuff in "kube-system", which is used by the current namespace overlay.)

eero-t avatar Sep 26 '22 19:09 eero-t

I grepped the namespaces from deployments again:

$ grep -sr namespace: * | grep "namespace: ." | grep -v -e " system" -e inteldeviceplugins-system
fpga_admissionwebhook/default/kustomization.yaml:namespace: intelfpgawebhook-system
fpga_plugin/overlays/af/kustomization.yaml:namespace: intelfpgaplugin-system
fpga_plugin/overlays/region/kustomization.yaml:namespace: intelfpgaplugin-system
gpu_plugin/overlays/fractional_resources/gpu-manager-rolebinding.yaml:  namespace: default
nfd/overlays/node-feature-discovery/node-feature-discovery-openshift.yaml:  namespace: openshift-nfd
operator/manifests/bases/intel-device-plugins-operator.clusterserviceversion.yaml:  namespace: placeholder
sgx_admissionwebhook/overlays/default-with-certmanager/kustomization.yaml:namespace: intelsgxwebhook-system
sgx_plugin/overlays/epc-register/service-account.yaml:  namespace: kube-system
sgx_plugin/overlays/epc-register/service-account.yaml:  namespace: kube-system
sgx_plugin/overlays/epc-register/kustomization.yaml:namespace: kube-system
xpumanager_sidecar/kustomization.yaml:namespace: monitoring

Some suggestions:

  • Rename intelfpgawebhook-system and intelfpgaplugin-system as inteldeviceplugins-system ?
    • Needs modifications to demo scripts also.
  • Rename intelsgxwebhook-system as inteldeviceplugins-system ?
  • For sgx's epc-register, rename kube-system as inteldeviceplugins-system ?

tkatila avatar Apr 18 '23 08:04 tkatila

Suggestions look fine to me. @mythi ?

eero-t avatar Apr 18 '23 08:04 eero-t