Restrict ClusterRole and ClusterRoleBinding RBAC permissions to managed resources only

Open • lokielse opened this issue 1 month ago • 1 comment

Summary

This PR improves security by restricting the GPU Operator's ClusterRole permissions to only the specific ClusterRoles and ClusterRoleBindings it manages, following the principle of least privilege.

Problem

Previously, the GPU Operator had unrestricted permissions to create, read, update, and delete any ClusterRole or ClusterRoleBinding in the entire Kubernetes cluster:

- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterroles
  - clusterrolebindings
  verbs:
  - create
  - get
  - list
  - watch
  - update
  - patch
  - delete

This violates the principle of least privilege and poses security risks:

  • The operator could potentially modify critical RBAC resources it doesn't own
  • If compromised, the operator could escalate privileges or tamper with cluster security
  • Unnecessarily broad permissions increase the blast radius of potential security incidents

Solution

The permissions have been split into two RBAC rules:

  1. Rule 1: Allows creating new ClusterRoles/ClusterRoleBindings without a resourceNames restriction, because Kubernetes cannot restrict create requests by resource name (the name of the new object may not be known at authorization time)
  2. Rule 2: Restricts get, update, patch, and delete operations to the 14 specific resources managed by the GPU Operator, using the resourceNames field

Resources managed by GPU Operator:

  • nvidia-cc-manager
  • nvidia-device-plugin
  • nvidia-device-plugin-mps-control-daemon
  • nvidia-driver
  • nvidia-gpu-feature-discovery
  • nvidia-kata-manager
  • nvidia-mig-manager
  • nvidia-node-status-exporter
  • nvidia-operator-validator
  • nvidia-sandbox-device-plugin
  • nvidia-sandbox-validator
  • nvidia-vfio-manager
  • nvidia-vgpu-device-manager
  • nvidia-vgpu-manager
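
The two rules and the resource names listed above could be written roughly as follows. This is a minimal sketch reconstructed from the description, not the actual contents of the changed template. The original rule also granted list and watch; the description does not say how they are handled, so they are left out here.

```yaml
# Sketch of the split-rule pattern described above (an illustration, not the
# actual template from deployments/gpu-operator/templates/clusterrole.yaml).
rules:
# Rule 1: create requests cannot be restricted by resourceNames,
# so this rule stays unscoped.
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterroles
  - clusterrolebindings
  verbs:
  - create
# Rule 2: operations on existing objects are limited to the names
# the GPU Operator manages.
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterroles
  - clusterrolebindings
  resourceNames:
  - nvidia-cc-manager
  - nvidia-device-plugin
  - nvidia-device-plugin-mps-control-daemon
  - nvidia-driver
  - nvidia-gpu-feature-discovery
  - nvidia-kata-manager
  - nvidia-mig-manager
  - nvidia-node-status-exporter
  - nvidia-operator-validator
  - nvidia-sandbox-device-plugin
  - nvidia-sandbox-validator
  - nvidia-vfio-manager
  - nvidia-vgpu-device-manager
  - nvidia-vgpu-manager
  verbs:
  - get
  - update
  - patch
  - delete
```

Because the same resourceNames list applies to both clusterroles and clusterrolebindings, each name covers the ClusterRole and the ClusterRoleBinding of that name.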

Changes

File: deployments/gpu-operator/templates/clusterrole.yaml

  • Split the RBAC rule for ClusterRoles and ClusterRoleBindings into two separate rules
  • Added resourceNames constraint to get, update, patch, and delete verbs
  • Added comments explaining the security improvement and the split-rule pattern

Security Benefits

  1. Reduces privilege escalation risk: The operator can no longer modify or delete existing ClusterRoles/ClusterRoleBindings it doesn't own
  2. Limits blast radius: Reduces the impact if the operator is compromised
  3. Follows least privilege: Operator only has permissions for resources it actually manages
  4. Maintains functionality: The operator can still perform all necessary operations on its managed resources

Testing

  • [x] YAML syntax validated
  • [x] Verified all managed resource names are included in the resourceNames list
  • [x] Code analysis confirms the operator only manages the listed resources
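
Beyond the checks listed above, the restriction could also be checked against a live cluster with a SubjectAccessReview. The sketch below is a hypothetical example and was not part of the reported testing; the service account name and namespace (`gpu-operator` in both cases) are assumptions and need to match the actual installation.

```yaml
# Hypothetical check: may the operator's service account delete a
# ClusterRole that is not in the managed list?
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  # Assumed service account identity; adjust namespace and name as needed.
  user: system:serviceaccount:gpu-operator:gpu-operator
  resourceAttributes:
    group: rbac.authorization.k8s.io
    resource: clusterroles
    verb: delete
    # A ClusterRole outside the resourceNames list.
    name: cluster-admin
```

Submitting this with `kubectl create -f <file> -o yaml` returns the object with a `status.allowed` field. With the restricted rule in place it should be `false`, provided no other role binding grants the service account that permission.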

Implementation Notes

The permission split (create in one rule, modify operations in another with resourceNames) is a standard Kubernetes RBAC pattern because:

  • Kubernetes cannot restrict create requests by resourceNames, because the name of the new object may not be known at authorization time
  • This approach still provides significant security improvement by restricting modification of existing resources

References

Code locations that manage ClusterRoles/ClusterRoleBindings:

  • controllers/resource_manager.go:133-140 - Loads resources from YAML manifests
  • controllers/object_controls.go:421-505 - Creates/updates/deletes the RBAC resources

lokielse • Nov 17 '25 04:11