Restrict ClusterRole and ClusterRoleBinding RBAC permissions to managed resources only
Summary
This PR improves security by restricting the GPU Operator's ClusterRole permissions to only the specific ClusterRoles and ClusterRoleBindings it manages, following the principle of least privilege.
Problem
Previously, the GPU Operator had unrestricted permissions to create, read, update, and delete any ClusterRole or ClusterRoleBinding in the entire Kubernetes cluster:
    - apiGroups:
      - rbac.authorization.k8s.io
      resources:
      - clusterroles
      - clusterrolebindings
      verbs:
      - create
      - get
      - list
      - watch
      - update
      - patch
      - delete
This violates the principle of least privilege and poses security risks:
- The operator could potentially modify critical RBAC resources it doesn't own
- If compromised, the operator could escalate privileges or tamper with cluster security
- Unnecessarily broad permissions increase the blast radius of potential security incidents
Solution
The permissions have been split into two RBAC rules:
- Rule 1: Allows creating new ClusterRoles/ClusterRoleBindings without a resourceNames restriction, because Kubernetes cannot restrict create requests by resource name (the name of the new object is not known at authorization time)
- Rule 2: Restricts get, update, patch, and delete operations to only the 14 specific resources managed by the GPU Operator, using the resourceNames field
Resources managed by GPU Operator:
- nvidia-cc-manager
- nvidia-device-plugin
- nvidia-device-plugin-mps-control-daemon
- nvidia-driver
- nvidia-gpu-feature-discovery
- nvidia-kata-manager
- nvidia-mig-manager
- nvidia-node-status-exporter
- nvidia-operator-validator
- nvidia-sandbox-device-plugin
- nvidia-sandbox-validator
- nvidia-vfio-manager
- nvidia-vgpu-device-manager
- nvidia-vgpu-manager
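As a rough illustration, the two rules described under Solution could look like the fragment below. This is a sketch built from the description, not a copy of the actual template: comments and ordering are assumptions, and because the description does not say where the original list and watch verbs end up, they are left out here.

```yaml
# Sketch only: mirrors the two rules described in this PR; the real
# clusterrole.yaml may differ in comments, ordering, and extra verbs.

# Rule 1: create requests cannot be restricted by resourceNames,
# so this rule carries no resourceNames field.
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterroles
  - clusterrolebindings
  verbs:
  - create

# Rule 2: operations on existing objects are limited to the
# 14 resources managed by the GPU Operator.
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterroles
  - clusterrolebindings
  resourceNames:
  - nvidia-cc-manager
  - nvidia-device-plugin
  - nvidia-device-plugin-mps-control-daemon
  - nvidia-driver
  - nvidia-gpu-feature-discovery
  - nvidia-kata-manager
  - nvidia-mig-manager
  - nvidia-node-status-exporter
  - nvidia-operator-validator
  - nvidia-sandbox-device-plugin
  - nvidia-sandbox-validator
  - nvidia-vfio-manager
  - nvidia-vgpu-device-manager
  - nvidia-vgpu-manager
  verbs:
  - get
  - update
  - patch
  - delete
```

With rules shaped like this, a get, update, patch, or delete on any ClusterRole or ClusterRoleBinding whose name is not in the list would be denied for the operator's service account, while creation stays unrestricted by name.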
Changes
File: deployments/gpu-operator/templates/clusterrole.yaml
- Split the RBAC rule for ClusterRoles and ClusterRoleBindings into two separate rules
- Added resourceNames constraint to get, update, patch, and delete verbs
- Added comments explaining the security improvement and the split-rule pattern
Security Benefits
- Prevents privilege escalation: The operator can no longer modify existing ClusterRoles/ClusterRoleBindings it doesn't own
- Limits blast radius: Reduces the impact if the operator is compromised
- Follows least privilege: Operator only has permissions for resources it actually manages
- Maintains functionality: The operator can still perform all necessary operations on its managed resources
Testing
- [x] YAML syntax validated
- [x] Verified all managed resource names are included in the resourceNames list
- [x] Code analysis confirms the operator only manages the listed resources
Implementation Notes
The permission split (create in one rule, modify operations in another with resourceNames) is a standard Kubernetes RBAC pattern because:
- Kubernetes cannot restrict create requests with resourceNames (the name of the new object is not known at authorization time)
- This approach still provides a significant security improvement by restricting modification of existing resources
References
Code locations that manage ClusterRoles/ClusterRoleBindings:
- controllers/resource_manager.go:133-140 - Loads resources from YAML manifests
- controllers/object_controls.go:421-505 - Creates/updates/deletes the RBAC resources