The Trident operator fails to install via Helm on Rancher

Open lindhe opened this issue 2 years ago • 8 comments

Describe the bug

When installing the Trident operator from the Helm chart in a Kubernetes cluster managed by Rancher, the operator fails because it is unable to add the PSA label pod-security.kubernetes.io/enforce: privileged to its installation namespace. This is because Rancher has a special admission webhook in place for changes to PSA labels: the ServiceAccount must be granted the updatepsa permission explicitly, on top of all the other RBAC rules it needs.

Environment

  • Trident version: 23.04.0
  • Trident installation flags used: helm install trident netapp-trident/trident-operator --version 23.04.0 --create-namespace --namespace trident
  • Container runtime: Containerd v1.6.19-k3s1
  • Kubernetes version: v1.25.9
  • Kubernetes orchestrator: Rancher v2.7.5
  • Kubernetes enabled feature gates: None.
  • OS: Ubuntu 22.04.2 LTS
  • NetApp backend types: n/a
  • Other: n/a

To Reproduce

  1. Have a Rancher-managed RKE2 cluster (though I'm guessing this reproduces on any Rancher-managed cluster).

  2. helm repo add netapp-trident https://netapp.github.io/trident-helm-chart

  3. helm install trident netapp-trident/trident-operator --version 23.04.0 --create-namespace --namespace trident

  4. Check the status of the installed CRDs, the trident TridentOrchestrator object and the deployed pods:

    $ kubectl get crd | grep trident
    tridentorchestrators.trident.netapp.io                            2023-06-28T14:56:46Z
    
    $ kubectl -n trident get pods
    NAME                                 READY    STATUS    RESTARTS    AGE
    trident-operator-5789cf4777-nc4vn    1/1      Running   0           7m32s
    
    $ kubectl -n trident get tridentorchestrators trident -o yaml
     […]
     status:
       message: 'Failed to install Trident; err: failed to patch Trident installation namespace
         trident; admission webhook "rancher.cattle.io.namespaces" denied the request:
         Unauthorized'
       namespace: trident
       status: Failed
       version: ""
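
To check only the failure state without reading the whole YAML dump, the status fields can be pulled out with a jsonpath query. This is a minimal sketch: the function name `torc_status` and the `KUBECTL` override variable are made up for illustration, and the field names `.status.status` and `.status.message` are taken from the output above.

```shell
# Hypothetical helper: print "<status>: <message>" for a TridentOrchestrator.
# KUBECTL is an illustrative override hook; it defaults to the real kubectl.
KUBECTL="${KUBECTL:-kubectl}"

torc_status() {
  # $1: installation namespace (default: trident)
  # $2: name of the TridentOrchestrator object (default: trident)
  "$KUBECTL" -n "${1:-trident}" get tridentorchestrators "${2:-trident}" \
    -o jsonpath='{.status.status}{": "}{.status.message}{"\n"}'
}
```

Calling `torc_status` with no arguments queries the same `trident` object as the command in step 4.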
    

Expected behavior

I expect it to deploy as it should and not crash. Here's an example of what it looks like when deploying successfully:

$ kubectl -n trident get pods
NAME                                  READY   STATUS    RESTARTS   AGE
trident-controller-6d7c9c5d8c-wg8zj   6/6     Running   0          4h28m
trident-node-linux-4tk6q              2/2     Running   0          4h28m
trident-node-linux-97rgx              2/2     Running   0          4h28m
trident-node-linux-9jfbh              2/2     Running   0          4h28m
trident-node-linux-btjx6              2/2     Running   0          4h28m
trident-node-linux-n5k75              2/2     Running   0          4h28m
trident-node-linux-vpcgd              2/2     Running   0          4h28m
trident-operator-5789cf4777-66mth     1/1     Running   0          4h29m

$ kubectl get crd | grep trident
tridentbackendconfigs.trident.netapp.io                           2023-07-05T08:09:56Z
tridentbackends.trident.netapp.io                                 2023-07-05T08:09:55Z
tridentmirrorrelationships.trident.netapp.io                      2023-07-05T08:10:00Z
tridentnodes.trident.netapp.io                                    2023-07-05T08:09:58Z
tridentorchestrators.trident.netapp.io                            2023-06-28T14:56:46Z
tridentsnapshotinfos.trident.netapp.io                            2023-07-05T08:09:56Z
tridentsnapshots.trident.netapp.io                                2023-07-05T08:09:59Z
tridentstorageclasses.trident.netapp.io                           2023-07-05T08:09:56Z
tridenttransactions.trident.netapp.io                             2023-07-05T08:09:59Z
tridentversions.trident.netapp.io                                 2023-07-05T08:09:55Z
tridentvolumepublications.trident.netapp.io                       2023-07-05T08:09:57Z
tridentvolumereferences.trident.netapp.io                         2023-07-05T08:10:00Z
tridentvolumes.trident.netapp.io                                  2023-07-05T08:09:57Z

Additional context

This was already reported to Rancher's GitHub page as issue #41191. People (understandably) thought that this was a bug in Rancher, while it's more of a documentation issue on their part (in my opinion).

There's also some information available in the operator's pod logs. I don't have them easily available right now, but it basically amounts to the same message as the one displayed by the TridentOrchestrator object anyway; it fails to patch the trident namespace because the Rancher admission webhook rancher.cattle.io.namespaces denied the request (Unauthorized).
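
For anyone who wants to pull the relevant lines from the operator logs, something along these lines should do. It is a sketch under assumptions: the Deployment name `trident-operator` is inferred from the pod names shown in this issue, the function name and the `KUBECTL` variable are invented for illustration, and the filter simply matches the webhook name rather than any known log format.

```shell
# Hypothetical helper: show operator log lines that mention the Rancher
# namespace webhook. KUBECTL is an illustrative override hook.
KUBECTL="${KUBECTL:-kubectl}"

operator_webhook_errors() {
  # $1: namespace the operator runs in (default: trident)
  # Assumes the operator Deployment is named trident-operator.
  "$KUBECTL" -n "${1:-trident}" logs deployment/trident-operator --tail=500 \
    | grep -F 'rancher.cattle.io.namespaces'
}
```

If nothing matches, `grep` exits non-zero and the function prints nothing, which at least tells you the webhook denial is not in the last 500 log lines.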

Work-around

Inspired by this comment from the issue reported to Rancher's GitHub page, applying the following manifest and then restarting the operator fixes the issue:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: trident-operator-psa
rules:
- apiGroups:
  - management.cattle.io
  resources:
  - projects
  verbs:
  - updatepsa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: trident-operator-psa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: trident-operator-psa
subjects:
- kind: ServiceAccount
  name: trident-operator
  namespace: trident
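
Applying the manifest and restarting the operator could look roughly like this. It is a sketch, not the only way to do it: the function name, the manifest path argument and the `KUBECTL` variable are hypothetical, and the Deployment name `trident-operator` is assumed from the pod names above.

```shell
# Hypothetical helper: apply the PSA work-around manifest, then restart the
# operator so it retries the installation. KUBECTL is an illustrative
# override hook that defaults to the real kubectl.
KUBECTL="${KUBECTL:-kubectl}"

apply_psa_workaround() {
  # $1: path to a file containing the manifest above (required)
  # $2: namespace the operator runs in (default: trident)
  manifest="${1:?path to manifest required}"
  ns="${2:-trident}"
  "$KUBECTL" apply -f "$manifest" &&
    "$KUBECTL" -n "$ns" rollout restart deployment/trident-operator &&
    "$KUBECTL" -n "$ns" rollout status deployment/trident-operator
}
```

For example, with the manifest saved as `trident-operator-psa.yaml` (a file name chosen here for illustration), run `apply_psa_workaround trident-operator-psa.yaml`.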

lindhe avatar Jul 05 '23 13:07 lindhe

We're running into the same issue after upgrading from Rancher 2.6.11 to 2.7.5. I can confirm that your workaround fixes the issue.

nheinemans avatar Jul 12 '23 04:07 nheinemans

@lindhe: Thanks for bringing this up and creating the corresponding pull request. I can also confirm that this solves the issue in my cluster.

Does NetApp have a plan to merge this at some point? Applying these workarounds in automation is a bit cumbersome and unclean.

Philbow avatar Aug 07 '23 15:08 Philbow

We're still seeing the same issue in Rancher 2.7.9 and Trident 23.10.0. Can we perhaps get an update from NetApp on this issue and the pending PR?

nheinemans-asml avatar Nov 29 '23 07:11 nheinemans-asml

@nheinemans-asml Could you try with v24.10.0? It's apparently resolved there, but I have no idea which PR that was.

lindhe avatar Nov 05 '24 10:11 lindhe

@lindhe I tested it with Rancher v2.9.2 and Trident 24.10.0, and it is still an issue. After applying the workaround it succeeds:

kubectl describe torc trident 

Events:
  Type     Reason      Age                  From                        Message
  ----     ------      ----                 ----                        -------
  Normal   Installing  16m                  trident-operator.netapp.io  Installing Trident
  Warning  Failed      3m45s (x6 over 16m)  trident-operator.netapp.io  Failed to install Trident; err: failed to patch Trident installation namespace netapp-trident; admission webhook "rancher.cattle.io.namespaces" denied the request: Unauthorized
  Normal   Installed   27s                  trident-operator.netapp.io  Trident installed

betweenclouds avatar Nov 07 '24 09:11 betweenclouds

Hi @betweenclouds, this should have been fixed in 24.10.0 as part of https://github.com/NetApp/trident/commit/5824103a201cb2f1be13f9435e554ad160c829b3

Can you try setting forceInstallRancherClusterRoles: true in helm/trident-operator/values.yaml?

sjpeeris avatar Nov 13 '24 02:11 sjpeeris

@sjpeeris Thank you, with forceInstallRancherClusterRoles=true the installation is successful, but only if I install into a namespace named trident. Is this expected behavior?

works:

helm install netapp-trident netapp-trident/trident-operator --version 100.2410.0 --create-namespace --namespace trident --set tridentDebug=true --set forceInstallRancherClusterRoles=true

does not work:

helm install netapp-trident netapp-trident/trident-operator --version 100.2410.0 --create-namespace --namespace netapp-trident --set tridentDebug=true --set forceInstallRancherClusterRoles=true

edit:

Namespace is hard-coded here: https://github.com/NetApp/trident/blob/master/helm/trident-operator/templates/clusterrolebinding-rancher.yaml#L13

instead of a variable like here: https://github.com/NetApp/trident/blob/master/helm/trident-operator/templates/clusterrolebinding.yaml#L10
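
Until the chart is fixed, one possible stop-gap for a non-default namespace is to bind the ClusterRole from the Work-around section at the top of this issue to the operator's ServiceAccount in that namespace. This is only a sketch: it assumes the `trident-operator-psa` ClusterRole from that work-around has already been applied and that the ServiceAccount is still called `trident-operator`; the function name and the binding name are made up.

```shell
# Hypothetical helper: print a ClusterRoleBinding that grants the
# trident-operator-psa ClusterRole (from the work-around manifest) to the
# operator ServiceAccount in the given namespace.
psa_binding_manifest() {
  # $1: namespace the operator is installed in (required)
  ns="${1:?operator namespace required}"
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: trident-operator-psa-${ns}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: trident-operator-psa
subjects:
- kind: ServiceAccount
  name: trident-operator
  namespace: ${ns}
EOF
}
```

Usage would be `psa_binding_manifest netapp-trident | kubectl apply -f -`, followed by a restart of the operator.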

betweenclouds avatar Nov 13 '24 07:11 betweenclouds

Hi @betweenclouds, you are correct. That namespace shouldn't be hard-coded. We will have this fixed in the next release. Thanks for pointing that out.

jharrod avatar Nov 14 '24 21:11 jharrod