Longhorn 100.1.2+up1.2.4 stuck uninstalling in RKE1 Windows cluster

Open pennyscissors opened this issue 3 years ago • 12 comments

Rancher Server Setup

  • Rancher version: v2.6.4-rc11
  • Installation option (Docker install/Helm Chart): n/a
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: v1.22.7
  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): Windows

Describe the bug

To Reproduce

  • Install longhorn 100.1.2+up1.2.4
  • Uninstall longhorn
  • Check longhorn resources, specifically pods related to the longhorn-uninstall job

Result

The longhorn chart gets stuck uninstalling, seemingly because resources are being scheduled on Windows nodes, which may not be supported.

pennyscissors avatar Mar 24 '22 17:03 pennyscissors

Looks like installation in RKE2 Windows (v2.6.4-rc11, k8s 1.22.7) is failing too.


Helm logs

helm upgrade --install=true --namespace=longhorn-system --timeout=10m0s --values=/home/shell/helm/values-longhorn-crd-100.1.2-up1.2.4.yaml --version=100.1.2+up1.2.4 --wait=true longhorn-crd /home/shell/helm/longhorn-crd-100.1.2-up1.2.4.tgz
Release "longhorn-crd" does not exist. Installing it now.
creating 15 resource(s)
beginning wait for 15 resources with timeout of 10m0s
NAME: longhorn-crd
LAST DEPLOYED: Thu Mar 24 17:16:38 2022
NAMESPACE: longhorn-system
STATUS: deployed
REVISION: 1
TEST SUITE: None

---------------------------------------------------------------------
SUCCESS: helm upgrade --install=true --namespace=longhorn-system --timeout=10m0s --values=/home/shell/helm/values-longhorn-crd-100.1.2-up1.2.4.yaml --version=100.1.2+up1.2.4 --wait=true longhorn-crd /home/shell/helm/longhorn-crd-100.1.2-up1.2.4.tgz
---------------------------------------------------------------------
helm upgrade --install=true --namespace=longhorn-system --timeout=10m0s --values=/home/shell/helm/values-longhorn-100.1.2-up1.2.4.yaml --version=100.1.2+up1.2.4 --wait=true longhorn /home/shell/helm/longhorn-100.1.2-up1.2.4.tgz
Release "longhorn" does not exist. Installing it now.
W0324 17:16:44.294929      47 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
creating 18 resource(s)
W0324 17:16:44.397770      47 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
beginning wait for 18 resources with timeout of 10m0s
Deployment is not ready: longhorn-system/longhorn-driver-deployer. 0 out of 1 expected pods are ready
Deployment is not ready: longhorn-system/longhorn-driver-deployer. 0 out of 1 expected pods are ready
Deployment is not ready: longhorn-system/longhorn-driver-deployer. 0 out of 1 expected pods are ready
Deployment is not ready: longhorn-system/longhorn-driver-deployer. 0 out of 1 expected pods are ready
...
Deployment is not ready: longhorn-system/longhorn-driver-deployer. 0 out of 1 expected pods are ready
Error: timed out waiting for the condition

Logs in longhorn-driver-deployer pod:

 (combined from similar events): MountVolume.SetUp failed for volume "kube-api-access-7hrzc" : chown c:\var\lib\kubelet\pods\c9967814-78c9-4b41-b79b-05bd58028bf2\volumes\kubernetes.io~projected\kube-api-access-7hrzc\..2022_03_24_17_47_26.039614073\token: not supported by windows

pennyscissors avatar Mar 24 '22 17:03 pennyscissors

cc @PhanLe1010 we allow users to install in a hybrid cluster, but users need to use a node selector to skip installing longhorn components on the Windows nodes.

https://longhorn.io/docs/1.2.4/advanced-resources/deploy/rancher_windows_cluster/

innobead avatar Mar 24 '22 22:03 innobead

Yeah, please check if you have set the node selector and taint toleration as mentioned in the doc provided by @innobead
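
For readers following along, here is a minimal sketch of what the node selector and toleration look like on a pod spec. It assumes the Rancher defaults of a `kubernetes.io/os: linux` label and a `cattle.io/os=linux:NoSchedule` taint on the Linux nodes. The pod name and image are hypothetical placeholders, not part of the Longhorn chart.

```yaml
# Hypothetical pod, shown only to illustrate the scheduling fields.
apiVersion: v1
kind: Pod
metadata:
  name: example-linux-only
  namespace: longhorn-system
spec:
  # Only schedule onto nodes labeled as Linux.
  nodeSelector:
    kubernetes.io/os: linux
  # Tolerate the taint assumed to be placed on Linux nodes in a Windows cluster.
  tolerations:
    - key: cattle.io/os
      operator: Equal
      value: linux
      effect: NoSchedule
  containers:
    - name: example
      image: example.invalid/placeholder:latest
```

The selector keeps the pod off Windows nodes, and the toleration lets it land on the tainted Linux nodes. Both are needed for the pod to be scheduled in that setup.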

PhanLe1010 avatar Mar 24 '22 22:03 PhanLe1010

This was found in a test case from this issue for allowing installation of charts in hybrid clusters without the need for setting tolerations/node-selectors manually. Sounds like this is not a regression; it is just longhorn missing the right tolerations/node-selectors to account for the new behavior.

pennyscissors avatar Mar 24 '22 23:03 pennyscissors

This was found in a test case from this issue for allowing installation of charts in hybrid clusters without the need for setting tolerations/node-selectors manually. Sounds like this is not a regression; it is just longhorn missing the right tolerations/node-selectors to account for the new behavior.

That probably means we should find out what we missed, then add it to the next chart release to benefit from https://github.com/rancher/dashboard/issues/5137

cc @meldafrawi

innobead avatar Mar 25 '22 05:03 innobead

This should be re-tested following Team 3's additional annotations design, which should block the chart from deploying on Windows nodes.

sirredbeard avatar Jun 07 '22 15:06 sirredbeard

@ronhorton, based on the last comment on this ticket, we can test this now that our annotations have been merged.

MKlimuszka avatar Jun 28 '22 17:06 MKlimuszka

@nickwsuse @nwilliams22 when working this ticket, reach out to brandon depesa in team-rancher-qa-team3 re: provisioning windows clusters

ronhorton avatar Jul 08 '22 18:07 ronhorton

@MKlimuszka @ronhorton @sirredbeard This issue has not been fixed. The chart needs node-selectors, tolerations, and windows annotations to support hybrid clusters the same way we do in our other charts, instead of expecting the user to manually add them.

Example:

  • Functions we use to apply node-selectors and tolerations in all charts that do not support Windows nodes: https://github.com/rancher/charts/blob/96cee30ac4ecb1741b41f8d30875f196a04c9ea1/charts/rancher-cis-benchmark/2.0.4/templates/_helpers.tpl#L14-L27
  • Annotation required in the chart to enable the behavior: https://github.com/rancher/charts/blob/96cee30ac4ecb1741b41f8d30875f196a04c9ea1/charts/rancher-cis-benchmark/2.0.4/Chart.yaml#L8

If the chart does have Windows components that it can deploy onto Windows nodes, then a few additional things are required. See rancher-monitoring for reference on this use case.
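
As a rough illustration of the annotation side, a Chart.yaml fragment might look like the sketch below. The `catalog.cattle.io/permits-os` key is what the linked Chart.yaml line is understood to set. The values, the `catalog.cattle.io/os` line, and the chart name and version are assumptions for illustration only, so check the linked file before copying anything.

```yaml
# Chart.yaml fragment (illustrative; chart name and version are placeholders)
apiVersion: v2
name: example-chart
version: 0.0.0
annotations:
  # Assumed: the chart's own workloads run on Linux only...
  catalog.cattle.io/os: linux
  # ...but the chart may be installed in clusters that also contain Windows nodes.
  catalog.cattle.io/permits-os: linux,windows
```

The annotation only permits installation in a hybrid cluster. The workloads still need the Linux node selector and toleration from the helper functions to stay off the Windows nodes.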

pennyscissors avatar Jul 15 '22 18:07 pennyscissors

@innobead I'm assigning this to you per @yasker

brandonsuse avatar Jul 15 '22 19:07 brandonsuse

Let's fix this in the upcoming 1.3.1 and 1.2.5.

  • Add permit-os as @PennyScissors mentioned
  • Add the default toleration and node selector (@PennyScissors mentioned) for user-deployed components (Manager, Driver, UI); this part can be handled in the chart manifests
  • Need to update the code to introduce this implicit default toleration and node selector at runtime

innobead avatar Jul 18 '22 07:07 innobead

This has been improved in the upcoming Longhorn 1.4.0, 1.3.1, and 1.2.5. We will update the corresponding Rancher chart for 1.3.1 to resolve this issue, because 1.3.1 is the next upcoming release.

cc @rebeccazzzz

innobead avatar Jul 26 '22 03:07 innobead

@innobead Since the Rancher chart for 1.3.1 has been released already - can this now be moved to-test for v2.6.9?

prachidamle avatar Sep 08 '22 19:09 prachidamle

@innobead Since the Rancher chart for 1.3.1 has been released already - can this now be moved to-test for v2.6.9?

Yes, please.

innobead avatar Sep 08 '22 22:09 innobead

Closing as verified

This was tested on an RKE2 mixed Windows cluster using Longhorn v1.3.1

Steps for validation on RKE2 Windows Cluster

  1. Deploy Longhorn prerequisites into the cluster
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: longhorn-iscsi-installation
  labels:
    app: longhorn-iscsi-installation
  annotations:
    command: &cmd OS=$(grep -E "^ID_LIKE=" /etc/os-release | cut -d '=' -f 2); if [[ -z "${OS}" ]]; then OS=$(grep -E "^ID=" /etc/os-release | cut -d '=' -f 2); fi; if [[ "${OS}" == *"debian"* ]]; then sudo apt-get update -q -y && sudo apt-get install -q -y open-iscsi && sudo systemctl -q enable iscsid && sudo systemctl start iscsid && sudo modprobe iscsi_tcp; elif [[ "${OS}" == *"suse"* ]]; then sudo zypper --gpg-auto-import-keys -q refresh && sudo zypper --gpg-auto-import-keys -q install -y open-iscsi && sudo systemctl -q enable iscsid && sudo systemctl start iscsid && sudo modprobe iscsi_tcp; else sudo yum makecache -q -y && sudo yum --setopt=tsflags=noscripts install -q -y iscsi-initiator-utils && echo "InitiatorName=$(/sbin/iscsi-iname)" > /etc/iscsi/initiatorname.iscsi && sudo systemctl -q enable iscsid && sudo systemctl start iscsid && sudo modprobe iscsi_tcp; fi && if [ $? -eq 0 ]; then echo "iscsi install successfully"; else echo "iscsi install failed error code $?"; fi
spec:
  selector:
    matchLabels:
      app: longhorn-iscsi-installation
  template:
    metadata:
      labels:
        app: longhorn-iscsi-installation
    spec:
      hostNetwork: true
      hostPID: true
      initContainers:
      - name: iscsi-installation
        command:
          - nsenter
          - --mount=/proc/1/ns/mnt
          - --
          - bash
          - -c
          - *cmd
        image: alpine:3.12
        securityContext:
          privileged: true
      containers:
      - name: sleep
        image: k8s.gcr.io/pause:3.1
  updateStrategy:
    type: RollingUpdate
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: longhorn-nfs-installation
  labels:
    app: longhorn-nfs-installation
  annotations:
    command: &cmd OS=$(grep -E "^ID_LIKE=" /etc/os-release | cut -d '=' -f 2); if [[ -z "${OS}" ]]; then OS=$(grep -E "^ID=" /etc/os-release | cut -d '=' -f 2); fi; if [[ "${OS}" == *"debian"* ]]; then sudo apt-get update -q -y && sudo apt-get install -q -y nfs-common && sudo modprobe nfs; elif [[ "${OS}" == *"suse"* ]]; then sudo zypper --gpg-auto-import-keys -q refresh && sudo zypper --gpg-auto-import-keys -q install -y nfs-client && sudo modprobe nfs; else sudo yum makecache -q -y && sudo yum --setopt=tsflags=noscripts install -q -y nfs-utils && sudo modprobe nfs; fi && if [ $? -eq 0 ]; then echo "nfs install successfully"; else echo "nfs install failed error code $?"; fi
spec:
  selector:
    matchLabels:
      app: longhorn-nfs-installation
  template:
    metadata:
      labels:
        app: longhorn-nfs-installation
    spec:
      hostNetwork: true
      hostPID: true
      initContainers:
      - name: nfs-installation
        command:
          - nsenter
          - --mount=/proc/1/ns/mnt
          - --
          - bash
          - -c
          - *cmd
        image: alpine:3.12
        securityContext:
          privileged: true
      containers:
      - name: sleep
        image: k8s.gcr.io/pause:3.1
  updateStrategy:
    type: RollingUpdate
  2. Install the Longhorn v1.3.1 chart into the cluster. In Edit YAML, set windowsCluster enabled: true
global:
  cattle:
    systemDefaultRegistry: ""
    windowsCluster:
      # Enable this to allow Longhorn to run on the Rancher deployed Windows cluster
      enabled: true
  3. Verify the Longhorn UI is functional and properly displayed
  4. Uninstall the Longhorn chart from the UI by navigating to Installed Apps, first removing the longhorn chart and then the longhorn-crd
  5. The UI terminal is displayed for each chart, showing the helm uninstall command, and completes with a success message
---------------------------------------------------------------------
SUCCESS: helm uninstall --namespace=longhorn-system longhorn
---------------------------------------------------------------------

MSpencer87 avatar Sep 20 '22 16:09 MSpencer87