Longhorn 100.1.2+up1.2.4 stuck uninstalling in RKE1 Windows cluster
Rancher Server Setup
- Rancher version: v2.6.4-rc11
- Installation option (Docker install/Helm Chart): n/a
- If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
- Proxy/Cert Details:
Information about the Cluster
- Kubernetes version: v1.22.7
- Cluster Type (Local/Downstream): Downstream
- If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): Windows
Describe the bug
To Reproduce
- Install Longhorn 100.1.2+up1.2.4
- Uninstall Longhorn
- Check longhorn resources, specifically pods related to the longhorn-uninstall job
Result
The longhorn chart gets stuck uninstalling, seemingly because resources are being scheduled on Windows nodes, which may not be supported.

It looks like installation in RKE2 Windows (v2.6.4-rc11, k8s 1.22.7) is failing too.

Helm logs
helm upgrade --install=true --namespace=longhorn-system --timeout=10m0s --values=/home/shell/helm/values-longhorn-crd-100.1.2-up1.2.4.yaml --version=100.1.2+up1.2.4 --wait=true longhorn-crd /home/shell/helm/longhorn-crd-100.1.2-up1.2.4.tgz
Release "longhorn-crd" does not exist. Installing it now.
creating 15 resource(s)
beginning wait for 15 resources with timeout of 10m0s
NAME: longhorn-crd
LAST DEPLOYED: Thu Mar 24 17:16:38 2022
NAMESPACE: longhorn-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
---------------------------------------------------------------------
SUCCESS: helm upgrade --install=true --namespace=longhorn-system --timeout=10m0s --values=/home/shell/helm/values-longhorn-crd-100.1.2-up1.2.4.yaml --version=100.1.2+up1.2.4 --wait=true longhorn-crd /home/shell/helm/longhorn-crd-100.1.2-up1.2.4.tgz
---------------------------------------------------------------------
helm upgrade --install=true --namespace=longhorn-system --timeout=10m0s --values=/home/shell/helm/values-longhorn-100.1.2-up1.2.4.yaml --version=100.1.2+up1.2.4 --wait=true longhorn /home/shell/helm/longhorn-100.1.2-up1.2.4.tgz
Release "longhorn" does not exist. Installing it now.
W0324 17:16:44.294929 47 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
creating 18 resource(s)
W0324 17:16:44.397770 47 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
beginning wait for 18 resources with timeout of 10m0s
Deployment is not ready: longhorn-system/longhorn-driver-deployer. 0 out of 1 expected pods are ready
Deployment is not ready: longhorn-system/longhorn-driver-deployer. 0 out of 1 expected pods are ready
Deployment is not ready: longhorn-system/longhorn-driver-deployer. 0 out of 1 expected pods are ready
Deployment is not ready: longhorn-system/longhorn-driver-deployer. 0 out of 1 expected pods are ready
...
Deployment is not ready: longhorn-system/longhorn-driver-deployer. 0 out of 1 expected pods are ready
Error: timed out waiting for the condition
Logs in longhorn-driver-deployer pod:
(combined from similar events): MountVolume.SetUp failed for volume "kube-api-access-7hrzc" : chown c:\var\lib\kubelet\pods\c9967814-78c9-4b41-b79b-05bd58028bf2\volumes\kubernetes.io~projected\kube-api-access-7hrzc\..2022_03_24_17_47_26.039614073\token: not supported by windows
cc @PhanLe1010 We allow users to install in a hybrid cluster, but users need to use a node selector to keep Longhorn components off the Windows nodes.
https://longhorn.io/docs/1.2.4/advanced-resources/deploy/rancher_windows_cluster/
Yeah, please check if you have set the node selector and taint toleration as mentioned in the doc provided by @innobead
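For reference, a minimal sketch of what such a manual override could look like in the chart values. The key names (longhornManager, longhornDriver, longhornUI, defaultSettings.taintToleration, defaultSettings.systemManagedComponentsNodeSelector) and the cattle.io/os=linux:NoSchedule taint are assumptions based on the upstream Longhorn chart and the linked doc; verify them against the values.yaml of the chart version you are installing.

```yaml
# Sketch only: key names and the taint are assumptions, check the chart's values.yaml.
# Keep user-deployed components (Manager, Driver, UI) on Linux nodes.
longhornManager:
  nodeSelector:
    kubernetes.io/os: linux
  tolerations:
    - key: cattle.io/os
      operator: Equal
      value: linux
      effect: NoSchedule
longhornDriver:
  nodeSelector:
    kubernetes.io/os: linux
  tolerations:
    - key: cattle.io/os
      operator: Equal
      value: linux
      effect: NoSchedule
longhornUI:
  nodeSelector:
    kubernetes.io/os: linux
  tolerations:
    - key: cattle.io/os
      operator: Equal
      value: linux
      effect: NoSchedule
# Components that Longhorn creates at runtime take these as settings strings.
defaultSettings:
  taintToleration: cattle.io/os=linux:NoSchedule
  systemManagedComponentsNodeSelector: kubernetes.io/os:linux
```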
This was found in a test case from this issue for allowing installation of charts in hybrid clusters without the need for setting tolerations/node-selectors manually. Sounds like this is not a regression; Longhorn is just missing the right tolerations/node-selectors to account for the new behavior.
That probably means we should find out what we missed, so we can add it to the next chart release and benefit from https://github.com/rancher/dashboard/issues/5137
cc @meldafrawi
This should be re-tested following Team 3's additional annotations design, which should block the chart from deploying on Windows nodes.
@ronhorton, based on the last comment on this ticket, we can test this now that our annotations have been merged.
@nickwsuse @nwilliams22 when working this ticket, reach out to brandon depesa in team-rancher-qa-team3 re: provisioning windows clusters
@MKlimuszka @ronhorton @sirredbeard This issue has not been fixed. The chart needs node-selectors, tolerations, and Windows annotations to support hybrid clusters the same way we do in our other charts, instead of expecting the user to add them manually.
Example:
Functions we use to apply node-selectors and tolerations in all charts that do not support Windows nodes: https://github.com/rancher/charts/blob/96cee30ac4ecb1741b41f8d30875f196a04c9ea1/charts/rancher-cis-benchmark/2.0.4/templates/_helpers.tpl#L14-L27
Annotation required in the chart to enable the behavior: https://github.com/rancher/charts/blob/96cee30ac4ecb1741b41f8d30875f196a04c9ea1/charts/rancher-cis-benchmark/2.0.4/Chart.yaml#L8
If the chart does have Windows components that it can deploy onto Windows nodes, then a few additional things are required. See rancher-monitoring for reference on this use case.
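To illustrate the end result, here is a rough sketch of the pod template fields a Linux-only workload would carry once such helpers are applied. This fragment is illustrative and not copied from the Longhorn or rancher-cis-benchmark charts; the cattle.io/os=linux:NoSchedule taint is assumed to be the one Rancher places on Linux nodes in Windows clusters.

```yaml
# Illustrative fragment of a rendered Deployment/DaemonSet (not taken from any chart).
spec:
  template:
    spec:
      # Schedule only onto Linux nodes.
      nodeSelector:
        kubernetes.io/os: linux
      # Tolerate the taint assumed to be set on Linux nodes in hybrid clusters.
      tolerations:
        - key: cattle.io/os
          operator: Equal
          value: linux
          effect: NoSchedule
```

With both fields set in the chart templates, the user would not need to supply them by hand at install time.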
@innobead I'm assigning this to you per @yasker
Let's fix this in the upcoming 1.3.1 and 1.2.5.
- Add permit-os, as @PennyScissors mentioned
- Add the default toleration and node selector (@PennyScissors mentioned) for user-deployed components (Manager, Driver, UI); this part can be handled in the chart manifests
- Need to update the code to introduce this implicit default toleration and node selector at runtime
This has been improved in upcoming Longhorn 1.4.0, 1.3.1, and 1.2.5. We will update the corresponding Rancher chart for 1.3.1 to resolve this issue, because 1.3.1 is the first of these releases to ship.
cc @rebeccazzzz
@innobead Since the Rancher chart for 1.3.1 has been released already - can this now be moved to-test for v2.6.9?
Yes, please.
Closing as verified
This was tested on an RKE2 mixed Windows cluster using Longhorn v1.3.1
Steps for validation on RKE2 Windows Cluster
- Deploy the Longhorn prerequisites into the cluster
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: longhorn-iscsi-installation
  labels:
    app: longhorn-iscsi-installation
  annotations:
    command: &cmd OS=$(grep -E "^ID_LIKE=" /etc/os-release | cut -d '=' -f 2); if [[ -z "${OS}" ]]; then OS=$(grep -E "^ID=" /etc/os-release | cut -d '=' -f 2); fi; if [[ "${OS}" == *"debian"* ]]; then sudo apt-get update -q -y && sudo apt-get install -q -y open-iscsi && sudo systemctl -q enable iscsid && sudo systemctl start iscsid && sudo modprobe iscsi_tcp; elif [[ "${OS}" == *"suse"* ]]; then sudo zypper --gpg-auto-import-keys -q refresh && sudo zypper --gpg-auto-import-keys -q install -y open-iscsi && sudo systemctl -q enable iscsid && sudo systemctl start iscsid && sudo modprobe iscsi_tcp; else sudo yum makecache -q -y && sudo yum --setopt=tsflags=noscripts install -q -y iscsi-initiator-utils && echo "InitiatorName=$(/sbin/iscsi-iname)" > /etc/iscsi/initiatorname.iscsi && sudo systemctl -q enable iscsid && sudo systemctl start iscsid && sudo modprobe iscsi_tcp; fi && if [ $? -eq 0 ]; then echo "iscsi install successfully"; else echo "iscsi install failed error code $?"; fi
spec:
  selector:
    matchLabels:
      app: longhorn-iscsi-installation
  template:
    metadata:
      labels:
        app: longhorn-iscsi-installation
    spec:
      hostNetwork: true
      hostPID: true
      initContainers:
      - name: iscsi-installation
        command:
          - nsenter
          - --mount=/proc/1/ns/mnt
          - --
          - bash
          - -c
          - *cmd
        image: alpine:3.12
        securityContext:
          privileged: true
      containers:
      - name: sleep
        image: k8s.gcr.io/pause:3.1
  updateStrategy:
    type: RollingUpdate
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: longhorn-nfs-installation
  labels:
    app: longhorn-nfs-installation
  annotations:
    command: &cmd OS=$(grep -E "^ID_LIKE=" /etc/os-release | cut -d '=' -f 2); if [[ -z "${OS}" ]]; then OS=$(grep -E "^ID=" /etc/os-release | cut -d '=' -f 2); fi; if [[ "${OS}" == *"debian"* ]]; then sudo apt-get update -q -y && sudo apt-get install -q -y nfs-common && sudo modprobe nfs; elif [[ "${OS}" == *"suse"* ]]; then sudo zypper --gpg-auto-import-keys -q refresh && sudo zypper --gpg-auto-import-keys -q install -y nfs-client && sudo modprobe nfs; else sudo yum makecache -q -y && sudo yum --setopt=tsflags=noscripts install -q -y nfs-utils && sudo modprobe nfs; fi && if [ $? -eq 0 ]; then echo "nfs install successfully"; else echo "nfs install failed error code $?"; fi
spec:
  selector:
    matchLabels:
      app: longhorn-nfs-installation
  template:
    metadata:
      labels:
        app: longhorn-nfs-installation
    spec:
      hostNetwork: true
      hostPID: true
      initContainers:
      - name: nfs-installation
        command:
          - nsenter
          - --mount=/proc/1/ns/mnt
          - --
          - bash
          - -c
          - *cmd
        image: alpine:3.12
        securityContext:
          privileged: true
      containers:
      - name: sleep
        image: k8s.gcr.io/pause:3.1
  updateStrategy:
    type: RollingUpdate
- Install the Longhorn v1.3.1 chart into the cluster. In Edit YAML, set windowsCluster.enabled to true:
global:
  cattle:
    systemDefaultRegistry: ""
    windowsCluster:
      # Enable this to allow Longhorn to run on the Rancher deployed Windows cluster
      enabled: true
- Verify the Longhorn UI is functional and properly displayed
- Uninstall the Longhorn chart from the UI by navigating to Installed Apps, first removing the longhorn chart and then the longhorn-crd chart
- The UI terminal is displayed for each chart, showing the helm uninstall command, and completes with a success message
---------------------------------------------------------------------
SUCCESS: helm uninstall --namespace=longhorn-system longhorn
---------------------------------------------------------------------