helm-charts
[loki-stack] loki statefulSet with persistence enabled doesn't start right away due to not recognizing PVC is already bound to it
Environment
k8s cluster version: 1.22.6
helm version: 3.8.1
loki-stack chart version: 2.6.1
Cloud provider: Azure
Description
When deploying the loki-stack chart with loki.persistence.enabled = true and storageClass kubernetes.io/azure-disk, the following behavior is observed:
- Seemingly the PVC is created prior to the StatefulSet
Type Reason Age From Message
---- ------ ---- ---- -------
Normal WaitForFirstConsumer 19m persistentvolume-controller waiting for first consumer to be created before binding
Normal ExternalProvisioning 19m (x2 over 19m) persistentvolume-controller waiting for a volume to be created, either by external provisioner "disk.csi.azure.com" or manually created by system administrator
Normal Provisioning 19m disk.csi.azure.com_csi-azuredisk-controller-6bcf6bbf9c-lrz89_a2e6446d-133e-44fd-b15d-3f886f2812da External provisioner is provisioning volume for claim "loki/storage-loki-stack-0"
Normal ProvisioningSucceeded 19m disk.csi.azure.com_csi-azuredisk-controller-6bcf6bbf9c-lrz89_a2e6446d-133e-44fd-b15d-3f886f2812da Successfully provisioned volume pvc-1a45d6c9-ad33-4eac-b0d0-502a83fc3d4b
- The StatefulSet pod is scheduled and the volume is attached
- Mount errors prevent the pod from starting:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 19m default-scheduler Successfully assigned loki/loki-stack-0 to aks-system1-31394726-vmss000001
Normal SuccessfulAttachVolume 19m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-1a45d6c9-ad33-4eac-b0d0-502a83fc3d4b"
Warning FailedMount 17m kubelet MountVolume.MountDevice failed for volume "pvc-1a45d6c9-ad33-4eac-b0d0-502a83fc3d4b" : rpc error: code = Aborted desc = An operation with the given Volume ID /subscriptions/40a4dfee-81c0-4199-8302-3184f180a0b6/resourceGroups/mc_ex-ai-hub-cluster-eastus-dev-cluster-1_cluster-1_eastus/providers/Microsoft.Compute/disks/pvc-1a45d6c9-ad33-4eac-b0d0-502a83fc3d4b already exists
- The StatefulSet retries a few times and eventually recognizes the PVC:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 23m default-scheduler Successfully assigned loki/loki-stack-0 to aks-system1-31394726-vmss000001
Normal SuccessfulAttachVolume 23m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-1a45d6c9-ad33-4eac-b0d0-502a83fc3d4b"
Warning FailedMount 21m kubelet MountVolume.MountDevice failed for volume "pvc-1a45d6c9-ad33-4eac-b0d0-502a83fc3d4b" : rpc error: code = Aborted desc = An operation with the given Volume ID /subscriptions/40a4dfee-81c0-4199-8302-3184f180a0b6/resourceGroups/mc_ex-ai-hub-cluster-eastus-dev-cluster-1_cluster-1_eastus/providers/Microsoft.Compute/disks/pvc-1a45d6c9-ad33-4eac-b0d0-502a83fc3d4b already exists
Warning FailedMount 8m25s kubelet Unable to attach or mount volumes: unmounted volumes=[storage], unattached volumes=[storage kube-api-access-tz8kb config]: timed out waiting for the condition
Warning FailedMount 5m42s (x8 over 21m) kubelet MountVolume.MountDevice failed for volume "pvc-1a45d6c9-ad33-4eac-b0d0-502a83fc3d4b" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning FailedMount 3m53s (x8 over 21m) kubelet Unable to attach or mount volumes: unmounted volumes=[storage], unattached volumes=[config storage kube-api-access-tz8kb]: timed out waiting for the condition
Normal Pulling 2m32s kubelet Pulling image "grafana/loki:2.4.2"
Normal Pulled 2m30s kubelet Successfully pulled image "grafana/loki:2.4.2" in 1.64499166s
Normal Created 2m30s kubelet Created container loki
Normal Started 2m30s kubelet Started container loki
Warning Unhealthy 88s (x2 over 98s) kubelet Liveness probe failed: HTTP probe failed with statuscode: 503
Warning Unhealthy 88s (x2 over 98s) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503
Expected behavior
The StatefulSet starts up and recognizes the allocated volume claim without spending several retry cycles to reach the correct state.
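For reference, a rough sketch of the values used for this deployment (the storage class name and size here are assumptions, and the exact persistence keys may differ slightly between chart versions):

loki:
  persistence:
    enabled: true
    size: 10Gi
    # AKS storage class backed by the kubernetes.io/azure-disk provisioner; the name is an assumption
    storageClassName: default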
We've been having the same issue on multiple clusters. ~Same setup as aicball.
Same issue on all our EKS clusters. Are there any updates on this one?
I am facing the same issue in Azure AKS.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 39m default-scheduler Successfully assigned loki/loki-0 to aks-system--vmss000016
Warning FailedAttachVolume 39m attachdetach-controller Multi-Attach error for volume "pvc-aed4fcc2-31b5-40ac-92cc-" Volume is already exclusively attached to one node and can't be attached to another
Normal SuccessfulAttachVolume 38m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-aed4fcc2-31b5-40ac-92cc-*"
Warning FailedMount 32m kubelet Unable to attach or mount volumes: unmounted volumes=[storage], unattached volumes=[storage kube-api-access-kz622 tmp config]: timed out waiting for the condition
Warning FailedMount 16m (x2 over 30m) kubelet Unable to attach or mount volumes: unmounted volumes=[storage], unattached volumes=[kube-api-access-kz622 tmp config storage]: timed out waiting for the condition
Warning FailedMount 3m1s (x6 over 37m) kubelet Unable to attach or mount volumes: unmounted volumes=[storage], unattached volumes=[config storage kube-api-access-kz622 tmp]: timed out waiting for the condition
Warning FailedMount 44s (x8 over 34m) kubelet Unable to attach or mount volumes: unmounted volumes=[storage], unattached volumes=[tmp config storage kube-api-access-kz622]: timed out waiting for the condition
These AKS node kubelet logs also seem to be related:
E0602 12:37:30.978546 3703 driver-call.go:262] Failed to unmarshal output for command: init, output: "", error: unexpected end of JSON input
W0602 12:37:30.978577 3703 driver-call.go:149] FlexVolume: driver call failed: executable: /etc/kubernetes/volumeplugins/nodeagent~uds/uds, args: [init], error: fork/exec /etc/kubernetes/volumeplugins/nodeagent-uds/uds: no such file
E0602 12:37:30.978607 3703 plugins.go:750] "Error dynamically probing plugins" err="error creating Flexvolume plugin from directory nodeagent-uds, skipping. Error: unexpected end of JSON input"
I opened a Microsoft ticket. If I get some info from their side, I will post it here :)
First call with Microsoft didn't really bring a solution, but maybe someone can confirm this:
The deployed YAML of the persistent volume claim storage-loki-0 has a wrong value in volume.kubernetes.io/selected-node. In my case:
metadata:
  annotations:
    volume.kubernetes.io/selected-node: aks-system-34716058-vmss000000
and it should be
metadata:
  annotations:
    volume.kubernetes.io/selected-node: aks-system-34716058-vmss000018
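In case anyone wants to try correcting it in place, something like the following should overwrite the annotation (PVC name, namespace and node name are taken from my case above, adjust to yours; this is just a sketch and I haven't confirmed that it un-sticks the mount on its own):

kubectl -n loki annotate pvc storage-loki-0 \
  volume.kubernetes.io/selected-node=aks-system-34716058-vmss000018 \
  --overwrite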
Microsoft will call me when they have further info on this.
Update: Microsoft closed the issue and is pretty sure it's not a fault on their side. :/
Same here, running loki on a GKE cluster. Any workarounds on that?
@Mahagon Yes, on my cluster the PVC also had a reference to a no-longer-existing node.
We're seeing the same issue with our Helm upgrades on the Civo cloud provider as well: the storage volume isn't mounting for Loki. The issue has occurred from loki-stack 2.6.1 through 2.7.1, where it cannot remount the storage volume.
k get pvc -n loki storage-loki-stack-0 -o json | jq .metadata.annotations
{
"pv.kubernetes.io/bind-completed": "yes",
"pv.kubernetes.io/bound-by-controller": "yes",
"volume.beta.kubernetes.io/storage-provisioner": "csi.civo.com",
"volume.kubernetes.io/selected-node": "k3s-test01-c25a-82dc3e-node-pool-c419-ojax1"
}
One of the nodes was no longer available after scaling, so I manually updated selected-node to a current node, but it made no difference.
Also tried rolling back to the previous good revision, then upgrading again to 2.7.0, with no luck:
helm history -n loki loki-stack
REVISION UPDATED STATUS CHART APP VERSION DESCRIPTION
1 Wed Jun 8 03:46:24 2022 superseded loki-stack-2.6.4 v2.4.2 Install complete
2 Fri Jun 17 16:59:11 2022 superseded loki-stack-2.6.5 v2.4.2 Upgrade complete
3 Sat Jul 2 04:03:00 2022 superseded loki-stack-2.6.5 v2.4.2 Upgrade complete
4 Tue Aug 16 05:39:39 2022 failed loki-stack-2.6.8 v2.4.2 Upgrade "loki-stack" failed: timed out waiting for the condition
5 Tue Aug 16 08:39:40 2022 failed loki-stack-2.6.9 v2.4.2 Upgrade "loki-stack" failed: timed out waiting for the condition
6 Tue Aug 16 17:39:40 2022 failed loki-stack-2.7.0 v2.4.2 Upgrade "loki-stack" failed: client rate limiter Wait returned an error: context deadline exceeded
7 Fri Aug 19 13:11:20 2022 deployed loki-stack-2.6.5 v2.4.2 Rollback to 3
8 Fri Aug 19 17:13:03 2022 failed loki-stack-2.7.0 v2.4.2 Upgrade "loki-stack" failed: timed out waiting for the condition
Uninstalled the whole chart hoping it could then reattach to the PVC on 2.7.1, without success.
helm ls -n loki
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
loki-stack loki 1 2022-08-19 23:59:46.546508141 +0000 UTC failed loki-stack-2.7.1 v2.6.1
The resulting events are the same, so I'm not sure what to do besides deleting the PVC:
3m1s Normal ArtifactUpToDate helmchart/loki-loki-stack artifact up-to-date with remote revision: '2.7.1'
2m42s Warning FailedMount pod/loki-stack-0 Unable to attach or mount volumes: unmounted volumes=[storage], unattached volumes=[kube-api-access-v4r7s tmp config storage]: timed out waiting for the condition
84s Warning FailedAttachVolume pod/loki-stack-0 AttachVolume.Attach failed for volume "pvc-7aa60242-e4b9-49dd-82e2-edc0c4d21c3e" : rpc error: code = Unknown desc = DatabaseVolumeNotFoundError: Failed to find the volume within the internal database
28s Warning FailedMount pod/loki-stack-0 Unable to attach or mount volumes: unmounted volumes=[storage], unattached volumes=[config storage kube-api-access-v4r7s tmp]: timed out waiting for the condition
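If deleting the PVC ends up being the only way forward, one thing I'm considering first (just a sketch, not verified on Civo) is marking the backing PV as Retain, so the disk and its data survive the PVC deletion and can be re-bound later; the PV name is the one from the events above:

kubectl patch pv pvc-7aa60242-e4b9-49dd-82e2-edc0c4d21c3e \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'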
Any update on this ticket? I have the same issue on a GKE cluster. Details of my issue:
k8s cluster version: 1.21.12-gke.2200
loki statefulset image: grafana/loki:2.3.0
loki-stack chart version: 2.5.0
Cloud provider: GCP
Pod describe output:
Unable to attach or mount volumes: unmounted volumes=[storage], unattached volumes=[kube-api-access-cjwvv config storage]: timed out waiting for the condition
same issue here
I have been having this problem for months; every time I need to upgrade, I end up uninstalling and reinstalling, losing the existing logs. Is there a solution at all?
@vitobotta In my case it would eventually resolve itself. I have simply switched to installing the charts separately, straight from Grafana's repo (loki, grafana).
Sometimes I managed to fix it by scaling the StatefulSet to zero, deleting the VolumeAttachment if still there, and scaling up again, but it doesn't always work.
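Roughly the steps I use, assuming the release is named loki-stack in the loki namespace (names and placeholders are from my setup, adjust as needed):

# scale the StatefulSet down so the pod releases the volume
kubectl -n loki scale statefulset loki-stack --replicas=0
# find any stale VolumeAttachment still referencing the PV and delete it
kubectl get volumeattachments | grep <pv-name>
kubectl delete volumeattachment <volumeattachment-name>
# scale back up so the volume gets re-attached to the new node
kubectl -n loki scale statefulset loki-stack --replicas=1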
We are currently changing our setting for fsGroupChangePolicy to OnRootMismatch, because we believe this might be why we face the issue.
We think the volume gets mounted and, due to the large number of files, Kubernetes recursively changes all permissions, which takes ages and sometimes / too often leads to timeouts.
https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods
As no other pod uses that filesystem, Loki controls it, and we are not changing the fsGroup regularly at all, changing the policy shouldn't be a problem.
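For reference, the setting we are adding looks roughly like this in our values (a sketch; it assumes the loki subchart passes its securityContext map straight through to the pod spec, and the key layout may differ between chart versions):

loki:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    runAsGroup: 10001
    fsGroup: 10001
    # skip the recursive chown/chmod when the volume root already has the right ownership
    fsGroupChangePolicy: "OnRootMismatch"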
@damyan We are also running on GKE; in case this works or you find something, please give me a ping!
Same problem for me..