
Loki helm chart version upgrade from 5.44.4 to 6.5.0 issues in Single Binary deployment mode.

Open numa1985 opened this issue 1 year ago • 0 comments


We are using Azure Kubernetes Service with 1 system node in a system node pool and 3 user nodes in a user node pool for deploying Loki.

Kubernetes version: 1.29.2

  1. Affinity was working fine and all the pods were landing in the user node pool in 5.44.4. After upgrading to 6.5.0, setting affinity produces the error below.

Error

```
May 7th 2024 10:59:51 Error  coalesce.go:286: warning: cannot overwrite table with non table for loki.singleBinary.affinity (map[podAntiAffinity:map[requiredDuringSchedulingIgnoredDuringExecution:[map[labelSelector:map[matchLabels:map[app.kubernetes.io/component:single-binary]] topologyKey:kubernetes.io/hostname]]]])
May 7th 2024 10:59:51 Error  Error: UPGRADE FAILED: execution error at (loki/templates/validate.yaml:31:4): You have more than zero replicas configured for both the single binary and simple scalable targets. If this was intentional change the deploymentMode to the transitional 'SingleBinary<->SimpleScalable' mode
May 7th 2024 10:59:51 Error  Helm Upgrade returned non-zero exit code: 1. Deployment terminated.
May 7th 2024 10:59:51 Fatal  The remote script failed with exit code 1
```
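The two error messages suggest two separate values changes. This is only a sketch, based on reading the errors, not a confirmed fix: the coalesce warning indicates the 6.x chart's default `singleBinary.affinity` is a YAML map, so passing it as a multi-line string (`affinity: |`, as the 5.x chart accepted) can no longer be merged over the default; and the `validate.yaml` error asks for the scalable targets to have zero replicas (or for the transitional deployment mode the message names).

```yaml
# Sketch, not a confirmed fix: affinity as a structured map instead of a
# multi-line string, which appears to be what the 6.x chart expects.
singleBinary:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          preference:
            matchExpressions:
              - key: kubernetes.azure.com/mode
                operator: In
                values:
                  - user

# The validation error should clear once the simple scalable targets are
# explicitly zeroed (or deploymentMode is set to the transitional
# 'SingleBinary<->SimpleScalable' value the message suggests).
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
```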

```
ubuntu@NARU-Pr5530:~$ kubectl describe pod loki-chunks-cache-0 -n loki | tail -5
Type     Reason             Age    From                Message
Warning  FailedScheduling   2m43s  default-scheduler   0/4 nodes are available: 1 Insufficient memory, 4 Insufficient cpu. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.
Warning  FailedScheduling   2m42s  default-scheduler   0/4 nodes are available: 1 Insufficient memory, 4 Insufficient cpu. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.
Normal   NotTriggerScaleUp  2m40s  cluster-autoscaler  pod didn't trigger scale-up: 1 max node group size reached
```

If we do not use affinity in version 6.5.0, a few pods land on the system node and fail with resource issues, and we don't want pods to land on the system node anyway. Is there any way to fix this?
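One thing that might help keep the auxiliary pods off the system pool: the cache, canary, and gateway components take their own scheduling settings in values.yaml, separate from `singleBinary`. A sketch using the AKS `kubernetes.azure.com/mode=user` node label; the exact keys below (`chunksCache.nodeSelector`, `resultsCache.nodeSelector`, `lokiCanary.nodeSelector`, `gateway.nodeSelector`) are assumptions and should be verified against the 6.5.0 chart's values.yaml:

```yaml
# Hypothetical sketch -- verify these keys exist in your chart version.
chunksCache:
  nodeSelector:
    kubernetes.azure.com/mode: user
resultsCache:
  nodeSelector:
    kubernetes.azure.com/mode: user
lokiCanary:
  nodeSelector:
    kubernetes.azure.com/mode: user
gateway:
  nodeSelector:
    kubernetes.azure.com/mode: user
```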

values.yaml (used in 5.44.4)

```yaml
---
# https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml
loki:
  auth_enabled: false
  query_scheduler:
    max_outstanding_requests_per_tenant: 2048
  query_range:
    parallelise_shardable_queries: false
    split_queries_by_interval: 0
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem

singleBinary:
  replicas: 1
  persistence:
    size: 50Gi
    enableStatefulSetAutoDeletePVC: true
  affinity: |
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
              - key: kubernetes.azure.com/mode
                operator: In
                values:
                  - user
          weight: 50
```

values.yaml (used in 6.5.0)

```yaml
---
# https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml
deploymentMode: SingleBinary

loki:
  auth_enabled: false
  query_scheduler:
    max_outstanding_requests_per_tenant: 2048
  query_range:
    parallelise_shardable_queries: false
  limits_config:
    split_queries_by_interval: 0
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  schemaConfig:
    configs:
      - from: 2024-04-01
        object_store: filesystem
        store: tsdb
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_encoding: snappy
  tracing:
    enabled: true
  querier:
    max_concurrent: 1

backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0

singleBinary:
  replicas: 1
  persistence:
    size: 50Gi
    enableStatefulSetAutoDeletePVC: true
    enabled: true
  extraArgs:
    - -config.expand-env=true

chunksCache:
  allocatedMemory: 1024
  writebackSizeLimit: 10MB
```

  2. After upgrading to 6.5.0 the loki-0 pod goes into CrashLoopBackOff with the error below.

Error

```
ubuntu@NARU-Pr5530:~$ kubectl logs loki-0 -n loki
failed parsing config: /etc/loki/config/config.yaml: yaml: unmarshal errors:
  line 2: field Error not found in type loki.ConfigWrapper
```
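The unmarshal error says line 2 of the rendered /etc/loki/config/config.yaml contains a key named `Error`, which suggests something (perhaps the deployment pipeline's own log text) leaked into the generated config rather than a chart templating bug. One way to confirm is to inspect the rendered config; the ConfigMap name `loki` and key `config.yaml` assume the chart defaults and a release named `loki`, so adjust for your release:

```
# Render the chart locally with the same values and inspect the config
helm template loki grafana/loki --version 6.5.0 -f values.yaml | less

# Or read what is actually deployed in the cluster
kubectl get configmap loki -n loki -o jsonpath='{.data.config\.yaml}' | head -5
```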

```
ubuntu@NARU-Pr5530:~$ kubectl get pods -n loki -o wide
NAME                            READY   STATUS             RESTARTS      AGE   IP              NODE
loki-0                          0/1     CrashLoopBackOff   1 (11s ago)   74s   10.101.80.28    aks-npu2-21504394-vmss00000f
loki-canary-6qngw               1/1     Running            0             74s   10.101.80.158   aks-npsystem01-10976478-vmss000000
loki-canary-6v6bz               1/1     Running            0             75s   10.101.80.136   aks-npu2-21504394-vmss000000
loki-canary-krnqv               1/1     Running            0             75s   10.101.80.240   aks-npu2-21504394-vmss00000f
loki-canary-twcl5               1/1     Running            0             75s   10.101.80.213   aks-npu2-21504394-vmss00000h
loki-chunks-cache-0             0/2     Pending            0             74s
loki-gateway-668c5dff6c-l7hd5   1/1     Running            0             74s   10.101.80.173   aks-npsystem01-10976478-vmss000000
loki-results-cache-0            2/2     Running            0             74s   10.101.80.175   aks-npsystem01-10976478-vmss000000
```

Kindly do the needful.

Thanks Naresh

numa1985 avatar May 08 '24 08:05 numa1985