
CrashLoopBackOff with "recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"

sudip-moengage opened this issue 4 years ago · 32 comments

Describe the bug: The service runs for a day or two, then goes into CrashLoopBackOff. It fails with the following logs:

2021-01-27 18:54:42.026761 I | proto: duplicate proto type registered: ingester.Series
level=info ts=2021-01-27T18:54:42.030318392Z caller=main.go:130 msg="Starting Loki" version="(version=k41-fd619b6, branch=k41, revision=fd619b69)"
level=info ts=2021-01-27T18:54:42.030614885Z caller=server.go:229 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2021-01-27T18:54:42.03103619Z caller=table_manager.go:241 msg="loading table index_18652"
level=error ts=2021-01-27T18:54:42.031471165Z caller=log.go:150 msg="error running loki" err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference\nerror creating index client\ngithub.com/cortexproject/cortex/pkg/chunk/storage.NewStore\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/chunk/storage/factory.go:179\ngithub.com/grafana/loki/pkg/loki.(*Loki).initStore\n\t/src/loki/pkg/loki/modules.go:308\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:103\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:75\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:209\nmain.main\n\t/src/loki/cmd/loki/main.go:132\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374\nerror initialising module: store\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:105\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:75\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:209\nmain.main\n\t/src/loki/cmd/loki/main.go:132\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"

To Reproduce: Unknown; I see different errors every week. Each time I clean up the disk and do a fresh installation, Loki runs fine for a day or two. The config is the same as the one shared here: https://github.com/grafana/loki/issues/3219

Expected behavior: It should come up as usual. If I clean up the disk, it comes up fine.

Environment: Infrastructure: Kubernetes; Deployment tool: Helm. Am I doing something wrong in my configuration?

sudip-moengage avatar Jan 27 '21 19:01 sudip-moengage

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 20 '21 00:03 stale[bot]

I'm experiencing the same problem right now, on a system that had been running for a few weeks.

tricky42 avatar Apr 16 '21 17:04 tricky42

Same issue here. We are using Loki on Azure AKS. The storageClass is azurefile.

woeye avatar May 19 '21 06:05 woeye

Can you share the index file causing the error, please?
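
For anyone unsure how to extract it, something like this should work (the pod name, namespace, and index path are assumptions; check the index directory configured in your storage_config):

# copy the suspect index file out of the Loki pod for inspection
$ kubectl cp monitoring/loki-stack-0:/data/loki/index/index_18652 ./index_18652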

cyriltovena avatar May 31 '21 06:05 cyriltovena

Same issue here. Loki had been running on Ceph for 30+ days without crashes before this one. Redeployed and the error's gone. I don't know what the cause might be.

simeonexo avatar May 31 '21 12:05 simeonexo

Same over here...

voigtste avatar Jul 08 '21 11:07 voigtste

This should be fixed in the upcoming 2.3 release. Please let us know if it persists.

owen-d avatar Jul 08 '21 13:07 owen-d

Same here!

axelmarciano avatar Jul 25 '21 18:07 axelmarciano

Experienced the same issue when running out of disk space.

I tried to run with -log.level=debug in order to get the name of the exact corrupted file; unfortunately, it added no extra logs.
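
For reference, here is a sketch of how the flag can be passed, either directly on the binary or through Helm (the extraArgs key is an assumption about the chart; check your chart's values.yaml):

# directly on the container args
$ loki -config.file=/etc/loki/loki.yaml -log.level=debug

# or via the loki-stack Helm chart (extraArgs key name is an assumption)
$ helm upgrade loki-stack grafana/loki-stack -n monitoring --set 'loki.extraArgs.log\.level=debug'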

@owen-d Could you please provide more details, such as when 2.3 should be out, or which PR/commit fixes it?

dvdlevanon avatar Aug 01 '21 08:08 dvdlevanon

Ran into the same issue after running out of disk space. I resized the volume, removed and re-deployed the Helm install, and it's still crashing. I attached a busybox container to the volume and validated that it had additional disk space. Mine fails with the following log lines; is there a file I can remove to restart the process? Or should I attach an additional volume to have tiered storage?

level=info ts=2021-08-05T18:44:20.92045892Z caller=table_manager.go:241 msg="loading table index_18831"
level=error ts=2021-08-05T18:44:20.921204424Z caller=log.go:106 msg="error running loki" err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference\nerror creating index client\ngithub.com/cortexproject/cortex/pkg/chunk/storage.NewStore\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/chunk/storage/factory.go:179\ngithub.com/grafana/loki/pkg/loki.(*Loki).initStore\n\t/src/loki/pkg/loki/modules.go:319\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:103\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:75\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:220\nmain.main\n\t/src/loki/cmd/loki/main.go:132\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374\nerror initialising module: store\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:105\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:75\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:220\nmain.main\n\t/src/loki/cmd/loki/main.go:132\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"

dpiessens avatar Aug 05 '21 18:08 dpiessens

@dpiessens This is not related to disk size IMO

sudip-moengage avatar Aug 17 '21 22:08 sudip-moengage

Hi @owen-d, I was getting the same errors when I resized the volume and restarted the pod in AKS (Loki version 2.2.0). Then I upgraded to 2.3.0, but now new errors appeared (see attached loki_error.log).

singh-navpreet avatar Sep 29 '21 15:09 singh-navpreet

Any update here? I have the same issue with the same setup: it worked for some time and is crashing now. I raised an incident https://github.com/grafana/helm-charts/issues/1730 for the same. Has anyone been able to get past this issue?

saigha01 avatar Aug 26 '22 09:08 saigha01

I've been getting the same issue for a few days now using grafana/loki:2.6.1

Here is my config:

auth_enabled: false
chunk_store_config:
  max_look_back_period: 0s
compactor:
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  retention_enabled: true
  shared_store: filesystem
  working_directory: /data/loki/boltdb-shipper-compactor
ingester:
  chunk_block_size: 262144
  chunk_idle_period: 3m
  chunk_retain_period: 1m
  lifecycler:
    ring:
      replication_factor: 1
  max_transfer_retries: 0
  wal:
    dir: /data/loki/wal
limits_config:
  enforce_metric_name: false
  max_entries_limit_per_query: 5000
  per_stream_rate_limit: 8M
  per_stream_rate_limit_burst: 16M
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 168h
memberlist:
  join_members:
  - 'loki-stack-memberlist'
schema_config:
  configs:
  - from: "2020-10-24"
    index:
      period: 24h
      prefix: index_
    object_store: filesystem
    schema: v11
    store: boltdb-shipper
server:
  grpc_listen_port: 9095
  http_listen_port: 3100
storage_config:
  boltdb_shipper:
    active_index_directory: /data/loki/boltdb-shipper-active
    cache_location: /data/loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: filesystem
  filesystem:
    directory: /data/loki/chunks
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

Here is the stack trace:

level=error ts=2022-08-30T07:00:51.123943705Z caller=index_set.go:265 table-name=index_19233 msg="sync failed, retrying it" err=timeout
level=error ts=2022-08-30T07:00:51.124066597Z caller=index_set.go:104 table-name=index_19233 msg="failed to initialize table index_19233, cleaning it up" err=timeout
timeout
error creating index client
github.com/grafana/loki/pkg/storage.(*store).storeForPeriod
	/src/loki/pkg/storage/store.go:245
github.com/grafana/loki/pkg/storage.(*store).init
	/src/loki/pkg/storage/store.go:162
github.com/grafana/loki/pkg/storage.NewStore
	/src/loki/pkg/storage/store.go:145
github.com/grafana/loki/pkg/loki.(*Loki).initStore
	/src/loki/pkg/loki/modules.go:522
github.com/grafana/dskit/modules.(*Manager).initModule
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:120
github.com/grafana/dskit/modules.(*Manager).InitModuleServices
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:92
github.com/grafana/loki/pkg/loki.(*Loki).Run
	/src/loki/pkg/loki/loki.go:341
main.main
	/src/loki/cmd/loki/main.go:105
runtime.main
	/usr/local/go/src/runtime/proc.go:255
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1581
error initialising module: store
github.com/grafana/dskit/modules.(*Manager).initModule
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:122
github.com/grafana/dskit/modules.(*Manager).InitModuleServices
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:92
github.com/grafana/loki/pkg/loki.(*Loki).Run
	/src/loki/pkg/loki/loki.go:341
main.main
	/src/loki/cmd/loki/main.go:105
runtime.main
	/usr/local/go/src/runtime/proc.go:255
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1581
level=error ts=2022-08-30T07:00:51.124647905Z caller=log.go:103 msg="error running loki" err="timeout\nerror creating index client\ngithub.com/grafana/loki/pkg/storage.(*store).storeForPeriod\n\t/src/loki/pkg/storage/store.go:245\ngithub.com/grafana/loki/pkg/storage.(*store).init\n\t/src/loki/pkg/storage/store.go:162\ngithub.com/grafana/loki/pkg/storage.NewStore\n\t/src/loki/pkg/storage/store.go:145\ngithub.com/grafana/loki/pkg/loki.(*Loki).initStore\n\t/src/loki/pkg/loki/modules.go:522\ngithub.com/grafana/dskit/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:120\ngithub.com/grafana/dskit/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:92\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:341\nmain.main\n\t/src/loki/cmd/loki/main.go:105\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581\nerror initialising module: store\ngithub.com/grafana/dskit/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:122\ngithub.com/grafana/dskit/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:92\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:341\nmain.main\n\t/src/loki/cmd/loki/main.go:105\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"

jocelynthode avatar Aug 30 '22 07:08 jocelynthode

I am getting the same issue using grafana/loki-stack:2.8.2

here is the log error:

[screenshot of the error log]

claratang98 avatar Sep 09 '22 05:09 claratang98

I am getting the same issue

level=error ts=2022-09-15T03:54:52.007356924Z caller=index_set.go:265 table-name=index_19249 msg="sync failed, retrying it" err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"
level=error ts=2022-09-15T03:54:52.007405845Z caller=index_set.go:104 table-name=index_19249 msg="failed to initialize table index_19249, cleaning it up" err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"
recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference
error creating index client

vasyakrg avatar Sep 15 '22 03:09 vasyakrg

Hi, any progress on this? The same error here with chart loki-simple-scalable-1.8.11 :(

xzycn avatar Sep 29 '22 07:09 xzycn

Same issue here on loki chart loki-3.3.1

darox avatar Nov 08 '22 08:11 darox

Same issue here on Loki 2.4.2. Any workaround?

davidmateos avatar Dec 21 '22 16:12 davidmateos

Loki's pod crashes after a while.

$ kubectl version -o yaml

clientVersion:
  buildDate: "2022-05-03T13:46:05Z"
  compiler: gc
  gitCommit: 4ce5a8954017644c5420bae81d72b09b735c21f0
  gitTreeState: clean
  gitVersion: v1.24.0
  goVersion: go1.18.1
  major: "1"
  minor: "24"
  platform: linux/amd64
kustomizeVersion: v4.5.4

serverVersion:
  buildDate: "2022-08-17T18:47:37Z"
  compiler: gc
  gitCommit: 95ee5ab382d64cfe6c28967f36b53970b8374491
  gitTreeState: clean
  gitVersion: v1.24.4
  goVersion: go1.18.5
  major: "1"
  minor: "24"
  platform: linux/amd64

$ helm version

version.BuildInfo{Version:"v3.10.1", GitCommit:"9f88ccb6aee40b9a0535fcc7efea6055e1ef72c9", GitTreeState:"clean", GoVersion:"go1.18.7"}

I installed the Helm chart following the documentation at https://grafana.com/docs/loki/latest/installation/helm/install-scalable/ and https://github.com/grafana/helm-charts

$ helm repo add grafana https://grafana.github.io/helm-charts
$ helm repo update

changed values

loki:
  persistence:
    enabled: true
    storageClassName: nfs-client
    size: 30Gi
promtail:
  tolerations:
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
    operator: Exists
  - effect: NoSchedule
    key: node-role.kubernetes.io/controlplane
    operator: Exists

installation

$ helm install loki-stack grafana/loki-stack -n monitoring -f loki-values.yaml

status

$ helm ls 

NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
loki-stack              monitoring      2               2023-01-10 10:59:07.569498213 +0500 +05 deployed        loki-stack-2.8.9                v2.6.1

pod is working

$ k get pod

NAME                                                     READY   STATUS    RESTARTS        AGE
loki-stack-0                                             1/1     Running   0               59m
loki-stack-promtail-68jnv                                1/1     Running   0               55m
loki-stack-promtail-97c72                                1/1     Running   0               55m
loki-stack-promtail-kp9g8                                1/1     Running   0               67m
loki-stack-promtail-krzv6                                1/1     Running   0               67m
loki-stack-promtail-xzrfg                                1/1     Running   0               67m
$ k get pvc

NAME                                                                                                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
storage-loki-stack-0                                                                                     Bound    pvc-a9daec92-b541-47a9-b59f-72996377ab02   30Gi       RWO            nfs-client     59m
$ k exec -it loki-stack-0 -- sh
$ df -h

Filesystem                Size      Used Available Use% Mounted on
overlay                  97.3G     17.1G     76.1G  18% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                     5.8G         0      5.8G   0% /sys/fs/cgroup
/dev/mapper/ubuntu--vg-ubuntu--lv
                         97.3G     17.1G     76.1G  18% /tmp
172.30.44.157:/mnt/nfs_k8s_dynamic/monitoring-storage-loki-stack-0-pvc-a9daec92-b541-47a9-b59f-72996377ab02
                        195.8G      5.1G    182.6G   3% /data
tmpfs                    11.6G      4.0K     11.6G   0% /etc/loki
/dev/mapper/ubuntu--vg-ubuntu--lv
                         97.3G     17.1G     76.1G  18% /dev/termination-log
/dev/mapper/ubuntu--vg-ubuntu--lv
                         97.3G     17.1G     76.1G  18% /etc/resolv.conf
/dev/mapper/ubuntu--vg-ubuntu--lv
                         97.3G     17.1G     76.1G  18% /etc/hostname
/dev/mapper/ubuntu--vg-ubuntu--lv
                         97.3G     17.1G     76.1G  18% /etc/hosts
shm                      64.0M         0     64.0M   0% /dev/shm
tmpfs                    11.6G     12.0K     11.6G   0% /run/secrets/kubernetes.io/serviceaccount
tmpfs                     5.8G         0      5.8G   0% /proc/acpi
tmpfs                    64.0M         0     64.0M   0% /proc/kcore
tmpfs                    64.0M         0     64.0M   0% /proc/keys
tmpfs                    64.0M         0     64.0M   0% /proc/timer_list
tmpfs                    64.0M         0     64.0M   0% /proc/sched_debug
tmpfs                     5.8G         0      5.8G   0% /proc/scsi
tmpfs                     5.8G         0      5.8G   0% /sys/firmware

after a while

$ k describe pod loki-stack-0

Name:         loki-stack-0
Namespace:    monitoring
Priority:     0
Node:         ds-worker1/172.30.44.165
Start Time:   Tue, 10 Jan 2023 14:51:12 +0500
Labels:       app=loki
              controller-revision-hash=loki-stack-65cb46884c
              name=loki-stack
              release=loki-stack
              statefulset.kubernetes.io/pod-name=loki-stack-0
Annotations:  checksum/config: b7426dc30bbac01e54c43af7d1e183312a6d75e0203d448892a9ff0a3cd46f6e
              kubectl.kubernetes.io/restartedAt: 2023-01-10T14:51:09+05:00
              prometheus.io/port: http-metrics
              prometheus.io/scrape: true
Status:       Running
IP:           10.42.3.237
IPs:
  IP:           10.42.3.237
Controlled By:  StatefulSet/loki-stack
Containers:
  loki:
    Container ID:  docker://c24efa4c9b21ba76cb26151da17dd7a7d48ee3e508452b97b5f2847b5c4e0dd7
    Image:         grafana/loki:2.6.1
    Image ID:      docker-pullable://grafana/loki@sha256:1ee60f980950b00e505bd564b40f720132a0653b110e993043bb5940673d060a
    Ports:         3100/TCP, 9095/TCP, 7946/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      -config.file=/etc/loki/loki.yaml
    State:          Running
      Started:      Tue, 10 Jan 2023 16:12:23 +0500
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 10 Jan 2023 16:11:23 +0500
      Finished:     Tue, 10 Jan 2023 16:12:22 +0500
    Ready:          True
    Restart Count:  2
    Liveness:       http-get http://:http-metrics/ready delay=45s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:http-metrics/ready delay=45s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /data from storage (rw)
      /etc/loki from config (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2xnh9 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  storage-loki-stack-0
    ReadOnly:   false
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  loki-stack
    Optional:    false
  kube-api-access-2xnh9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                 From     Message
  ----     ------     ----                ----     -------
  Warning  Unhealthy  43m                 kubelet  Readiness probe failed: Get "http://10.42.3.237:3100/ready": dial tcp 10.42.3.237:3100: connect: connection refused
  Warning  Unhealthy  43m                 kubelet  Liveness probe failed: Get "http://10.42.3.237:3100/ready": dial tcp 10.42.3.237:3100: connect: connection refused
  Normal   Killing    42m                 kubelet  Container loki failed liveness probe, will be restarted
  Normal   Started    42m (x3 over 123m)  kubelet  Started container loki
  Normal   Pulled     42m (x3 over 123m)  kubelet  Container image "grafana/loki:2.6.1" already present on machine
  Normal   Created    42m (x3 over 123m)  kubelet  Created container loki
  Warning  Unhealthy  41m (x7 over 122m)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  41m (x6 over 122m)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
$ k logs loki-stack-0

level=error ts=2023-01-10T11:52:45.171326574Z caller=table.go:152 table-name=index_19367 msg="failed to remove working directory /data/loki/boltdb-shipper-compactor/index_19367" err="unlinkat /data/loki/boltdb-shipper-compactor/index_19367/.nfs00000000000a088e00003192: device or resource busy"
level=error ts=2023-01-10T11:52:45.171437058Z caller=compactor.go:492 msg="failed to compact files" table=index_19367 err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"
level=error ts=2023-01-10T11:52:45.171593177Z caller=compactor.go:432 msg="failed to run compaction" err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"

S0b1t avatar Jan 10 '23 12:01 S0b1t

@S0b1t

$ k logs loki-stack-0

level=error ts=2023-01-10T11:52:45.171326574Z caller=table.go:152 table-name=index_19367 msg="failed to remove working directory /data/loki/boltdb-shipper-compactor/index_19367" err="unlinkat /data/loki/boltdb-shipper-compactor/index_19367/.nfs00000000000a088e00003192: device or resource busy"
level=error ts=2023-01-10T11:52:45.171437058Z caller=compactor.go:492 msg="failed to compact files" table=index_19367 err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"
level=error ts=2023-01-10T11:52:45.171593177Z caller=compactor.go:432 msg="failed to run compaction" err="recovered from panic opening boltdb file: runtime error: invalid memory address or nil pointer dereference"

This suggests that Loki is trying to delete a file which is still held open by Loki itself. These logs also suggest that Loki is trying to access the ".nfs000xxxx" files, which seems weird. Loki shouldn't access these files at any point, as they are just a "reminder" that the files need to be deleted once the process that opened them exits. So I would suggest that Loki skip any files whose names start with a dot, or at least with ".nfs", if finding the place where the file handle is being leaked isn't an easy option.

The .nfs0000xxxx files are created by the NFS client when it tries to delete a file which is opened by another process. Since Loki runs as a StatefulSet within k8s, it has one volume per pod, so the only process that could keep the file open is Loki itself.

It seems the files in Loki are not closed properly before the delete operation runs. We have been using NFS as storage for other services without any issues in the past: two years of running PostgreSQL, MongoDB, and RabbitMQ persistence on it without ever having corruption or data loss, until we tried Loki.
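
To check whether such stale handles are piling up, they can be listed directly on the volume (a sketch; the pod name, namespace, and data path are taken from an earlier comment in this thread):

# list leftover NFS "silly rename" files under Loki's data directory
$ kubectl exec -it loki-stack-0 -n monitoring -- find /data/loki -name '.nfs*' -type f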

My knowledge of Loki's structure is zero, as is my knowledge of Go, otherwise I'd dig deeper into the source and try to find the exact issue. But the error messages I received suggest that the error may lie in storage.go, in the storeForPeriod method, or somewhere in the log.go file.

level=error ts=2023-02-09T16:39:55.043613981Z caller=index_set.go:285 table-name=loki_index_19395 msg="sync failed, retrying it" err=timeout
level=error ts=2023-02-09T16:39:55.043650239Z caller=index_set.go:103 table-name=loki_index_19395 msg="failed to initialize table loki_index_19395, cleaning it up" err=timeout
level=error ts=2023-02-09T16:39:55.04378379Z caller=log.go:171 msg="error running loki" err="timeout\nerror creating index client\ngithub.com/grafana/loki/pkg/storage.(*store).storeForPeriod\n\t/src/loki/pkg/storage/store.go:270\ngithub.com/grafana/loki/pkg/storage.(*store).init\n\t/src/loki/pkg/storage/store.go:164\ngithub.com/grafana/loki/pkg/storage.NewStore\n\t/src/loki/pkg/storage/store.go:147\ngithub.com/grafana/loki/pkg/loki.(*Loki).initStore\n\t/src/loki/pkg/loki/modules.go:635\ngithub.com/grafana/dskit/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:120\ngithub.com/grafana/dskit/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:92\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:449\nmain.main\n\t/src/loki/cmd/loki/main.go:110\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nerror initialising module: store\ngithub.com/grafana/dskit/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:122\ngithub.com/grafana/dskit/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:92\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:449\nmain.main\n\t/src/loki/cmd/loki/main.go:110\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"
Could not flush logger <nil>

Resources on this topic:

  • https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/What_are_nfsXXXX_files_and_how_do_I_delete_them
  • https://serverfault.com/questions/201294/nfsxxxx-files-appearing-what-are-those

TsengSR avatar Feb 09 '23 16:02 TsengSR

@TsengSR Thanks for the answer! After reading these resources, I'm starting to understand the situation.

Resources on this topic:

  • https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/What_are_nfsXXXX_files_and_how_do_I_delete_them
  • https://serverfault.com/questions/201294/nfsxxxx-files-appearing-what-are-those

By the way, I have already set up the Loki stack another way: I installed Promtail using the Helm chart and set up Loki on a separate EC2 instance on the host.

I'm new to Kubernetes, so after running a few microservices I started having other errors, like pods not running due to insufficient resources.

After reading this ("Pod don't run, insufficient resources"), I figured something out: my NFS instance has 200 GB of storage, of which 190 GB was given to other microservices. Then I created a 30 GB PVC for the Loki stack. Perhaps that's why "device or resource busy" appears in the log, but this is just a guess.
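
To verify that guess, the requested PVC size can be compared against the actual free space on the NFS export (a sketch; the names are from my earlier comment):

# PVC size as requested from Kubernetes
$ kubectl get pvc storage-loki-stack-0 -n monitoring

# actual free space on the NFS export backing the volume
$ kubectl exec -it loki-stack-0 -n monitoring -- df -h /data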

I have another k8s cluster; I will try the PLG stack with NFS again.

S0b1t avatar Feb 12 '23 03:02 S0b1t

In our case it's not the disk space. Loki has been acting like this since I first tried to install it three months ago (we're at 60 GB out of 170 or so, with 10 Gi for Loki). In the first attempts we used Loki configured with filesystem storage, later against S3 (using MinIO). It persists properly in MinIO, but the PVs used for the write-ahead log (WAL) and cache are still on NFS and have the same issues with the .nfs000 files.

TsengSR avatar Feb 12 '23 11:02 TsengSR

I have moved to loki distributed and since then everything is working without any issue.

darox avatar Feb 13 '23 07:02 darox

@darox

I have moved to loki distributed and since then everything is working without any issue.

Mind posting the values YAML and Helm chart/version you used? I've tried the Bitnami Helm charts, the distributed chart (based on the 2.x Helm charts), and the new 3.x unified Helm chart. In all of them, the queriers/readers crash after some time.

TsengSR avatar Feb 13 '23 23:02 TsengSR

I experience the same as @TsengSR. I feel like the real issue is that Loki tries to read these .nfs00... files and obviously cannot process them.

jocelynthode avatar Feb 22 '23 20:02 jocelynthode

I am facing the same issue even today with Loki version 2.6.1, and no good workaround is available yet. After surfing through multiple GitHub issue pages, it seems that this is an issue with Loki version 2.x. Can anyone confirm this? And please share if you have found a good solution to the issue.

joyienjoy avatar May 06 '23 10:05 joyienjoy

it seems that this is an issue with Loki version 2.x. Can anyone confirm this?

I think so too

S0b1t avatar May 08 '23 10:05 S0b1t

it seems that this is an issue with Loki version 2.x. Can anyone confirm this?

I think so too

Do you have a better way to solve it now?

saitama-24 avatar Jun 02 '23 08:06 saitama-24

In my case, I deployed Loki with the Bitnami Helm chart.

I just deleted the cache in the Loki querier pods, at this location: /bitnami/grafana-loki/loki/cache

And it works fine.
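
As a sketch of that workaround (the pod name and namespace are placeholders; the cache path is taken from this comment):

# clear the query cache inside a querier pod, then restart it so it starts clean
$ kubectl exec -it <loki-querier-pod> -n <namespace> -- sh -c 'rm -rf /bitnami/grafana-loki/loki/cache/*'
$ kubectl delete pod <loki-querier-pod> -n <namespace>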

alexandreMegel avatar Jan 10 '24 09:01 alexandreMegel