WAL corruption leads to endless restarts

Open · hervenicol opened this issue 1 year ago • 35 comments

Describe the bug

loki-write pod is dying with this log:

msg="error running loki" err="corruption in segment /var/loki/tsdb-shipper-active/wal/s3_2024-01-02/1712203235/00000004 at 65536: last record is torn\nerror recovering from TSDB WAL"

and restarts indefinitely (crashlooping).

But at each restart it replays the WAL and re-uploads to object storage. With a big WAL this can cost a lot, because all the data is sent to object storage again and again.

To Reproduce

Steps to reproduce the behavior:

  1. Running Loki 2.9.6
  2. It happens once in a while on clusters that are a bit undersized and where pods tend to die OOM.
  3. This does not happen consistently.

Expected behavior

I can understand the WAL can get corrupted when the app unexpectedly crashes. But maybe when the WAL is corrupted it should discard it? So after it crashes once it can start properly, and not retry endlessly?

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: helm 5.47.2

hervenicol avatar Apr 11 '24 20:04 hervenicol

But maybe when the WAL is corrupted it should discard it?

tbh I prefer the current behavior, as it at least gives you the chance of doing a backup of your corrupted WAL etc.

DylanGuedes avatar Apr 11 '24 20:04 DylanGuedes

The corruption and restart loops are rough; I don't think I like this behavior. It can make it hard to recover from an outage caused by being overwhelmed with data, when multiple ingesters may be in this situation.

And in particular here, the corruption comes from the last segment, which failed because it hit the disk limit, so if we just truncated the WAL at whatever we could read, that's really the best you can do anyway.

I did look at the code paths here a little while ago and it wasn't a quick fix to change the behavior, but I do think I'd prefer that we handle this automatically.

Interestingly, we never see this ourselves. After some discussions I've had with folks that do, I think the difference is that we run 100GB PVCs and typically send at most about 15MB/s to an ingester, so it becomes pretty hard for us to run out of WAL disk before running out of memory.

slim-bean avatar Apr 11 '24 22:04 slim-bean

Is there something to do in case this corruption happens? I hit this issue (error running loki" err="corruption in segment /var/loki/tsdb-index/wal) and the quick solution was to recreate the PVC empty (and lose some logs, of course).

Thanks

jcdauchy-moodys avatar May 21 '24 06:05 jcdauchy-moodys

Getting the same on prod environment, any updates on this?

Farisagianda avatar Jun 20 '24 21:06 Farisagianda

Same here

DanielCastronovo avatar Jun 21 '24 09:06 DanielCastronovo

yep, same for us too.

t4ov avatar Jul 04 '24 15:07 t4ov

Just had this as well, interestingly enough it was on a month-old segment that was about to be deleted by retention:

msg="error running loki" err="corruption in segment /var/loki/tsdb-shipper-active/wal/gcs_2024-06-01/1719931836/00000000 at 8255: unexpected checksum a7f03ba2

Or maybe it got corrupted on that PVC earlier on, as it scales up and down. I will check if that repeats. Either way, suggestions on how to handle it other than deleting the PVC would be appreciated :pray:

LukoJy3D avatar Jul 08 '24 19:07 LukoJy3D

Similar problem in Loki 3.1 on AKS disks (attachment: code-stdin-4UP.txt). I have been deleting the PVC and pod so far to recover, but this is very cumbersome.

oskarm93 avatar Jul 16 '24 11:07 oskarm93

+1 here too, not an ideal experience in a Prod environment.

elburnetto-intapp avatar Aug 06 '24 09:08 elburnetto-intapp

I just created a new Pod that mounted the same volume as my loki container that was crashing, and deleted the specific segment.

In my case the file looked something like /var/loki/tsdb-shipper-active/wal/gcs_2024-06-01/1719931836/00000000

And I just ran rm -r /var/loki/tsdb-shipper-active/wal/gcs_2024-06-01

Deleted my one-off Pod, and deleted the Loki container for k8s to restart it, and Loki came right back up. I agree, deleting the whole PVC is very cumbersome, but the log tells you the exact problematic file, so I don't know why that's the recommended procedure here anyway.
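
If it helps, here is a minimal sketch of the kind of one-off Pod I mean; the Pod name, namespace, and PVC name (data-loki-write-0) are just examples and have to match your own setup:

apiVersion: v1
kind: Pod
metadata:
  name: loki-wal-debug
  namespace: loki
spec:
  restartPolicy: Never
  containers:
    - name: debug
      image: busybox:1.36
      command: ["sleep", "3600"]
      volumeMounts:
        - name: loki-data
          mountPath: /var/loki   # mount at the same path the Loki container uses
  volumes:
    - name: loki-data
      persistentVolumeClaim:
        claimName: data-loki-write-0   # example name, substitute your actual PVC

Then kubectl exec into it, remove the segment directory from the error message, and delete this Pod again.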

But it'd be great if Loki had another method of gracefully recovering from apparent data corruption. I won't comment on just not getting corrupted data in the first place, since it's unclear how Loki gets into that state anyway. Presumably a container was force-killed during a WAL write or something like that.

Starttoaster avatar Aug 08 '24 23:08 Starttoaster

Hello, I'm kind of late, but I had the same thing happen to me last week, and deleting data from the disk and restarting Loki resolved the issue. However, did anyone find out how to prevent this from happening in the future? Or why the Loki WAL gets corrupted randomly?

HediKamoun0 avatar Sep 05 '24 08:09 HediKamoun0

For k8s deployments, enabling PVC deletion stopped the crash looping for us. Well, at least the corrupted WAL is not attached to another pod again:

write:
  persistence:
    enableStatefulSetAutoDeletePVC: true

LukoJy3D avatar Sep 05 '24 13:09 LukoJy3D

This will delete the PVC if the StatefulSet is deleted, but if it stays deployed then nothing would happen, right?

HediKamoun0 avatar Sep 06 '24 13:09 HediKamoun0

Even if it did delete the PVCs on each StatefulSet rollout, that sounds horrible. Loki only sends logs to your object storage bucket in intermittent cycles, so you're going to lose probably a couple of hours of logs on each redeploy. That doesn't strike me as a valid workaround to WAL corruption at all.

So far people in this Issue are recommending the worst workarounds that lead to the most data loss possible... My earlier comment specifies a workaround that only loses the corrupted file, and not everything else.

or why loki wal gets corrupted randomly ?

My earlier comment mentions a theory for why it happens. More of an educated guess, really.

Starttoaster avatar Sep 10 '24 18:09 Starttoaster

I was also getting the "corruption in segment..." error, and I solved it just by scaling the StatefulSet to 0 and then back to the number of pods it had.

peress avatar Sep 12 '24 09:09 peress

I faced the same problem when the used disk space reached 100%.

error running loki" err="corruption in segment /data/loki/index/wal/filesystem_2023-05-01/1727455534/00000000 at 81944: unexpected checksum 2ffb91ba, expected 5999e2d7
error recovering from TSDB WAL
...

image: grafana/loki:2.9.4
chart: grafana/loki-stack:2.16.0

I think soft errors can also lead to this problem

KovalevDima avatar Sep 28 '24 22:09 KovalevDima

Also facing this issue; it occurred when the disk quota was exceeded. Even after expanding the PVC storage size, it came back up with the corrupted segment error. I am running Loki 3.2.1.

sternma avatar Nov 04 '24 15:11 sternma

We were also hit by this. If it is really caused by the last segment not being written fully during ENOSPC, would it make sense to not write that segment, and just return an error?

gman0 avatar Nov 27 '24 09:11 gman0

Maybe Loki should implement a disk watermark and stop receiving logs when the disk is nearly full, to prevent corruption?

Serendo avatar Nov 28 '24 08:11 Serendo

I'm facing this issue too. Every time Loki receives a large batch of logs, the pod loki-write gives this error.

eduardorerickpimenta avatar Dec 06 '24 12:12 eduardorerickpimenta

Just to comment here: seeing more and more mentions of this and some duplicate issues, we'll try to prioritize a change to better behavior here, at least not getting stuck in a restart loop on the corrupt segment when running out of disk.

It's more difficult to have preventative measures here. I don't know that you can always rely on disk-space checks being accurate, and they would come at some cost, meaning we'd need to run them infrequently, still leaving opportunities for sudden volume spikes to run out of disk. Perhaps a configurable size limit would be an option, but Loki really doesn't need more configuration 😬

A good first step would be to stop the corruption from preventing starts so this is likely where we will start.

For folks that see this a lot though, generally I would say you are running too close to the edge so you should do some or all of these things:

  • run bigger WAL disks
  • run more ingesters
  • make sure you have automatic stream sharding enabled
  • set better ingestion limits on the distributors; this is really the tool designed to handle this kind of problem. You can't run out of disk space on your ingester WALs if you rate limit the incoming data to a rate your ingesters can handle for flushing (see the sketch below the list for the relevant settings)
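
As a rough illustration of that last point, these are the kinds of limits_config settings involved; the values below are placeholders, not recommendations, and need tuning per tenant and workload:

limits_config:
  # per-tenant ingestion limits enforced at the distributors
  ingestion_rate_strategy: global
  ingestion_rate_mb: 10           # placeholder value
  ingestion_burst_size_mb: 20     # placeholder value
  # automatic stream sharding splits hot streams so no single ingester takes the whole rate
  shard_streams:
    enabled: true
    desired_rate: 3MB             # placeholder value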

slim-bean avatar Dec 19 '24 14:12 slim-bean

But maybe when the WAL is corrupted it should discard it?

tbh I prefer the current behavior, as it at least gives you the chance of doing a backup of your corrupted WAL etc.

Just chipping in here: I think moving the file aside would be sensible in this case if you have the time and resources to recover it, or, if that is not possible, even giving the option to just delete corrupt WALs and continue. Preventing the service from starting may (and usually will) be more disruptive than limited data loss.

I can only speak for us here: we do not have the resources, nor is the data valuable enough, to motivate spending someone's time trying to manually recover a corrupt WAL file. We would just get into the volume and "rm" the file. If we could configure this to be the default behavior, we would prefer to use it. If the file were moved aside and Loki continued without it somehow, we would end up wasting disk space without knowing it. We would prefer operational stability over the (very slight) risk of losing some logs in exceptional situations, but I get that this is use-case dependent, so some kind of configurability might make sense here.

rickardp avatar Feb 13 '25 18:02 rickardp

The Loki ingester knows exactly which file is corrupted, and once corrupted there's really not much that can be done about the file except to remove it.

An opt-in option to auto-delete the corrupted file would save us all from late night on-calls!

dlemfh avatar Feb 19 '25 14:02 dlemfh

I'm fine with an opt-in auto-delete option. Alternatively, and this is slightly more complicated but could even be the default behavior imo: move the corrupted file somewhere within the ingester's data volume where it won't cause the ingester to fail to start up. A metric could be exported by the metrics endpoint exposed on :3100 that reports how many corrupted WALs there are in the volume.

Starttoaster avatar Feb 19 '25 20:02 Starttoaster

I'm fine with an opt-in auto-delete option. Alternatively, and this is slightly more complicated but could even be the default behavior imo: move the corrupted file somewhere within the ingester's data volume where it won't cause the ingester to fail to start up. A metric could be exported by the metrics endpoint exposed on :3100 that reports how many corrupted WALs there are in the volume.

As long as this can be turned off and auto-delete enabled. The risk with what you suggest is that the disk is slowly filling up with corrupted WALs without anyone noticing, and the disk gets full at the worst possible time

rickardp avatar Feb 19 '25 22:02 rickardp

The risk with what you suggest is that the disk is slowly filling up with corrupted WALs without anyone noticing, and the disk gets full at the worst possible time

Yeah, that's solved by the metric portion of my comment, with an implied second step of setting up alerts in Prometheus. You should also probably have more general alerts on your k8s volumes configured.
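
For example, a rough sketch of such a Prometheus alerting rule using the kubelet volume metrics; the PVC name pattern and the 10% threshold are placeholders to adapt:

groups:
  - name: loki-wal-volume
    rules:
      - alert: LokiWriteVolumeAlmostFull
        # matchers and threshold below are examples only
        expr: |
          kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-loki-write-.*"}
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data-loki-write-.*"} < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} has less than 10% free space"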

But moving the corrupted files out of the way by default comes with the benefit of potentially being able to salvage data out of them if you happen to be some form of data recovery expert.

Starttoaster avatar Feb 19 '25 23:02 Starttoaster

  • set better ingestion limits on the distributors; this is really the tool designed to handle this kind of problem. You can't run out of disk space on your ingester WALs if you rate limit the incoming data to a rate your ingesters can handle for flushing

Hi, could you please point to the set of parameters we need to focus on increasing?

s-luchina avatar Feb 26 '25 10:02 s-luchina

The risk with what you suggest is that the disk is slowly filling up with corrupted WALs without anyone noticing, and the disk gets full at the worst possible time

Yeah, that's solved by the metric portion of my comment, with an implied second step of setting up alerts in Prometheus. You should also probably have more general alerts on your k8s volumes configured.

But moving the corrupted files out of the way by default comes with the benefit of potentially being able to salvage data out of them if you happen to be some form of data recovery expert.

In that case you’d still have to get inside the ingester pod and manually rm the file (or create a script to do this).

But in any case, the current behavior of not starting up at all is the most disruptive one, so anything is an improvement IMHO.

@slim-bean Running out of disk space is not the only case where we have seen this. It has also happened to us on several occasions when we had unscheduled node failures (due to issues at the cloud provider), as an example.

rickardp avatar Feb 27 '25 06:02 rickardp

In that case you’d still have to get inside the ingester pod and manually rm the file (or create a script to do this).

To be clear, I'm kind of hoping both the opt-in auto-delete feature and the default move-out-of-the-way idea get implemented. Or something like them anyway.

But in any case, the current behavior of not starting up at all is the most disruptive one, so anything is an improvement IMHO.

Agreed.

Starttoaster avatar Mar 03 '25 18:03 Starttoaster

An opt-in option to auto-delete the corrupted file would save us all from late night on-calls!

This would have saved us.

aviau avatar Mar 28 '25 20:03 aviau