Compact: crashes when finding a sample with out of order labels
Thanos: v0.27.0-rc.0
What happened:
Thanos Compact found a posting with out of order labels and went into a crash loop.
What you expected to happen:
Thanos Compact finds a posting with out of order labels and then:
- Fixes it before compacting if possible.
- Otherwise, ignores it.
- Logs at warning/critical level that this happened.
- Exports a metric to track this error so that alerts can be created to trigger when this happens a lot.
Most importantly, I would love Compact to not completely crash and stop doing its job.
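To make the expected behaviour above concrete, here is a minimal sketch of "fix if possible, and count that it happened". It is not Thanos code: the `Label` type, `repairLabels`, and the `outOfOrderTotal` counter (a plain integer standing in for an exported metric) are all hypothetical names, and it assumes label names are compared byte-wise.

```go
package main

import (
	"fmt"
	"sort"
)

// Label is a simplified stand-in for a label name/value pair (hypothetical type).
type Label struct {
	Name, Value string
}

// outOfOrderTotal is a plain counter standing in for an exported metric (hypothetical).
var outOfOrderTotal int

// isSorted reports whether label names are in ascending byte order.
func isSorted(ls []Label) bool {
	return sort.SliceIsSorted(ls, func(i, j int) bool { return ls[i].Name < ls[j].Name })
}

// repairLabels returns the labels in sorted order and reports whether a
// repair was needed. The input slice is never modified.
func repairLabels(ls []Label) ([]Label, bool) {
	if isSorted(ls) {
		return ls, false
	}
	outOfOrderTotal++ // a real component would increment a metric here
	fixed := make([]Label, len(ls))
	copy(fixed, ls)
	sort.Slice(fixed, func(i, j int) bool { return fixed[i].Name < fixed[j].Name })
	return fixed, true
}

func main() {
	// The label set from the log line below: "_id" sorts after "__name__".
	ls := []Label{{"_id", "test"}, {"__name__", "rhobs_e2e"}}
	fixed, repaired := repairLabels(ls)
	fmt.Println(fixed, repaired, outOfOrderTotal)
}
```

Duplicate label names are not handled in this sketch; they would need a separate check.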
How to reproduce it (as minimally and precisely as possible):
Have a block with a posting that contains out of order labels and try to compact it.
Full logs to relevant components:
level=warn ts=2022-07-13T13:09:33.180528476Z caller=index.go:267 msg="out-of-order label set: known bug in Prometheus 2.8.0 and below" labelset="{_id=\"test\", __name__=\"rhobs_e2e\"}" series=244977
level=warn ts=2022-07-13T13:09:33.180773659Z caller=intrumentation.go:67 msg="changing probe status" status=not-ready reason="error executing compaction: first pass of downsampling failed: downsampling to 5 min: input block index not valid: index contains 1 postings with out of order labels"
level=info ts=2022-07-13T13:09:33.180794468Z caller=http.go:84 service=http/server component=compact msg="internal server is shutting down" err="error executing compaction: first pass of downsampling failed: downsampling to 5 min: input block index not valid: index contains 1 postings with out of order labels"
level=info ts=2022-07-13T13:09:33.182943478Z caller=http.go:103 service=http/server component=compact msg="internal server is shutdown gracefully" err="error executing compaction: first pass of downsampling failed: downsampling to 5 min: input block index not valid: index contains 1 postings with out of order labels"
level=info ts=2022-07-13T13:09:33.18296634Z caller=intrumentation.go:81 msg="changing probe status" status=not-healthy reason="error executing compaction: first pass of downsampling failed: downsampling to 5 min: input block index not valid: index contains 1 postings with out of order labels"
level=error ts=2022-07-13T13:09:33.183061471Z caller=main.go:158 err="downsampling to 5 min: input block index not valid: index contains 1 postings with out of order labels\nfirst pass of downsampling failed\nmain.runCompact.func7\n\t/app/cmd/thanos/compact.go:440\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:476\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:75\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:475\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/[email protected]/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581\nerror executing compaction\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:503\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:75\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:475\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/[email protected]/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581\ncompact command...
Anything else we need to know:
The metric was written via Receive, so it looks like Receive does not validate label ordering.
I'm wondering why the compactor does not halt in this case :thinking: Someone with more compact knowledge, @bwplotka @yeya24 @GiedriusS?
Apart from that, fixing this manually should be an option with bucket verify (https://github.com/thanos-io/thanos/pull/964).
Halting only happens at the compaction stage. This error happens during downsampling, so there is no halting, if I understand correctly. I am also wondering why labels are not sorted when TSDB persists the head block to disk. If that's by design, then we are required to sort labels at ingestion time.
Thanks for the pointers @yeya24, I realize now that this actually happens in the downsampling phase and not in compaction. Since we had the debug.accept-malformed-index flag enabled, compaction went through, but now we have an 'inconsistency': downsampling does not have an option to ignore a malformed index, so it always errors out and crashes the compactor on that error.
Luckily we hit this with some test metrics which we don't really need, so we can just delete the offending block. But I'm wondering what a better course of action would be to keep the compactor out of a crash loop (besides ensuring ordering at ingestion time, which I'll look at in #5499).
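One possible course of action, sketched here purely as an illustration, is for the downsampling pass to skip a block whose index is invalid and carry on with the rest instead of failing the whole run. Nothing below is Thanos code: `block`, `downsampleBlock`, `downsampleAll`, and `errInvalidIndex` are hypothetical, and the `malformed` field simply simulates the index check failing.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errInvalidIndex simulates the "input block index not valid" failure.
var errInvalidIndex = errors.New("input block index not valid")

// block is a hypothetical stand-in for a TSDB block.
type block struct {
	ID        string
	malformed bool // simulates an index with out of order labels
}

// downsampleBlock pretends to downsample one block and fails on a malformed index.
func downsampleBlock(b block) error {
	if b.malformed {
		return fmt.Errorf("downsampling block %s: %w", b.ID, errInvalidIndex)
	}
	return nil
}

// downsampleAll processes every block. Blocks with an invalid index are
// logged and skipped; any other error still aborts the pass.
func downsampleAll(blocks []block) ([]string, error) {
	var skipped []string
	for _, b := range blocks {
		if err := downsampleBlock(b); err != nil {
			if errors.Is(err, errInvalidIndex) {
				log.Printf("level=warn msg=%q block=%s err=%q", "skipping block", b.ID, err)
				skipped = append(skipped, b.ID)
				continue
			}
			return skipped, err
		}
	}
	return skipped, nil
}

func main() {
	blocks := []block{{ID: "A"}, {ID: "B", malformed: true}, {ID: "C"}}
	skipped, err := downsampleAll(blocks)
	fmt.Println(skipped, err)
}
```

The trade-off is that a skipped block is never downsampled, so the list of skipped IDs would need to be surfaced (for example as a metric or a marker) rather than only logged.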
@matej-g Hello, I deployed the Thanos Compact component on Kubernetes. After a crash, Kubernetes restarts the pod. Will the restarted pod continue from the point where compaction last stopped, or will it start over from the beginning and end up in an infinite loop?
This is now resolved via #5690