[Compactor] panic: unexpected seriesToChunkEncoder lack of iterations
Thanos, Prometheus and Golang version used:
thanos, version 0.31.0 (branch: HEAD, revision: 50c464132c265eef64254a9fd063b1e2419e09b7)
build user: root@63f5f37ee4e8
build date: 20230323-10:13:38
go version: go1.19.7
platform: linux/amd64
Object Storage Provider: S3
What happened:
Thanos compact throws panic: unexpected seriesToChunkEncoder lack of iterations and exits.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Full logs to relevant components:
panic: unexpected seriesToChunkEncoder lack of iterations
goroutine 50 [running]:
github.com/prometheus/prometheus/storage.(*compactChunkIterator).Next(0xc000b56bd0)
/go/pkg/mod/github.com/prometheus/[email protected]/storage/merge.go:753 +0x88c
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).populateBlock(0xc00091f260, {0xc00061a120, 0x2, 0x69?}, 0xc0003128f0, {0x2b54960, 0xc000562580}, {0x2b4dc80, 0xc000f63310})
/go/pkg/mod/github.com/prometheus/[email protected]/tsdb/compact.go:771 +0x1488
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).write(0xc00091f260, {0xc000bc62c0, 0x37}, 0xc0003128f0, {0xc00061a120, 0x2, 0x2})
/go/pkg/mod/github.com/prometheus/[email protected]/tsdb/compact.go:597 +0x64d
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).Compact(0xc00091f260, {0xc000bc62c0, 0x37}, {0xc0000b9fe0, 0x2, 0x4057e40?}, {0x0, 0x0, 0xc0008ea000?})
/go/pkg/mod/github.com/prometheus/[email protected]/tsdb/compact.go:438 +0x225
github.com/thanos-io/thanos/pkg/compact.(*Group).compact.func3({0x2b5a648?, 0xc00113a960?})
/app/pkg/compact/compact.go:1075 +0x4a
github.com/thanos-io/thanos/pkg/tracing.DoInSpanWithErr({0x2b5a648?, 0xc00113a960?}, {0x25ef5d1?, 0x2?}, 0xc000e91b48, {0x0?, 0xc000a6a240?, 0x0?})
/app/pkg/tracing/tracing.go:82 +0xd0
github.com/thanos-io/thanos/pkg/compact.(*Group).compact(0xc001585680, {0x2b5a648, 0xc00113a960}, {0xc000bc62c0, 0x37}, {0x2b42f20, 0xc0005f9bc0}, {0x2b4db40, 0xc00091f260})
/app/pkg/compact/compact.go:1074 +0xcab
github.com/thanos-io/thanos/pkg/compact.(*Group).Compact.func2({0x2b5a648?, 0xc00113a960?})
/app/pkg/compact/compact.go:775 +0x65
github.com/thanos-io/thanos/pkg/tracing.DoInSpanWithErr({0x2b5a5a0?, 0xc000666000?}, {0x25fcb34?, 0x9?}, 0xc000e91e30, {0xc000cc2d80?, 0x43cba7?, 0xc000e91d80?})
/app/pkg/tracing/tracing.go:82 +0xd0
github.com/thanos-io/thanos/pkg/compact.(*Group).Compact(0xc001585680, {0x2b5a5a0, 0xc000666000}, {0xc00089b8a0, 0x1b}, {0x2b42f20, 0xc0005f9bc0}, {0x2b4db40, 0xc00091f260})
/app/pkg/compact/compact.go:774 +0x35c
github.com/thanos-io/thanos/pkg/compact.(*BucketCompactor).Compact.func2()
/app/pkg/compact/compact.go:1250 +0x165
created by github.com/thanos-io/thanos/pkg/compact.(*BucketCompactor).Compact
/app/pkg/compact/compact.go:1247 +0x935
Anything else we need to know:
- args:
- compact
- --log.level=info
- --log.format=logfmt
- --http-address=0.0.0.0:10902
- --objstore.config-file=/etc/config/object-store.yaml
- --data-dir=/var/thanos/compact
- --consistency-delay=30m
- --retention.resolution-raw=30d
- --retention.resolution-5m=180d
- --retention.resolution-1h=1y
- --compact.concurrency=1
- --wait
- --deduplication.replica-label=__replica__
Also tried with vertical compaction enabled in another environment and still seeing the same panic:
- args:
- compact
- --log.level=info
- --log.format=logfmt
- --http-address=0.0.0.0:10902
- --objstore.config-file=/etc/config/object-store.yaml
- --data-dir=/var/thanos/compact
- --consistency-delay=30m
- --retention.resolution-raw=30d
- --retention.resolution-5m=180d
- --retention.resolution-1h=1y
- --compact.concurrency=1
- --wait
- --deduplication.replica-label=__replica__
- --compact.enable-vertical-compaction
- --delete-delay=0
Is this the same with the newest main version? Could you please try it? 0.31.0 is old :/
Hi @GiedriusS, upgrading to the latest version didn't resolve the issue:
thanos, version 0.32.4 (branch: HEAD, revision: fcd5683e3049924ae26a680e166ae6f27a344896)
build user: root@afb5016d2fc4
build date: 20231002-07:45:12
go version: go1.20.8
platform: linux/amd64
tags: netgo
As per suggestions on Slack, the deduplication function was added, since in our case applications are scraped by multiple Prometheus instances. This stopped the errors from happening. However, it also seems to have caused issues with compaction now, as it's been stuck on a single block for more than 3 days now. The current configuration is below:
- args:
- compact
- --log.level=debug
- --log.format=logfmt
- --http-address=0.0.0.0:10902
- --objstore.config-file=/etc/config/object-store.yaml
- --data-dir=/var/thanos/compact
- --consistency-delay=30m
- --retention.resolution-raw=30d
- --retention.resolution-5m=180d
- --retention.resolution-1h=1y
- --compact.concurrency=1
- --wait
- --deduplication.replica-label=__replica__
- --deduplication.func=penalty
- --compact.enable-vertical-compaction
- --delete-delay=168h
However, it also seems to have caused issues with compaction now, as it's been stuck on a single block for more than 3 days now.
What's the reason the block is stuck? Did you see any errors?
Hey - I've also seen a similar error on 0.32.4
{"caller":"compact.go:708","level":"info","msg":"Found overlapping blocks during compaction","ts":"2023-11-17T22:56:51.255652657Z","ulid":"01HFFR0H1PS6EWAP1ARPPZ4ZG8"}
panic: unexpected seriesToChunkEncoder lack of iterations
goroutine 289 [running]:
github.com/prometheus/prometheus/storage.(*compactChunkIterator).Next(0xc000274b40)
/go/pkg/mod/github.com/prometheus/[email protected]/storage/merge.go:753 +0x870
github.com/prometheus/prometheus/tsdb.DefaultBlockPopulator.PopulateBlock({}, {0x2d0f3a8, 0xc000789440}, 0xc0008c1500, {0x2cf1be0, 0xc0006ae0c0}, {0x2d00380, 0xc0000d9cc0}, 0xc000012448?, {0xc00143c040, ...}, ...)
/go/pkg/mod/github.com/prometheus/[email protected]/tsdb/compact.go:781 +0x1472
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).write(0xc0006c3860, {0xc00106c0f0, 0x29}, 0xc000806bb0, {0x2cfa620, 0x431d070}, {0xc00143c040, 0x2, 0x2})
/go/pkg/mod/github.com/prometheus/[email protected]/tsdb/compact.go:601 +0x6db
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).CompactWithBlockPopulator(0xc0006c3860, {0xc00106c0f0, 0x29}, {0xc00081a340, 0x2, 0x2d28040?}, {0x0, 0x0, 0xc0001ec380?}, {0x2cfa620, ...})
/go/pkg/mod/github.com/prometheus/[email protected]/tsdb/compact.go:442 +0x6bb
github.com/thanos-io/thanos/pkg/compact.(*Group).compact.func3({0x2d0f3a8, 0xc001c22420})
/app/pkg/compact/compact.go:1137 +0x125
github.com/thanos-io/thanos/pkg/tracing.DoInSpanWithErr({0x2d0f3a8?, 0xc001476270?}, {0x277957c?, 0x2?}, 0xc0010a5aa0, {0x0?, 0xc000ebc500?, 0x1?})
/app/pkg/tracing/tracing.go:82 +0xd0
github.com/thanos-io/thanos/pkg/compact.(*Group).compact(0xc000bbc8c0, {0x2d0f3a8, 0xc001476270}, {0xc00106c0f0, 0x29}, {0x2cf4280, 0xc000789770}, {0x2d07640, 0xc0006c3860}, {0x2cfa920, ...}, ...)
/app/pkg/compact/compact.go:1132 +0x10ad
github.com/thanos-io/thanos/pkg/compact.(*Group).Compact.func2({0x2d0f3a8?, 0xc001476270?})
/app/pkg/compact/compact.go:830 +0xd7
github.com/thanos-io/thanos/pkg/tracing.DoInSpanWithErr({0x2d0f300?, 0xc0008186e0?}, {0x2787486?, 0x9?}, 0xc0010a5e10, {0xc0000c60d0?, 0x40e227?, 0x58?})
/app/pkg/tracing/tracing.go:82 +0xd0
github.com/thanos-io/thanos/pkg/compact.(*Group).Compact(0xc000bbc8c0, {0x2d0f300, 0xc0008186e0}, {0xc0002662a0, 0xd}, {0x2cf4280, 0xc000789770}, {0x2d07640, 0xc0006c3860}, {0x2cfa920, ...}, ...)
/app/pkg/compact/compact.go:829 +0x3cc
github.com/thanos-io/thanos/pkg/compact.(*BucketCompactor).Compact.func2()
/app/pkg/compact/compact.go:1373 +0x18a
created by github.com/thanos-io/thanos/pkg/compact.(*BucketCompactor).Compact
/app/pkg/compact/compact.go:1370 +0x90a
When searching for 01HFFR0H1PS6EWAP1ARPPZ4ZG8 in the bucket web UI, nothing shows up. I also can't see a directory with that name within the object bucket.
Hi, thanks for all the bug reports. I wonder if it is possible for someone to share the problematic block, since I don't have a good way to reproduce this issue locally. Please let me know. You can reach out to me on Slack.
Seeing this panic on v0.34.0 as well. I also don't see the ULID from the logs in the actual bucket, and thanos tools bucket verify --log.level=debug --issues=overlapped_blocks against the bucket doesn't show anything.
Would be happy to provide data if I knew how to find the correct blocks.
Hey @bison, I think I narrowed this down to thanos trying to do vertical compaction on already compacted blocks - this could be the case if you've not previously had vertical compaction enabled.
If you want to try a hacky fix, you can try disabling compaction for all the blocks created before you enabled vertical compaction.
(That's presuming we have the same issue - it could be something different.)
In the compactor, look at the logs just before it crashed - it should start to compact several blocks - you'll need to mark these as no-compact, and you might need to do it many times to cover all the blocks that have already been compacted.
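A hedged sketch of that marking step, assuming the standard thanos tools bucket mark subcommand and the object-store config path used in the args above (exact flag names can vary slightly between Thanos versions; <BLOCK_ULID> is a placeholder for a block ID taken from your own compactor logs, not a real block from this issue):

# Write a no-compact-mark.json next to the block so the compactor skips it
thanos tools bucket mark \
  --objstore.config-file=/etc/config/object-store.yaml \
  --marker=no-compact-mark.json \
  --id=<BLOCK_ULID> \
  --details="skip vertical compaction of blocks created before it was enabled"

Repeat this (the --id flag can also be passed multiple times) for every block the crash-looping compactor tries to pick up.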
Hi @vCra, thanks for the investigation.
I think I narrowed this down to thanos trying to do vertical compaction on already compacted blocks - this could be the case if you've not previously had vertical compaction enabled
It is interesting to know that. How did you figure this out? Ideally it shouldn't matter to the compactor whether blocks were already compacted or not, so it shouldn't panic. Maybe we are missing something.
@vCra wow thanks, that's exactly what's happening. Just upgraded this stack and vertical compaction got enabled where it wasn't before. Now the first time the compactor encounters two previously compacted blocks at 5m resolution, it panics. If I mark the same blocks (and all other similar blocks) with no-compact, then compaction completes.
Edit: Actually I guess it's any previously compacted block. I originally thought it was only at that resolution for some reason.
How did you figure this out?
I'm only guessing that this is the issue - the compactor kept crashing, and I noticed that we were managing to vertically compact all the new blocks without issue, but the old blocks were not getting vertically compacted - in bucketweb it was quite clear. The issue was that no downsampling was happening - the count of downsample-todo kept slowly increasing. Looking at the logs was how we solved it - we thought it could be one or two corrupted blocks, so I kept marking all these blocks as no-compact - we had a large backlog so it took a while, but I slowly started to see a pattern that it was only the old blocks that were having an issue (one way to list block compaction levels is sketched after this comment).
Looking at bucket-web, we still have the old blocks, but just not vertically compacted - we don't care too much, as we won't use this data too frequently (10 is with vertical compaction)
The discussion in https://cloud-native.slack.com/archives/CK5RSSC10/p1681966324787459 helped too
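For anyone trying to identify those older, already-compacted blocks, a hedged sketch using thanos tools bucket inspect, assuming the same object-store config as above (the exact columns and sort flags may differ by Thanos version):

# Print block metadata: ULID, time range, compaction level, resolution, labels
thanos tools bucket inspect \
  --objstore.config-file=/etc/config/object-store.yaml \
  --sort-by=FROM

# Blocks with a compaction level greater than 1 whose time range predates
# enabling vertical compaction are the candidates described above; their ULIDs
# can then be fed to the no-compact marking command from the earlier comment.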
I spotted this in prod. Looking into it :eye: