
[Compactor] panic: unexpected seriesToChunkEncoder lack of iterations

Open piotrhryszko-img opened this issue 1 year ago • 12 comments

Thanos, Prometheus and Golang version used:

thanos, version 0.31.0 (branch: HEAD, revision: 50c464132c265eef64254a9fd063b1e2419e09b7)
  build user:       root@63f5f37ee4e8
  build date:       20230323-10:13:38
  go version:       go1.19.7
  platform:         linux/amd64

Object Storage Provider: S3

What happened: Thanos compact throws panic: unexpected seriesToChunkEncoder lack of iterations and exits

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

Logs

panic: unexpected seriesToChunkEncoder lack of iterations

goroutine 50 [running]:
github.com/prometheus/prometheus/storage.(*compactChunkIterator).Next(0xc000b56bd0)
	/go/pkg/mod/github.com/prometheus/[email protected]/storage/merge.go:753 +0x88c
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).populateBlock(0xc00091f260, {0xc00061a120, 0x2, 0x69?}, 0xc0003128f0, {0x2b54960, 0xc000562580}, {0x2b4dc80, 0xc000f63310})
	/go/pkg/mod/github.com/prometheus/[email protected]/tsdb/compact.go:771 +0x1488
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).write(0xc00091f260, {0xc000bc62c0, 0x37}, 0xc0003128f0, {0xc00061a120, 0x2, 0x2})
	/go/pkg/mod/github.com/prometheus/[email protected]/tsdb/compact.go:597 +0x64d
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).Compact(0xc00091f260, {0xc000bc62c0, 0x37}, {0xc0000b9fe0, 0x2, 0x4057e40?}, {0x0, 0x0, 0xc0008ea000?})
	/go/pkg/mod/github.com/prometheus/[email protected]/tsdb/compact.go:438 +0x225
github.com/thanos-io/thanos/pkg/compact.(*Group).compact.func3({0x2b5a648?, 0xc00113a960?})
	/app/pkg/compact/compact.go:1075 +0x4a
github.com/thanos-io/thanos/pkg/tracing.DoInSpanWithErr({0x2b5a648?, 0xc00113a960?}, {0x25ef5d1?, 0x2?}, 0xc000e91b48, {0x0?, 0xc000a6a240?, 0x0?})
	/app/pkg/tracing/tracing.go:82 +0xd0
github.com/thanos-io/thanos/pkg/compact.(*Group).compact(0xc001585680, {0x2b5a648, 0xc00113a960}, {0xc000bc62c0, 0x37}, {0x2b42f20, 0xc0005f9bc0}, {0x2b4db40, 0xc00091f260})
	/app/pkg/compact/compact.go:1074 +0xcab
github.com/thanos-io/thanos/pkg/compact.(*Group).Compact.func2({0x2b5a648?, 0xc00113a960?})
	/app/pkg/compact/compact.go:775 +0x65
github.com/thanos-io/thanos/pkg/tracing.DoInSpanWithErr({0x2b5a5a0?, 0xc000666000?}, {0x25fcb34?, 0x9?}, 0xc000e91e30, {0xc000cc2d80?, 0x43cba7?, 0xc000e91d80?})
	/app/pkg/tracing/tracing.go:82 +0xd0
github.com/thanos-io/thanos/pkg/compact.(*Group).Compact(0xc001585680, {0x2b5a5a0, 0xc000666000}, {0xc00089b8a0, 0x1b}, {0x2b42f20, 0xc0005f9bc0}, {0x2b4db40, 0xc00091f260})
	/app/pkg/compact/compact.go:774 +0x35c
github.com/thanos-io/thanos/pkg/compact.(*BucketCompactor).Compact.func2()
	/app/pkg/compact/compact.go:1250 +0x165
created by github.com/thanos-io/thanos/pkg/compact.(*BucketCompactor).Compact
	/app/pkg/compact/compact.go:1247 +0x935

Anything else we need to know:

        - args:
            - compact
            - --log.level=info
            - --log.format=logfmt
            - --http-address=0.0.0.0:10902
            - --objstore.config-file=/etc/config/object-store.yaml
            - --data-dir=/var/thanos/compact
            - --consistency-delay=30m
            - --retention.resolution-raw=30d
            - --retention.resolution-5m=180d
            - --retention.resolution-1h=1y
            - --compact.concurrency=1
            - --wait
            - --deduplication.replica-label=__replica__

piotrhryszko-img avatar Oct 05 '23 13:10 piotrhryszko-img

also tried with vertical compaction enabled on another environment and still seeing the same panic

        - args:
            - compact
            - --log.level=info
            - --log.format=logfmt
            - --http-address=0.0.0.0:10902
            - --objstore.config-file=/etc/config/object-store.yaml
            - --data-dir=/var/thanos/compact
            - --consistency-delay=30m
            - --retention.resolution-raw=30d
            - --retention.resolution-5m=180d
            - --retention.resolution-1h=1y
            - --compact.concurrency=1
            - --wait
            - --deduplication.replica-label=__replica__
            - --compact.enable-vertical-compaction
            - --delete-delay=0

piotrhryszko-img avatar Oct 09 '23 10:10 piotrhryszko-img

Is this the same with the newest main version? Could you please try it? 0.31.0 is old :/

GiedriusS avatar Oct 10 '23 09:10 GiedriusS

Hi @GiedriusS, upgrading to the latest version didn't resolve the issue.

thanos, version 0.32.4 (branch: HEAD, revision: fcd5683e3049924ae26a680e166ae6f27a344896)
  build user:       root@afb5016d2fc4
  build date:       20231002-07:45:12
  go version:       go1.20.8
  platform:         linux/amd64
  tags:             netgo

As suggested on Slack, we added a deduplication function (--deduplication.func=penalty), since in our case applications are scraped by multiple Prometheus instances. This stopped the errors from happening. However, it also seems to have caused a problem with compaction: it has been stuck on a single block for more than 3 days. The current configuration is below.

        - args:
            - compact
            - --log.level=debug
            - --log.format=logfmt
            - --http-address=0.0.0.0:10902
            - --objstore.config-file=/etc/config/object-store.yaml
            - --data-dir=/var/thanos/compact
            - --consistency-delay=30m
            - --retention.resolution-raw=30d
            - --retention.resolution-5m=180d
            - --retention.resolution-1h=1y
            - --compact.concurrency=1
            - --wait
            - --deduplication.replica-label=__replica__
            - --deduplication.func=penalty
            - --compact.enable-vertical-compaction
            - --delete-delay=168h

piotrhryszko-img avatar Oct 12 '23 08:10 piotrhryszko-img

However, it also seems to have caused issues with compaction now, as it's been stuck on a single block for more than 3 days now.

What's the reason the block is stuck? Did you see any errors?

yeya24 avatar Oct 23 '23 18:10 yeya24

Hey - I've also seen a similar error on 0.32.4

{"caller":"compact.go:708","level":"info","msg":"Found overlapping blocks during compaction","ts":"2023-11-17T22:56:51.255652657Z","ulid":"01HFFR0H1PS6EWAP1ARPPZ4ZG8"}
panic: unexpected seriesToChunkEncoder lack of iterations

goroutine 289 [running]:
github.com/prometheus/prometheus/storage.(*compactChunkIterator).Next(0xc000274b40)
	/go/pkg/mod/github.com/prometheus/[email protected]/storage/merge.go:753 +0x870
github.com/prometheus/prometheus/tsdb.DefaultBlockPopulator.PopulateBlock({}, {0x2d0f3a8, 0xc000789440}, 0xc0008c1500, {0x2cf1be0, 0xc0006ae0c0}, {0x2d00380, 0xc0000d9cc0}, 0xc000012448?, {0xc00143c040, ...}, ...)
	/go/pkg/mod/github.com/prometheus/[email protected]/tsdb/compact.go:781 +0x1472
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).write(0xc0006c3860, {0xc00106c0f0, 0x29}, 0xc000806bb0, {0x2cfa620, 0x431d070}, {0xc00143c040, 0x2, 0x2})
	/go/pkg/mod/github.com/prometheus/[email protected]/tsdb/compact.go:601 +0x6db
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).CompactWithBlockPopulator(0xc0006c3860, {0xc00106c0f0, 0x29}, {0xc00081a340, 0x2, 0x2d28040?}, {0x0, 0x0, 0xc0001ec380?}, {0x2cfa620, ...})
	/go/pkg/mod/github.com/prometheus/[email protected]/tsdb/compact.go:442 +0x6bb
github.com/thanos-io/thanos/pkg/compact.(*Group).compact.func3({0x2d0f3a8, 0xc001c22420})
	/app/pkg/compact/compact.go:1137 +0x125
github.com/thanos-io/thanos/pkg/tracing.DoInSpanWithErr({0x2d0f3a8?, 0xc001476270?}, {0x277957c?, 0x2?}, 0xc0010a5aa0, {0x0?, 0xc000ebc500?, 0x1?})
	/app/pkg/tracing/tracing.go:82 +0xd0
github.com/thanos-io/thanos/pkg/compact.(*Group).compact(0xc000bbc8c0, {0x2d0f3a8, 0xc001476270}, {0xc00106c0f0, 0x29}, {0x2cf4280, 0xc000789770}, {0x2d07640, 0xc0006c3860}, {0x2cfa920, ...}, ...)
	/app/pkg/compact/compact.go:1132 +0x10ad
github.com/thanos-io/thanos/pkg/compact.(*Group).Compact.func2({0x2d0f3a8?, 0xc001476270?})
	/app/pkg/compact/compact.go:830 +0xd7
github.com/thanos-io/thanos/pkg/tracing.DoInSpanWithErr({0x2d0f300?, 0xc0008186e0?}, {0x2787486?, 0x9?}, 0xc0010a5e10, {0xc0000c60d0?, 0x40e227?, 0x58?})
	/app/pkg/tracing/tracing.go:82 +0xd0
github.com/thanos-io/thanos/pkg/compact.(*Group).Compact(0xc000bbc8c0, {0x2d0f300, 0xc0008186e0}, {0xc0002662a0, 0xd}, {0x2cf4280, 0xc000789770}, {0x2d07640, 0xc0006c3860}, {0x2cfa920, ...}, ...)
	/app/pkg/compact/compact.go:829 +0x3cc
github.com/thanos-io/thanos/pkg/compact.(*BucketCompactor).Compact.func2()
	/app/pkg/compact/compact.go:1373 +0x18a
created by github.com/thanos-io/thanos/pkg/compact.(*BucketCompactor).Compact
	/app/pkg/compact/compact.go:1370 +0x90a

When searching for 01HFFR0H1PS6EWAP1ARPPZ4ZG8 in bucket web nothing shows up. I also can't see a directory with that name within the object bucket
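For anyone collecting these ULIDs from JSON-formatted compactor output like the line above, here is a minimal Python sketch that pulls the `ulid` field out of every "Found overlapping blocks during compaction" record and skips non-JSON lines such as the panic and stack trace. The function name `overlapping_ulids` is hypothetical, and the only log shape it assumes is the one quoted in this comment. Whether the reported ULID corresponds to a block that already exists in the bucket is not established in this thread.

```python
import json


def overlapping_ulids(lines):
    """Collect ULIDs from JSON log records reporting overlapping blocks.

    Assumes records shaped like the line quoted in this thread; anything
    that is not a JSON object (panic text, stack frames) is ignored.
    """
    ulids = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON log record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed record
        if (
            record.get("msg") == "Found overlapping blocks during compaction"
            and "ulid" in record
        ):
            ulids.append(record["ulid"])
    return ulids
```

It accepts any iterable of lines, so an open log file can be passed directly.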

vCra avatar Nov 17 '23 23:11 vCra

Hi, thanks for all the bug reports. I wonder if it is possible for someone to share the problematic block, since I don't have a good way to reproduce this issue locally. Please let me know. You can reach out to me on Slack.

yeya24 avatar Nov 18 '23 05:11 yeya24

Seeing this panic on v0.34.0 as well. I also don't see the ULID from the logs in the actual bucket, and thanos tools bucket verify --log.level=debug --issues=overlapped_blocks against the bucket doesn't show anything.

Would be happy to provide data if I knew how to find the correct blocks.

bison avatar Feb 28 '24 17:02 bison

Hey @bison I think I narrowed this down to thanos trying to do vertical compaction on already compacted blocks - this could be the case if you've not previously had vertical compaction enabled

If you want to try a hacky fix, you can try disabling compaction for all the blocks created before you enabled vertical compaction

(That's presuming we have the same issue - it could be something different)

In compact, look at the logs before it crashed - it should start to compact several blocks - you'll need to mark these, and you might need to do it lots of times for all the blocks that have already been compacted
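As a rough sketch of the marking step, the Python snippet below builds one `thanos tools bucket mark` command per block ID using the `no-compact-mark.json` marker. The function name `no_compact_commands`, the default details text, and the example ULIDs are hypothetical. The object store config path is taken from the args posted earlier in this thread. Check `thanos tools bucket mark --help` for your version before running anything it prints.

```python
import shlex


def no_compact_commands(
    ulids,
    objstore_config="/etc/config/object-store.yaml",  # path from the args in this thread
    details="skip vertical compaction of previously compacted block",  # hypothetical text
):
    """Return shell-quoted `thanos tools bucket mark` commands, one per ULID."""
    commands = []
    for ulid in ulids:
        argv = [
            "thanos", "tools", "bucket", "mark",
            f"--id={ulid}",
            "--marker=no-compact-mark.json",
            f"--details={details}",
            f"--objstore.config-file={objstore_config}",
        ]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Placeholder IDs; substitute the ULIDs the compactor logged before it crashed.
    for command in no_compact_commands(["<ULID-1>", "<ULID-2>"]):
        print(command)
```

Printing the commands rather than executing them leaves room to review the list first, which seems prudent given that the set of affected blocks is only a guess at this point.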

vCra avatar Feb 28 '24 21:02 vCra

Hi @vCra, thanks for the investigation.

I think I narrowed this down to thanos trying to do vertical compaction on already compacted blocks - this could be the case if you've not previously had vertical compaction enabled

It is interesting to know that. How did you figure this out? Ideally, it shouldn't matter to the compactor whether blocks are already compacted or not, so it shouldn't panic. Maybe we are missing something.

yeya24 avatar Feb 28 '24 22:02 yeya24

@vCra wow thanks, that's exactly what's happening. Just upgraded this stack and vertical compaction got enabled where it wasn't before. Now the first time the compactor encounters two previously compacted blocks at 5m resolution, it panics. If I mark the same blocks (and all other similar blocks) with no-compact, then compaction completes.

Edit: Actually I guess it's any previously compacted block. I originally thought it was only at that resolution for some reason.
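One way to list candidate blocks, sketched in Python: read each block's meta.json from a local copy of the bucket and pick the ones whose `compaction.level` is greater than 1. Treating "level > 1" as equivalent to "previously compacted" is an assumption on my part, not something confirmed in this thread. The function names and the local directory layout (`<root>/<ULID>/meta.json`) are also hypothetical.

```python
import json
from pathlib import Path


def load_metas(root):
    """Load every <root>/<ULID>/meta.json from a local copy of the bucket."""
    return [
        json.loads(path.read_text())
        for path in sorted(Path(root).glob("*/meta.json"))
    ]


def previously_compacted(metas):
    """Return ULIDs of blocks whose meta.json reports compaction level > 1.

    Assumption: a level above 1 means the block is the output of an earlier
    compaction. Blocks without a compaction section are treated as level 1.
    """
    return [
        meta["ulid"]
        for meta in metas
        if meta.get("compaction", {}).get("level", 1) > 1
    ]
```

The resulting list is only a starting point; it is worth comparing it against the blocks the compactor logs just before the panic.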

bison avatar Feb 29 '24 12:02 bison

How did you figure this out?

I'm only guessing that this is the issue. The compactor kept crashing, and I noticed that we were managing to vertically compact all the new blocks without issue, but the old blocks were not getting vertically compacted - in bucket web it was quite clear. The visible symptom was that no downsampling was happening - the count of downsample-todo kept slowly increasing. Looking at the logs was how we solved it: we thought it could be one or two corrupted blocks, so I kept marking these blocks as don't compact. We had a large backlog so it took a while, but I slowly started to see a pattern: only the old blocks were having an issue.

Looking at bucket-web, we still have the old blocks, but just not vertically compacted - we don't care too much, as we won't use this data too frequently (10 is with vertical compaction)

Screenshot 2024-02-29 at 23 51 06

The discussion in https://cloud-native.slack.com/archives/CK5RSSC10/p1681966324787459 helped too

vCra avatar Feb 29 '24 23:02 vCra

I spotted this in prod. Looking into it :eye:

GiedriusS avatar Apr 30 '24 07:04 GiedriusS