
Compactor/S3: sometimes segments are not being fully downloaded ("segment doesn't include enough bytes")

JensErat opened this issue 4 years ago • 15 comments

Thanos, Prometheus and Golang version used:

Thanos 0.13 rc2; we are currently upgrading to 0.13, but the commits since the tag was cut do not indicate any relevant change.

Object Storage Provider:

Internal Cloudian S3 service

What happened:

The compactor halts with the following error message:

level=error ts=2020-06-25T07:57:55.183632304Z caller=compact.go:375 msg="critical error detected; halting" err="compaction: group 0@9955244503410235132: compact blocks [/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]: write compaction: iterate compaction set: chunk 41943009 not found: segment doesn't include enough bytes to read the chunk - required:41943044, available:41943040"

After restarting the compactor, compaction continues successfully, usually hours later -- we only get alerted if the compactor has not run for quite some time, so I cannot tell whether this is a fairly short inconsistency window combined with a lot of bad luck.

What you expected to happen:

If this really is a transient issue caused by eventual consistency, retry instead of halting.

How to reproduce it (as minimally and precisely as possible):

Slightly overload your S3 service and wait. We see an estimated failure rate of roughly once a week per compactor. It mostly happens in installations with some network distance (ping >= 200ms).

Full logs to relevant components:

Logs

[...]
level=info ts=2020-06-25T07:42:56.091077077Z caller=block.go:157 group="0@{monitor=\"prometheus\", replica=\"10\"}" groupKey=0@7398988455995530398 msg="block has been marked for deletion" block=01EAQYJKPJS17YPE28VZ48JF7Q
level=info ts=2020-06-25T07:42:56.11841702Z caller=compact.go:792 group="0@{monitor=\"prometheus\", replica=\"10\"}" groupKey=0@7398988455995530398 msg="marking compacted block for deletion" old_block=01EAR5EAY80Z79JCDWQBAVVGJ4
level=info ts=2020-06-25T07:42:56.358365126Z caller=block.go:157 group="0@{monitor=\"prometheus\", replica=\"10\"}" groupKey=0@7398988455995530398 msg="block has been marked for deletion" block=01EAR5EAY80Z79JCDWQBAVVGJ4
level=info ts=2020-06-25T07:42:57.695050512Z caller=compact.go:633 group="0@{monitor=\"prometheus-ha\", replica=\"0\"}" groupKey=0@9955244503410235132 msg="compaction available and planned; downloading blocks" plan="[/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]"
level=info ts=2020-06-25T07:55:17.496533315Z caller=compact.go:692 group="0@{monitor=\"prometheus-ha\", replica=\"0\"}" groupKey=0@9955244503410235132 msg="downloaded and verified blocks; compacting blocks" plan="[/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]" duration=12m19.801396575s
level=error ts=2020-06-25T07:57:55.183632304Z caller=compact.go:375 msg="critical error detected; halting" err="compaction: group 0@9955244503410235132: compact blocks [/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]: write compaction: iterate compaction set: chunk 41943009 not found: segment doesn't include enough bytes to read the chunk - required:41943044, available:41943040"

Anything else we need to know:

I'm pretty sure this is related to #2128; that issue, prometheus/prometheus#6040, and our own observations make me think this might be a transient issue. It's hard to tell since we cannot really reproduce it -- does #2637 perhaps fix this already?

JensErat avatar Jun 25 '20 11:06 JensErat

This is not a retryable error, since we don't have enough data on disk at the point where compaction runs and this occurs.

We have been running into this for the past month or so, but only in our staging environment. After https://github.com/thanos-io/thanos/pull/2637 we have found that:

  • One chunk file is missing some data during compaction
  • All of the content of the affected file that is on disk is correct, i.e. there is no corruption going on; the file is simply missing some data
  • The calls to S3 succeed, i.e. a 200 is returned to minio-go
  • We have load-balancers in front of our S3 endpoints, and changing the configuration to go to the Ceph Object Gateway directly seems to fix the problem. It has been running for a week without issues.

My best theory, for now, is that the transfer gets cut off abruptly, and minio-go eats that error up as io.EOF. It then thinks it got everything, and the caller continues its execution. I think it has something to do with keep-alive connections not being maintained properly.
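A small, self-contained illustration of why a cut-off transfer can look like a clean end of stream (my own plain-Go sketch, not Thanos or minio-go code; the one-shot server and the 5-byte body are made up for the demonstration): when the response declares a Content-Length, Go's HTTP client reports an early connection close as io.ErrUnexpectedEOF, but when the body is only delimited by the connection closing, the same truncation surfaces as an ordinary io.EOF that the caller cannot tell apart from a complete transfer.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
	"net/http"
)

// serveTruncated starts a one-shot server that reads the request, writes the
// given extra headers plus a 5-byte body (even if the headers promise more),
// and then closes the connection.
func serveTruncated(extraHeaders string) (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	go func() {
		defer ln.Close()
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		// Consume the request so that closing the socket sends a clean FIN
		// instead of a reset.
		if _, err := http.ReadRequest(bufio.NewReader(conn)); err != nil {
			return
		}
		fmt.Fprintf(conn, "HTTP/1.1 200 OK\r\nConnection: close\r\n%s\r\n12345", extraHeaders)
	}()
	return ln.Addr().String(), nil
}

func fetch(addr string) {
	resp, err := http.Get("http://" + addr)
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	fmt.Printf("read %d bytes, error: %v\n", len(body), err)
}

func main() {
	// Content-Length promises 10 bytes but only 5 arrive: the truncation is
	// visible to the caller as io.ErrUnexpectedEOF.
	addr, err := serveTruncated("Content-Length: 10\r\n")
	if err != nil {
		panic(err)
	}
	fetch(addr)

	// No Content-Length: the body is delimited by the connection closing, so
	// the same truncation ends in a clean EOF and looks like a complete read.
	addr, err = serveTruncated("")
	if err != nil {
		panic(err)
	}
	fetch(addr)
}
```

If some responses end up on the second code path (or a load balancer mangles the length information), a short but "successful" read is exactly what the caller would see.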

After that PR, you should be able to inspect the state of files on disk and see if this happens for you as well. Help welcome!
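For inspecting the state of files on disk, here is a rough diagnostic sketch (assuming minio-go v7; the endpoint, bucket, credential environment variables, and local paths are placeholders) that compares each downloaded chunk segment with the size the object store reports for the same object -- a mismatch would point at a truncated download rather than a bad upload:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	const (
		bucket   = "thanos-metrics"                                         // placeholder
		blockDir = "/var/thanos/compact/data/compact/0@9955244503410235132" // placeholder
		blockID  = "01EAQGV34VQ8PWVKXJ4V2K8M5B"                             // placeholder
	)

	client, err := minio.New("s3.example.internal", &minio.Options{ // placeholder endpoint
		Creds:  credentials.NewStaticV4(os.Getenv("S3_ACCESS_KEY"), os.Getenv("S3_SECRET_KEY"), ""),
		Secure: true,
	})
	if err != nil {
		panic(err)
	}

	chunksDir := filepath.Join(blockDir, blockID, "chunks")
	entries, err := os.ReadDir(chunksDir)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		local, err := os.Stat(filepath.Join(chunksDir, e.Name()))
		if err != nil {
			panic(err)
		}
		// Thanos lays blocks out in the bucket as <ULID>/chunks/<segment>.
		remote, err := client.StatObject(context.Background(), bucket,
			blockID+"/chunks/"+e.Name(), minio.StatObjectOptions{})
		if err != nil {
			panic(err)
		}
		if local.Size() != remote.Size {
			fmt.Printf("%s: local %d bytes, bucket %d bytes -> truncated download?\n",
				e.Name(), local.Size(), remote.Size)
		}
	}
}
```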

GiedriusS avatar Jun 25 '20 16:06 GiedriusS

I further analyzed one of the halts:

  • compact uploaded a block successfully (at least no issues were indicated in the logs) on 2020-06-25 around noon
  • compact downloaded the block again on 2020-06-27 and ended up with incomplete data. Downloading the block took around 5h(!), an operation usually finished within minutes
  • after restarting the compactor, it downloads the block again and successfully continues

thanos-store did not have any issues with this block.

With two days in between, I do not really suspect eventual consistency issues. This looks more like connection problems during download -- but shouldn't the client library recognize such issues?
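The client library cannot always tell from the stream alone, so the caller can double-check. A minimal sketch of that idea (assuming minio-go v7; endpoint, bucket, object key, and destination path are placeholders, and the retry loop is only illustrative): compare the bytes actually written to disk with the size the object store advertises, and re-download on a mismatch instead of handing a short file to the compactor.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"os"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

// downloadVerified fetches an object and fails if fewer bytes arrive than the
// object store claims the object has, so a silently truncated transfer does
// not go unnoticed.
func downloadVerified(ctx context.Context, c *minio.Client, bucket, key, dst string) error {
	obj, err := c.GetObject(ctx, bucket, key, minio.GetObjectOptions{})
	if err != nil {
		return err
	}
	defer obj.Close()

	info, err := obj.Stat() // size the object store advertises
	if err != nil {
		return err
	}

	f, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer f.Close()

	n, err := io.Copy(f, obj)
	if err != nil {
		return err
	}
	if n != info.Size {
		// The stream ended early without an error: treat it as retryable.
		return fmt.Errorf("truncated download of %s: got %d bytes, expected %d", key, n, info.Size)
	}
	return nil
}

func main() {
	c, err := minio.New("s3.example.internal", &minio.Options{ // placeholder endpoint
		Creds:  credentials.NewStaticV4(os.Getenv("S3_ACCESS_KEY"), os.Getenv("S3_SECRET_KEY"), ""),
		Secure: true,
	})
	if err != nil {
		panic(err)
	}
	// Placeholder bucket, object key and destination file.
	for attempt := 1; attempt <= 3; attempt++ {
		err = downloadVerified(context.Background(), c,
			"thanos-metrics", "01EAQGV34VQ8PWVKXJ4V2K8M5B/chunks/000001", "/tmp/000001")
		if err == nil {
			return
		}
		fmt.Printf("attempt %d failed: %v\n", attempt, err)
	}
}
```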

If there's anything more we can do, we're glad to jump in. For now, we will likely switch to debug.halt=false, as retrying/re-downloading seems to work well.

JensErat avatar Jun 29 '20 12:06 JensErat

Hello 👋 Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Jul 29 '20 12:07 stale[bot]

Still an issue for us, and automatically restarting/not halting the compactor makes compaction continue pretty much immediately without any issues.

JensErat avatar Jul 29 '20 14:07 JensErat

Hello 👋 Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Aug 28 '20 17:08 stale[bot]

We are seeing the same issue. We have 200+ compactors and this is happening in all of them. Restarting the pod fixes the issue temporarily.

crsandeep avatar Sep 03 '20 06:09 crsandeep

We are seeing the same issue. We have 200+ compactors and this is happening in all of them. Restarting the pod fixes the issue temporarily.

Are you using S3 as well? What is implementing this API underneath, i.e. are you using AWS or something else?

GiedriusS avatar Sep 03 '20 07:09 GiedriusS

@GiedriusS We are using Ceph object store which is S3 compatible.

crsandeep avatar Sep 03 '20 16:09 crsandeep

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Nov 03 '20 03:11 stale[bot]

Closing for now as promised, let us know if you need this to be reopened! 🤗

stale[bot] avatar Nov 17 '20 04:11 stale[bot]

We are experiencing the same issue with Thanos v0.17.2 and the Ceph Object Gateway. @JensErat were you able to solve your issue?

goober avatar Jan 29 '21 08:01 goober

My best theory, for now, is that the transfer gets cut off abruptly, and minio-go eats that error up as io.EOF

From a (very) quick look at Minio internals, I suspect this is the case as well.

pracucci avatar Feb 12 '21 11:02 pracucci

Should be fixed with https://github.com/thanos-io/thanos/pull/3795.

GiedriusS avatar Feb 24 '21 22:02 GiedriusS

I still caught the same error during compaction. I'm wondering whether there is a way for me to overcome this issue, or to delete only the bad series using the bucket rewrite tool. Judging from the error log, what I can do is very limited.

level=error ts=2022-06-16T06:34:26.254782121Z caller=compact.go:486 msg="critical error detected; halting" err="compaction: group 0@17721352815461611655: compact blocks [/var/thanos/compact/compact/0@17721352815461611655/01G5HSCFX18SE036E3C80SWTMG /var/thanos/compact/compact/0@17721352815461611655/01G5J0875CJCBXEBQKN9608VPJ /var/thanos/compact/compact/0@17721352815461611655/01G5J73YD0EYV92KHTEEWFSVV3 /var/thanos/compact/compact/0@17721352815461611655/01G5JDZNP5A9GS6CQ695XGQ9M0]: populate block: chunk iter: cannot populate chunk 8858369979: segment doesn't include enough bytes to read the chunk - required:268435469, available:268435456"

yeya24 avatar Jun 22 '22 01:06 yeya24

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Sep 21 '22 02:09 stale[bot]

Hi, I also have the same issue with NetApp StorageGrid object storage, running Thanos 0.30.1. Is there any way to check or fix blocks with the tools bucket rewrite tool?

nitnatsnocER avatar Jan 11 '23 17:01 nitnatsnocER