thanos icon indicating copy to clipboard operation
thanos copied to clipboard

fix(sidecar): Allow sidecar to not crash on startup if objstore is not available

Open amaury-d opened this issue 1 year ago • 6 comments

This PR adds --shipper.retry-init option to sidecar to allow it to continue to serve prometheus read path if shipper init is failing at startup (eg: network issue on objstore). On each upload, if shipper is not initialized, reinitialization will be attempted again.

This PR also brings a new metric thanos_sidecar_shipper_up to alert in case of shipper issue.

  • [X] I added CHANGELOG entry for this change.
  • [X] Change is not relevant to the end user.

Changes

  • New --shipper.retry-init option to allow retrying shipper init on each upload
  • New thanos_sidecar_shipper_up metric
  • Add .bin to .gitignore

Verification

I used thanos-docker-compose. I then configured a bogus objstore.config to force crash.

1. Current situation with broken objstore at startup

ts=2024-09-06T13:30:59.31896362Z caller=main.go:145 level=error err="decode account key: illegal base64 data at input byte 0\ncreate AZURE client\ngithub.com/thanos-io/objstore/client.NewBucket\n\t/Users/adecreme/go/pkg/mod/github.com/thanos-io/[email protected]/client/factory.go:90\nmain.runSidecar\n\t/Users/adecreme/github/thanos/thanos/cmd/thanos/sidecar.go:370\nmain.registerSidecar.func1\n\t/Users/adecreme/github/thanos/thanos/cmd/thanos/sidecar.go:105\nmain.main\n\t/Users/adecreme/github/thanos/thanos/cmd/thanos/main.go:143\nruntime.main\n\t/opt/homebrew/Cellar/[email protected]/1.22.6/libexec/src/runtime/proc.go:271\nruntime.goexit\n\t/opt/homebrew/Cellar/[email protected]/1.22.6/libexec/src/runtime/asm_arm64.s:1222\npreparing sidecar command failed\nmain.main\n\t/Users/adecreme/github/thanos/thanos/cmd/thanos/main.go:145\nruntime.main\n\t/opt/homebrew/Cellar/[email protected]/1.22.6/libexec/src/runtime/proc.go:271\nruntime.goexit\n\t/opt/homebrew/Cellar/[email protected]/1.22.6/libexec/src/runtime/asm_arm64.s:1222"

Then sidecar crashes.

2. With --shipper.retry-init with broken objstore

ts=2024-09-06T14:47:30.920080081Z caller=grpc.go:167 level=info service=gRPC/server component=sidecar msg="listening for serving gRPC" address=0.0.0.0:10901
ts=2024-09-06T14:47:32.916587861Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2024-09-06T14:47:32.917102072Z caller=azure.go:149 level=debug msg="creating new Azure bucket connection" component=sidecar
ts=2024-09-06T14:47:32.917182156Z caller=sidecar.go:401 level=warn err="create AZURE client: decode account key: illegal base64 data at input byte 0" msg="Failed to create bucket. Sidecar will start without upload feature and will retry later."
ts=2024-09-06T14:48:02.917020664Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2024-09-06T14:48:02.917300082Z caller=azure.go:149 level=debug msg="creating new Azure bucket connection" component=sidecar
ts=2024-09-06T14:48:02.917340791Z caller=sidecar.go:401 level=warn err="create AZURE client: decode account key: illegal base64 data at input byte 0" msg="Failed to create bucket. Sidecar will start without upload feature and will retry later."

Sidecar doesn't crash.

Metric:

# HELP thanos_sidecar_shipper_up Boolean indicator whether the sidecar shipper is running.
# TYPE thanos_sidecar_shipper_up gauge
thanos_sidecar_shipper_up 0

amaury-d avatar Sep 09 '24 08:09 amaury-d

Do all object storage clients attempt to create the bucket?

Edit: I spot checked the s3 provider and it doesnt create a bucket - so this would likely break that, right? As in - if we start with S3 we will never create the bucket and human interaction is needed. FWIW: we have a sidecar that creates the bucket if it doesnt exist.

MichaHoffmann avatar Sep 09 '24 08:09 MichaHoffmann

So, in objstore factory.go, I see that all objstore call NewBucket

	case string(GCS):
		bucket, err = gcs.NewBucket(context.Background(), logger, config, component)
	case string(S3):
		bucket, err = s3.NewBucket(logger, config, component)
	case string(AZURE):
		bucket, err = azure.NewBucket(logger, config, component)
	case string(SWIFT):
		bucket, err = swift.NewContainer(logger, config)
	case string(COS):
		bucket, err = cos.NewBucket(logger, config, component)
	case string(ALIYUNOSS):
		bucket, err = oss.NewBucket(logger, config, component)
	case string(FILESYSTEM):
		bucket, err = filesystem.NewBucketFromConfig(config)
	case string(BOS):
		bucket, err = bos.NewBucket(logger, config, component)
	case string(OCI):
		bucket, err = oci.NewBucket(logger, config)
	case string(OBS):
		bucket, err = obs.NewBucket(logger, config)

The PR moved the call to NewBucket that occurs for all provider from the startup to the scheduled upload part so it shouldn't break S3.

amaury-d avatar Sep 09 '24 08:09 amaury-d

So, in objstore factory.go, I see that all objstore call NewBucket

	case string(GCS):
		bucket, err = gcs.NewBucket(context.Background(), logger, config, component)
	case string(S3):
		bucket, err = s3.NewBucket(logger, config, component)
	case string(AZURE):
		bucket, err = azure.NewBucket(logger, config, component)
	case string(SWIFT):
		bucket, err = swift.NewContainer(logger, config)
	case string(COS):
		bucket, err = cos.NewBucket(logger, config, component)
	case string(ALIYUNOSS):
		bucket, err = oss.NewBucket(logger, config, component)
	case string(FILESYSTEM):
		bucket, err = filesystem.NewBucketFromConfig(config)
	case string(BOS):
		bucket, err = bos.NewBucket(logger, config, component)
	case string(OCI):
		bucket, err = oci.NewBucket(logger, config)
	case string(OBS):
		bucket, err = obs.NewBucket(logger, config)

The PR moved the call to NewBucket that occurs for all provider from the startup to the scheduled upload part so it shouldn't break S3.

New bucket creates a new implementation of the objstore bucket interface - that does not mean that they create a bucket in object storage i think!

MichaHoffmann avatar Sep 09 '24 09:09 MichaHoffmann

Note: pull request updated to make this new behavior disabled by default and enabled only via --shipper.retry-init option

amaury-d avatar Sep 26 '24 08:09 amaury-d

@amaury-d can you solve the conflict please ;). I saw @saswatamcode at the Thanos booth during KubeCon. He will take another look 🙏

Nexucis avatar Apr 04 '25 09:04 Nexucis

Hello @saswatamcode,

sorry to bothering you with this PR again. Is it possible for you to take a look at it in the coming weeks ?

It would be great for us to be able to move forward on that one :).

Have a great day @saswatamcode !

Cheers,

Nexucis avatar Jun 05 '25 09:06 Nexucis