fix(sidecar): Allow sidecar to not crash on startup if objstore is not available
This PR adds a --shipper.retry-init option to the sidecar so that it can keep serving the Prometheus read path when shipper initialization fails at startup (e.g. a network issue reaching the objstore).
On each upload, if the shipper is not yet initialized, initialization is attempted again.
This PR also adds a new metric, thanos_sidecar_shipper_up, to alert on shipper issues.
- [X] I added CHANGELOG entry for this change.
- [X] Change is not relevant to the end user.
Changes
- New --shipper.retry-init option to allow retrying shipper init on each upload
- New thanos_sidecar_shipper_up metric
- Add .bin to .gitignore
Verification
I used thanos-docker-compose, then configured a bogus objstore.config to force a crash.
1. Current situation with broken objstore at startup
ts=2024-09-06T13:30:59.31896362Z caller=main.go:145 level=error err="decode account key: illegal base64 data at input byte 0\ncreate AZURE client\ngithub.com/thanos-io/objstore/client.NewBucket\n\t/Users/adecreme/go/pkg/mod/github.com/thanos-io/[email protected]/client/factory.go:90\nmain.runSidecar\n\t/Users/adecreme/github/thanos/thanos/cmd/thanos/sidecar.go:370\nmain.registerSidecar.func1\n\t/Users/adecreme/github/thanos/thanos/cmd/thanos/sidecar.go:105\nmain.main\n\t/Users/adecreme/github/thanos/thanos/cmd/thanos/main.go:143\nruntime.main\n\t/opt/homebrew/Cellar/[email protected]/1.22.6/libexec/src/runtime/proc.go:271\nruntime.goexit\n\t/opt/homebrew/Cellar/[email protected]/1.22.6/libexec/src/runtime/asm_arm64.s:1222\npreparing sidecar command failed\nmain.main\n\t/Users/adecreme/github/thanos/thanos/cmd/thanos/main.go:145\nruntime.main\n\t/opt/homebrew/Cellar/[email protected]/1.22.6/libexec/src/runtime/proc.go:271\nruntime.goexit\n\t/opt/homebrew/Cellar/[email protected]/1.22.6/libexec/src/runtime/asm_arm64.s:1222"
Then sidecar crashes.
2. With --shipper.retry-init and a broken objstore
ts=2024-09-06T14:47:30.920080081Z caller=grpc.go:167 level=info service=gRPC/server component=sidecar msg="listening for serving gRPC" address=0.0.0.0:10901
ts=2024-09-06T14:47:32.916587861Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2024-09-06T14:47:32.917102072Z caller=azure.go:149 level=debug msg="creating new Azure bucket connection" component=sidecar
ts=2024-09-06T14:47:32.917182156Z caller=sidecar.go:401 level=warn err="create AZURE client: decode account key: illegal base64 data at input byte 0" msg="Failed to create bucket. Sidecar will start without upload feature and will retry later."
ts=2024-09-06T14:48:02.917020664Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2024-09-06T14:48:02.917300082Z caller=azure.go:149 level=debug msg="creating new Azure bucket connection" component=sidecar
ts=2024-09-06T14:48:02.917340791Z caller=sidecar.go:401 level=warn err="create AZURE client: decode account key: illegal base64 data at input byte 0" msg="Failed to create bucket. Sidecar will start without upload feature and will retry later."
Sidecar doesn't crash.
Metric:
# HELP thanos_sidecar_shipper_up Boolean indicator whether the sidecar shipper is running.
# TYPE thanos_sidecar_shipper_up gauge
thanos_sidecar_shipper_up 0
Do all object storage clients attempt to create the bucket?
Edit: I spot-checked the S3 provider and it doesn't create a bucket, so this would likely break that, right? As in: if we start with S3 we will never create the bucket and human intervention is needed. FWIW: we have a sidecar that creates the bucket if it doesn't exist.
So, in objstore factory.go, I see that all objstore providers call NewBucket:
case string(GCS):
	bucket, err = gcs.NewBucket(context.Background(), logger, config, component)
case string(S3):
	bucket, err = s3.NewBucket(logger, config, component)
case string(AZURE):
	bucket, err = azure.NewBucket(logger, config, component)
case string(SWIFT):
	bucket, err = swift.NewContainer(logger, config)
case string(COS):
	bucket, err = cos.NewBucket(logger, config, component)
case string(ALIYUNOSS):
	bucket, err = oss.NewBucket(logger, config, component)
case string(FILESYSTEM):
	bucket, err = filesystem.NewBucketFromConfig(config)
case string(BOS):
	bucket, err = bos.NewBucket(logger, config, component)
case string(OCI):
	bucket, err = oci.NewBucket(logger, config)
case string(OBS):
	bucket, err = obs.NewBucket(logger, config)
The PR moves the NewBucket call, which happens for every provider, from startup to the scheduled upload step, so it shouldn't break S3.
NewBucket creates a new implementation of the objstore Bucket interface. That does not mean it creates a bucket in object storage, I think!
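To illustrate the distinction being drawn here, the sketch below shows a constructor that only validates configuration and returns a client handle, without sending any request to the storage service. All names (`config`, `client`, `newClient`) are hypothetical and this is not the objstore code; whether a given provider's NewBucket also contacts the backend is not established in this thread. The base64 check mirrors the "decode account key" error seen in the Azure logs above.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// config and client are hypothetical types used only for illustration.
type config struct {
	Container  string
	AccountKey string // base64-encoded secret
}

type client struct {
	container string
	key       []byte
}

// newClient validates the config and returns a handle. It makes no request to
// the storage service, so it neither creates nor checks the remote container.
func newClient(cfg config) (*client, error) {
	key, err := base64.StdEncoding.DecodeString(cfg.AccountKey)
	if err != nil {
		return nil, fmt.Errorf("decode account key: %w", err)
	}
	return &client{container: cfg.Container, key: key}, nil
}

func main() {
	// A malformed key fails locally, before any network access would happen.
	_, err := newClient(config{Container: "thanos", AccountKey: "!bogus"})
	fmt.Println("bad config:", err)

	// A well-formed key yields a handle; the remote container is untouched.
	c, err := newClient(config{Container: "thanos", AccountKey: "c2VjcmV0"})
	fmt.Println("good config:", c.container, err)
}
```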
Note: pull request updated to make this new behavior disabled by default and enabled only via --shipper.retry-init option
@amaury-d can you solve the conflict please ;). I saw @saswatamcode at the Thanos booth during KubeCon. He will take another look 🙏
Hello @saswatamcode,
Sorry to bother you with this PR again. Would you be able to take a look at it in the coming weeks?
It would be great for us to be able to move forward on that one :).
Have a great day @saswatamcode !
Cheers,