
Source controller listing objects from S3 bucket fails, probably due to the large size of the bucket

Open veeraghgoudar opened this issue 2 years ago • 3 comments

We are running Flux version 0.28.5. We are using an S3 bucket as a source, and the bucket is around 2 TB in size.

The reconciliation always fails with the error below:

flux reconcile source bucket s3-bucket-name --verbose
► annotating Bucket s3-bucket-name in flux-system namespace
✔ Bucket annotated
◎ waiting for Bucket reconciliation
✗ Bucket reconciliation failed: 'indexation of objects from bucket 's3-bucket-name' failed: listing objects from bucket 's3-bucket-name' failed: Get "https://s3.dualstack.us-east-1.amazonaws.com/s3-bucket-name/?continuation-token=1lTiNsKzHoVAVOu0PmalPgNkJEFDnybzDBu8XuqkkoZlCP7DtzXiQm%!!(MISSING)B(MISSING)Hea2BxSsoEprb4N3Wm%!!(MISSING)F(MISSING)3EYVLJ18P%!!(MISSING)F(MISSING)KRR0XAJe6kkXPw%!!(MISSING)D(MISSING)%!!(MISSING)D(MISSING)&delimiter=&encoding-type=url&fetch-owner=true&list-type=2&prefix=": context deadline exceeded'

I have tried adding the ignore rules as shown below, but it did not work. I have also tried increasing the timeout up to 10m for testing purposes.

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: Bucket
metadata:
  name: s3-bucket-name
  namespace: flux-system
spec:
  interval: 30s
  provider: aws
  bucketName: s3-bucket-name
  endpoint: s3.amazonaws.com
  region: us-east-1
  timeout: 300s
  ignore: |
    # exclude all
    /*
    # include flux dir
    !/flux
    # exclude file extensions from deploy dir
    /flux/**/*.md
    /flux/**/*.txt

I am not sure whether the ignore option is included in the filter here - https://github.com/fluxcd/source-controller/blob/main/pkg/minio/minio.go#L112

When I test with a different S3 bucket that is much smaller in size, the reconciliation works fine. Is there a way to get this working?

veeraghgoudar avatar Apr 27 '22 17:04 veeraghgoudar

I guess you don't have 2TB of Kubernetes YAMLs in there? I would create a dedicated bucket for Flux and have a Lambda function that syncs the YAML files from the 2TB bucket to the Flux one.
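
A minimal sketch of what such a sync could look like, assuming Go with the AWS SDK v2 and the Lambda Go runtime; the bucket names, the flux/ prefix, and the .yaml suffix filter below are placeholders for illustration, not values from this issue:

package main

import (
	"context"
	"strings"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

const (
	srcBucket = "s3-bucket-name"      // the large 2 TB bucket (placeholder)
	dstBucket = "s3-bucket-name-flux" // hypothetical dedicated Flux bucket
	prefix    = "flux/"               // assumed location of the manifests
)

// handler copies YAML manifests from the large bucket into the small,
// dedicated bucket that the Flux Bucket source would then point at.
func handler(ctx context.Context) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	client := s3.NewFromConfig(cfg)

	paginator := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
		Bucket: aws.String(srcBucket),
		Prefix: aws.String(prefix), // server-side: only keys under flux/ are listed
	})
	for paginator.HasMorePages() {
		page, err := paginator.NextPage(ctx)
		if err != nil {
			return err
		}
		for _, obj := range page.Contents {
			key := aws.ToString(obj.Key)
			if !strings.HasSuffix(key, ".yaml") && !strings.HasSuffix(key, ".yml") {
				continue // skip non-manifest artifacts
			}
			// Note: keys containing special characters would need URL-encoding
			// in CopySource; omitted here to keep the sketch short.
			if _, err := client.CopyObject(ctx, &s3.CopyObjectInput{
				Bucket:     aws.String(dstBucket),
				Key:        aws.String(key),
				CopySource: aws.String(srcBucket + "/" + key),
			}); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	lambda.Start(handler)
}

Triggering it from S3 event notifications or a schedule is left out; the point is only that the Bucket source then lists a small, dedicated bucket instead of the 2 TB one.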

I have tried adding the ignore rules as shown below, but it did not work.

To ignore files, we need to fetch all the file paths from the bucket; if you have a billion files in there, then it takes time, hours maybe.
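
To make that concrete, here is an illustrative Go sketch (not the actual source-controller code) using the minio-go client and a gitignore-style matcher, with the bucket, region, and ignore patterns from this issue as placeholders. The ignore rules can only be evaluated after each key has already been returned by the listing, which is why a huge bucket is slow regardless:

package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	"github.com/go-git/go-git/v5/plumbing/format/gitignore"
	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Credentials via the instance/IAM role; adjust to your environment.
	client, err := minio.New("s3.amazonaws.com", &minio.Options{
		Creds:  credentials.NewIAM(""),
		Secure: true,
		Region: "us-east-1",
	})
	if err != nil {
		log.Fatal(err)
	}

	// The ignore rules from the Bucket spec, parsed as gitignore patterns.
	var patterns []gitignore.Pattern
	for _, p := range []string{"/*", "!/flux", "/flux/**/*.md", "/flux/**/*.txt"} {
		patterns = append(patterns, gitignore.ParsePattern(p, nil))
	}
	matcher := gitignore.NewMatcher(patterns)

	// Every key still has to come back over the wire before the ignore
	// rules can be applied -- the filtering is purely client-side.
	for obj := range client.ListObjects(context.Background(), "s3-bucket-name",
		minio.ListObjectsOptions{Recursive: true}) {
		if obj.Err != nil {
			log.Fatal(obj.Err)
		}
		if matcher.Match(strings.Split(obj.Key, "/"), false) {
			continue // ignored, but listing it was already paid for
		}
		fmt.Println("would fetch:", obj.Key)
	}
}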

stefanprodan avatar Apr 28 '22 07:04 stefanprodan

As Stefan said, a Bucket is more like an enriched key/value store, so we have to iterate over every key to see if it's a match; we can't e.g. "skip directories". The only trick left in this area that might help in your case is if we were to support defining a "prefix" that keys must match. That would make the filtering a server-side operation and decrease the number of iterations we have to do.
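
For reference, the underlying minio-go listing call already accepts a server-side prefix, so such a feature would map onto it directly. A hypothetical sketch (reusing the client from the previous snippet; the Bucket API had no such field at the time of this issue):

// listWithPrefix sketches server-side filtering: S3 only returns keys under
// the given prefix, so the client never iterates over the rest of the bucket.
func listWithPrefix(ctx context.Context, client *minio.Client, bucket, prefix string) ([]string, error) {
	var keys []string
	for obj := range client.ListObjects(ctx, bucket, minio.ListObjectsOptions{
		Prefix:    prefix, // e.g. "flux/" -- evaluated by S3, not by the controller
		Recursive: true,
	}) {
		if obj.Err != nil {
			return nil, obj.Err
		}
		keys = append(keys, obj.Key)
	}
	return keys, nil
}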

hiddeco avatar Apr 28 '22 07:04 hiddeco

@stefanprodan: Yes, you are right. The bucket contains a lot of other artifacts, not only YAML files. As you suggested, we will go with a dedicated bucket for Flux.

@hiddeco: If the prefix option is supported in the future, we might start using it. I thought the ignore option would solve this problem, but I was wrong. Thanks for the explanation.

veeraghgoudar avatar Apr 28 '22 08:04 veeraghgoudar