
S3 upload and region error

Open cmillet2127 opened this issue 1 year ago • 5 comments

Hello, I ran some tests with S3 upload. It works very well with a bucket stored in us-east-1, but it doesn't work for other regions, in my case eu-west-3. I tried to analyze the code, and it seems to fail in the MinIO call in util/storage.ts at line 90:

    await this.client.fPutObject(
      this.bucketName,
      this.objectPrefix + targetFilename,
      srcFilename,
    );

where the region can optionally be provided; otherwise us-east-1 is used by default.

The region can be extracted from the STORE_ENDPOINT_URL, for example: https://<bucket>.s3.<region>.amazonaws.com/<bucket>/
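A minimal sketch of that extraction, assuming a regional AWS endpoint in either virtual-hosted style (`<bucket>.s3.<region>.amazonaws.com`) or path style (`s3.<region>.amazonaws.com`); the helper name is hypothetical, not part of the crawler:

```typescript
// Hypothetical helper (not in util/storage.ts): parse the region out of an
// S3 endpoint URL. Returns null for the global s3.amazonaws.com endpoint
// or for non-AWS endpoints, where autodiscovery/default would apply.
function extractRegionFromEndpoint(endpointUrl: string): string | null {
  const host = new URL(endpointUrl).hostname;
  // Matches "...s3.<region>.amazonaws.com" or "...s3-<region>.amazonaws.com"
  const match = host.match(/(?:^|\.)s3[.-]([a-z0-9-]+)\.amazonaws\.com$/);
  return match ? match[1] : null;
}
```

For `https://mybucket.s3.eu-west-3.amazonaws.com/mybucket/` this would yield `eu-west-3`, while the global `https://s3.amazonaws.com/` endpoint yields `null`.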

Best regards

cmillet2127 avatar Mar 26 '24 16:03 cmillet2127

We are planning to switch to AWS CLI SDK, I think this will be addressed with that change. See: #479

ikreymer avatar Mar 26 '24 22:03 ikreymer

Hi @cmillet2127, based on a discussion in the minio-js repo I think the crawler should work as-is and minio-js will autodiscover the bucket if you use s3.amazonaws.com as the STORE_ENDPOINT_URL. Want to try that out and let us know if it works?

tw4l avatar Apr 15 '24 21:04 tw4l

The change in #543 has broken it for me in the 'eu-central-1' region. Now I'm getting an error:

S3Error: The authorization header is malformed; the region 'auto' is wrong; expecting 'eu-central-1'

After going back to 1.1.0 Beta 5, it's working fine. This is my env configuration:

environment: [
  { name: 'STORE_ENDPOINT_URL', value: 'https://s3.amazonaws.com/' + process.env.S3_BUCKET },
  { name: 'STORE_PATH', value: '/' },
  { name: 'STORE_FILENAME', value: item.s3_key },
  { name: 'STORE_ACCESS_KEY', value: process.env.S3_ACCESS_KEY },
  { name: 'STORE_SECRET_KEY', value: process.env.S3_SECRET },
  { name: 'CRAWL_ID', value: crawlId },
  { name: 'WEBHOOK_URL', value: process.env.WEBRECORDER_HOOK }
]

Perhaps just adding a STORE_REGION variable would be a better solution? @ikreymer @tw4l
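One way such a variable could be wired in, as a sketch only (STORE_REGION is the proposed variable, and resolveRegion is a hypothetical helper, not existing crawler code): prefer an explicit STORE_REGION, then try to parse the region from the endpoint URL, then fall back to the minio-js default of us-east-1.

```typescript
// Hypothetical resolution order for the bucket region. STORE_REGION is the
// proposed env variable from this discussion; everything else falls back to
// parsing STORE_ENDPOINT_URL, then to the "us-east-1" default.
function resolveRegion(env: Record<string, string | undefined>): string {
  if (env.STORE_REGION) {
    return env.STORE_REGION;
  }
  const host = env.STORE_ENDPOINT_URL
    ? new URL(env.STORE_ENDPOINT_URL).hostname
    : "";
  const match = host.match(/(?:^|\.)s3[.-]([a-z0-9-]+)\.amazonaws\.com$/);
  return match ? match[1] : "us-east-1";
}
```

With this precedence, an explicit STORE_REGION would override whatever the endpoint URL suggests, which avoids depending on any one provider's hostname conventions.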

RomanSmolka avatar May 07 '24 11:05 RomanSmolka

@RomanSmolka agreed, that seems like a safer solution. This probably needs testing with a lot more S3 providers, which we just don't have resources to do. Making this customizable seems like the most flexible/safe option. @cmillet2127 @mguella would that work for you also?

ikreymer avatar May 07 '24 11:05 ikreymer

Yes, that seems like a good solution. Even if we were able to use the generic URL to access Amazon S3, it could incur additional charges for routing between AWS regions. So it would definitely be better to be able to specify the region.

cmillet2127 avatar May 07 '24 11:05 cmillet2127