browsertrix-crawler
S3 upload and region error
Hello, I made some tests with S3 upload. It works very well with a bucket stored in us-east-1, but doesn't work for other regions, in my case eu-west-3. I tried to analyze the code, and it seems to fail in the MinIO call in util/storage.ts line 90:
await this.client.fPutObject(
this.bucketName,
this.objectPrefix + targetFilename,
srcFilename,
);
where the region can optionally be provided; otherwise us-east-1 is used by default.
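For illustration, here is a minimal sketch (not the crawler's actual code; it relies on the region option that minio-js's Client constructor accepts) of pinning the region explicitly so uploads are not signed for us-east-1:

import { Client } from "minio";

// Sketch only: bucketName, targetFilename, and srcFilename are placeholders
// standing in for the crawler's own variables.
const client = new Client({
  endPoint: "s3.eu-west-3.amazonaws.com",
  useSSL: true,
  accessKey: process.env.STORE_ACCESS_KEY!,
  secretKey: process.env.STORE_SECRET_KEY!,
  region: "eu-west-3", // without this, minio-js defaults to us-east-1
});

const bucketName = "my-bucket";
const targetFilename = "crawl.wacz";
const srcFilename = "/crawls/collections/crawl/crawl.wacz";

await client.fPutObject(bucketName, targetFilename, srcFilename);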
The region can be extracted from the STORE_ENDPOINT_URL, for example: https://<bucket>.s3.<region>.amazonaws.com/<bucket>/
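A hedged sketch of how that extraction could look (the helper name is hypothetical, and the regex only covers the common s3.<region>.amazonaws.com hostname pattern; legacy global endpoints such as s3.amazonaws.com yield undefined):

// Hypothetical helper: pull the region out of an AWS S3 endpoint URL.
function regionFromEndpoint(endpointUrl: string): string | undefined {
  const host = new URL(endpointUrl).hostname;
  const match = host.match(/(?:^|\.)s3[.-]([a-z0-9-]+)\.amazonaws\.com$/);
  return match ? match[1] : undefined;
}

// e.g. regionFromEndpoint("https://mybucket.s3.eu-west-3.amazonaws.com/mybucket/")
// returns "eu-west-3"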
Best regards
We are planning to switch to the AWS SDK, and I think this will be addressed with that change. See: #479
Hi @cmillet2127, based on a discussion in the minio-js repo, I think the crawler should work as-is: minio-js will autodiscover the bucket's region if you use s3.amazonaws.com as the STORE_ENDPOINT_URL. Want to try that out and let us know if it works?
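(For reference, that would mean an endpoint along the lines of STORE_ENDPOINT_URL=https://s3.amazonaws.com/<bucket>/, i.e. the generic hostname rather than a region-specific one; the exact bucket path depends on your setup.)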
The change in #543 has broken it for me in the 'eu-central-1' region. Now I'm getting an error:
S3Error: The authorization header is malformed; the region 'auto' is wrong; expecting 'eu-central-1'
After going back to 1.1.0 Beta 5, it's working fine. This is my env configuration:
environment: [
{ name: 'STORE_ENDPOINT_URL', value: 'https://s3.amazonaws.com/' + process.env.S3_BUCKET },
{ name: 'STORE_PATH', value: '/' },
{ name: 'STORE_FILENAME', value: item.s3_key },
{ name: 'STORE_ACCESS_KEY', value: process.env.S3_ACCESS_KEY },
{ name: 'STORE_SECRET_KEY', value: process.env.S3_SECRET },
{ name: 'CRAWL_ID', value: crawlId },
{ name: 'WEBHOOK_URL', value: process.env.WEBRECORDER_HOOK }
]
Perhaps just adding a STORE_REGION variable would be a better solution? @ikreymer @tw4l
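A hedged sketch of what that could look like in util/storage.ts (STORE_REGION is a proposed variable that does not exist yet; the region option itself is part of minio-js's Client constructor):

import { Client } from "minio";

// Proposed, not existing: let the operator pin the signing region instead of
// relying on autodetection ("auto") or the us-east-1 default.
const region = process.env.STORE_REGION || "us-east-1";

const client = new Client({
  endPoint: "s3.amazonaws.com",
  useSSL: true,
  accessKey: process.env.STORE_ACCESS_KEY!,
  secretKey: process.env.STORE_SECRET_KEY!,
  region,
});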
@RomanSmolka agreed, that seems like a safer solution. This probably needs testing with a lot more S3 providers, which we just don't have resources to do. Making this customizable seems like the most flexible/safe option. @cmillet2127 @mguella would that work for you also?
Yes, that seems like a good solution. Even if we were able to use the generic URL to access Amazon S3, it could incur additional charges for routing between AWS regions. So it would definitely be better to be able to specify the region.