
replace minio client with aws sdk

Open mguella opened this issue 10 months ago • 2 comments

The Minio client has issues connecting to some S3-compatible providers (e.g. Cloudflare R2, see https://github.com/minio/minio-js/issues/619#issuecomment-326158139). Moving to the AWS SDK solves that problem. It should also support files up to 5GB, which helps with https://github.com/webrecorder/browsertrix-crawler/issues/479 but doesn't solve it completely: for bigger files (up to 5TB), multipart upload is probably needed. Still, it's a first step toward handling big files, and it's already a big improvement over the current Minio client because it allows using S3-compatible providers other than AWS and Minio.
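To make the 5GB vs 5TB distinction concrete: per the S3 limits, a single PUT tops out at 5 GiB, while a multipart upload allows objects up to 5 TiB split into at most 10,000 parts of at least 5 MiB each (the last part may be smaller). A small sketch (the helper name is ours, not from either SDK) of the smallest whole-MiB part size that keeps a file under the part-count limit:

```javascript
const MiB = 1024 ** 2;
const GiB = 1024 ** 3;

// S3 multipart limits: at most 10,000 parts, each at least 5 MiB
// (except the last), for objects up to 5 TiB.
const MAX_PARTS = 10000;
const MIN_PART_SIZE = 5 * MiB;

// Smallest part size (rounded up to a whole MiB) that fits fileSize
// into at most MAX_PARTS parts.
function minPartSize(fileSize) {
  const raw = Math.ceil(fileSize / MAX_PARTS);
  const rounded = Math.ceil(raw / MiB) * MiB;
  return Math.max(MIN_PART_SIZE, rounded);
}

console.log(minPartSize(4 * GiB) / MiB);        // 5 (the 5 MiB floor applies)
console.log(minPartSize(5 * 1024 * GiB) / MiB); // 525 for a full 5 TiB object
```

So any fixed part size a client picks caps the maximum object size it can upload, which is why a helper that scales the part size automatically (or picks a generous default) matters for large WACZ files.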

mguella avatar Apr 16 '24 08:04 mguella

Thanks, we're just doing some additional testing and should be able to merge this soon.

It does seem like Minio's fPutObject will not do automated multi-part upload, while the AWS S3 client has https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#upload-property, a higher-level command that can automatically use multipart upload if necessary. Switching to that may address some of the remaining issues.
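For reference, the docs linked above are for the v2 `S3.upload` helper; in AWS SDK for JavaScript v3 the equivalent is the `Upload` class from `@aws-sdk/lib-storage`, which splits the body into parts and uploads them in parallel when needed. A sketch, with hypothetical endpoint, bucket, and key values:

```javascript
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import { createReadStream } from "node:fs";

// Endpoint, bucket, and key below are illustrative placeholders;
// credentials are picked up from the environment by default.
const client = new S3Client({
  region: "us-east-1",
  endpoint: "https://s3.example.com", // also works for R2 and other S3-compatible stores
  forcePathStyle: true,
});

const upload = new Upload({
  client,
  params: {
    Bucket: "crawls",
    Key: "archive/crawl.wacz",
    Body: createReadStream("crawl.wacz"),
  },
  // Upload switches to multipart automatically once the body
  // exceeds one part; these tuning knobs are optional.
  partSize: 100 * 1024 * 1024, // 100 MiB per part
  queueSize: 4,                // parts uploaded concurrently
});

await upload.done();
```

This cannot run without the `@aws-sdk` packages and live credentials, so treat it as a shape of the API rather than a drop-in snippet.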

ikreymer avatar May 22 '24 23:05 ikreymer

It does seem like Minio's fPutObject will not do automated multi-part upload

Er, no, that's not quite right. fPutObject does support multi-part upload (was it working previously?): https://min.io/docs/minio/linux/developers/javascript/API.html#fPutObject. So we probably do want to switch to the corresponding AWS S3 API to ensure that we don't break larger uploads. WACZ files can get large, and there's no sizeLimit by default.

ikreymer avatar May 22 '24 23:05 ikreymer