browsertrix-crawler
replace minio client with aws sdk
The Minio client has issues connecting to different providers (e.g. Cloudflare R2, see https://github.com/minio/minio-js/issues/619#issuecomment-326158139). Moving to the AWS SDK solves that problem. It should also support files up to 5GB, so it helps with https://github.com/webrecorder/browsertrix-crawler/issues/479 but doesn't solve it completely: larger files (up to 5TB) will probably need multipart upload. Still, it's a first step toward handling big files, and it's already a big improvement over the current Minio client because it works with S3-compatible providers other than AWS and Minio.
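To make the size limits above concrete, here is a minimal sketch of how an uploader could decide between a single PUT and a multipart upload. It is not code from this PR: `planUpload`, `UploadPlan` and the 64 MiB default part size are hypothetical names and values chosen for illustration. The constants are the limits AWS documents for S3 (5 GiB per single PUT, 5 MiB to 5 GiB per part, at most 10,000 parts, 5 TiB per object); other S3-compatible providers may differ.

```typescript
// Limits as documented for AWS S3. Other S3-compatible providers
// may use different values, so treat these as assumptions.
const MiB = 1024 * 1024;
const GiB = 1024 * MiB;
const TiB = 1024 * GiB;

const MAX_SINGLE_PUT = 5 * GiB; // largest object a single PUT accepts
const MIN_PART_SIZE = 5 * MiB; // smallest part allowed (except the last one)
const MAX_PART_SIZE = 5 * GiB; // largest part allowed
const MAX_PARTS = 10000; // maximum number of parts per upload
const MAX_OBJECT_SIZE = 5 * TiB; // largest object overall

type UploadPlan =
  | { kind: "single" }
  | { kind: "multipart"; partSize: number; partCount: number };

// Hypothetical helper: choose an upload strategy for a file of
// `fileSize` bytes. `preferredPartSize` is an illustrative default.
function planUpload(
  fileSize: number,
  preferredPartSize: number = 64 * MiB,
): UploadPlan {
  if (!Number.isInteger(fileSize) || fileSize < 0) {
    throw new RangeError("fileSize must be a non-negative integer");
  }
  if (fileSize > MAX_OBJECT_SIZE) {
    throw new RangeError("file exceeds the maximum object size");
  }
  if (fileSize <= MAX_SINGLE_PUT) {
    return { kind: "single" };
  }
  // Grow the part size if needed so the file fits within MAX_PARTS,
  // then clamp it to the allowed part size range.
  const minForPartCount = Math.ceil(fileSize / MAX_PARTS);
  const partSize = Math.min(
    MAX_PART_SIZE,
    Math.max(MIN_PART_SIZE, preferredPartSize, minForPartCount),
  );
  return {
    kind: "multipart",
    partSize,
    partCount: Math.ceil(fileSize / partSize),
  };
}
```

With these assumptions, a 1 GiB WACZ would go up as a single PUT, while a 6 GiB one would be split into 96 parts of 64 MiB. In practice an uploader may switch to multipart well below the 5 GiB hard limit, since smaller parts can be retried individually.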
Thanks, we're just doing some additional testing and should be able to merge this soon.
It does seem like Minio's fPutObject will not do automated multi-part upload, while the AWS S3 client has https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#upload-property, a higher-level command that can automatically use multipart upload if necessary. Switching to that may address some of the remaining issues.
> It does seem like Minio's fPutObject will not do automated multi-part upload
Er, no, that's not quite right. fPutObject does support multi-part upload (was it working previously?), see https://min.io/docs/minio/linux/developers/javascript/API.html#fPutObject, so we probably do want to switch to the corresponding AWS S3 API to ensure that we don't break larger uploads. The WACZ files can get large (and there's no sizeLimit by default).