parallel backend to s3 uploads
- we've discussed using TransferManager in the past
- afaik it should also do some sort of checksum validation (check the docs/implementation)
- when operating like this we have the whole file from the user; we should compute/display/store the checksum on that
@kosarko (@vidiecan ) I've spent some time trying to use TransferManager directly to upload a large file using multipart upload and then validate the checksum.
As far as I know, there are some limitations that prevent this approach from working effectively:
- DSpace 7/8/9 uses AWS SDK v1, which has limitations when it comes to validating checksums during multipart uploads. AWS SDK v2 seems to offer a solution, but I'm not confident enough to upgrade the SDK version in the upstream code. We could consider opening an issue in the Vanilla repository to discuss it.
- After calling TransferManager.upload(), the checksum can be retrieved by fetching the metadata of the uploaded file. However, the final checksum is calculated differently compared to how it's calculated locally. Here's an explanation from ChatGPT:
  > ETag behavior: For objects uploaded using a single PUT operation (i.e., not multipart), the ETag typically represents the MD5 checksum of the object data. However, for multipart uploads (which are automatically used by TransferManager for large files), the ETag is not a straightforward MD5 checksum. Instead, it's a composite of the MD5 checksums of the individual parts, concatenated and then hashed again. This means the ETag won't match the MD5 checksum of the original file.
In our current solution, we compare the checksums of the uploaded parts. As far as I know, this seems to be the only viable way to validate the checksum of an uploaded file using AWS SDK v1.
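For reference, here's a minimal sketch of how that composite ETag can be reproduced locally. This mirrors the commonly observed S3 behaviour (MD5 over the concatenated binary part digests, with the part count appended); it's not an official contract, and the part size has to match whatever part size TransferManager actually used:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class MultipartEtagSketch {

    // partSize must match the part size the upload actually used
    // (e.g. TransferManager's configured multipart part size); for files
    // uploaded with a single PUT the ETag is just the plain MD5 instead.
    static String compute(Path file, long partSize) throws Exception {
        MessageDigest etagDigest = MessageDigest.getInstance("MD5"); // MD5 over the part digests
        int partCount = 0;
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            while (true) {
                MessageDigest partDigest = MessageDigest.getInstance("MD5");
                long partBytes = 0;
                int read = 0;
                // read up to partSize bytes for the current part
                while (partBytes < partSize
                        && (read = in.read(buffer, 0, (int) Math.min(buffer.length, partSize - partBytes))) != -1) {
                    partDigest.update(buffer, 0, read);
                    partBytes += read;
                }
                if (partBytes > 0) {
                    etagDigest.update(partDigest.digest());
                    partCount++;
                }
                if (read == -1) {
                    break; // end of file
                }
            }
        }
        return toHex(etagDigest.digest()) + "-" + partCount;
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```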
I'll check with CESNET; there were some talks of making the file ("total") checksum available in some way.
Meanwhile, a comment: Java AWS SDK v1 is EOL (https://aws.amazon.com/blogs/developer/announcing-end-of-support-for-aws-sdk-for-java-v1-x-on-december-31-2025/). There's a migration tool (https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/migration-tool.html), though I'm not sure how much help it'll be.
I suggest improving it if CESNET provides a total checksum for the multipart upload or if AWS SDK 2 is implemented in DSpace. I've created an upstream issue about AWS SDK 2: https://github.com/DSpace/DSpace/issues/10819
@milanmajchrak AWS SDK v2 has been backported to 7.6(.6): https://github.com/DSpace/DSpace/pull/11532. It's nice that the PR is reasonably small.
A few things I've noticed about it:
- creates the bucket if it doesn't exist (but what about bucket config like versioning and whatever we need for presigned requests)
- computes MD5 (via DigestInputStream) in put
- about() effectively streams (downloads to /dev/null) the file to get the MD5 checksum (the checksum checker eventually uses about(), so the comparison it does makes sense)
According to https://docs.aws.amazon.com/sdkref/latest/guide/feature-dataintegrity.html:
> The latest versions of our AWS SDKs and AWS CLI automatically calculate a cyclic redundancy check (CRC)-based checksum for each upload and sends it to Amazon S3. Amazon S3 independently calculates a checksum on the server side and validates it against the provided value before durably storing the object and its checksum in the object's metadata.
So I'm thinking this gives us: something the user can compute locally (the MD5); (implicit) validation that the chunks arrive as sent (the default CRC check on the chunks); and, via the checksum checker, we effectively stream the file through DigestInputStream, which should verify that the downloaded file has the same MD5.
The CRC default is an SDK feature; it doesn't require anything new from the service (S3). It should work against Ceph the same way.
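For completeness, a minimal configuration sketch of where that default lives. I'm assuming the requestChecksumCalculation / responseChecksumValidation builder options described on that data-integrity page (worth double-checking against the SDK version the backport actually pins); the endpoint URL is just a placeholder:

```java
import java.net.URI;
import software.amazon.awssdk.core.checksums.RequestChecksumCalculation;
import software.amazon.awssdk.core.checksums.ResponseChecksumValidation;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class S3ClientFactorySketch {

    // Builds a client with the checksum behaviour spelled out explicitly.
    // WHEN_SUPPORTED is the documented default; WHEN_REQUIRED is the usual
    // fallback if a third-party endpoint rejects the new x-amz-checksum-* headers.
    public static S3Client build() {
        return S3Client.builder()
                .region(Region.EU_CENTRAL_1)
                .endpointOverride(URI.create("https://s3.example.org")) // placeholder endpoint
                .requestChecksumCalculation(RequestChecksumCalculation.WHEN_SUPPORTED)
                .responseChecksumValidation(ResponseChecksumValidation.WHEN_SUPPORTED)
                .build();
    }
}
```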
I'll also comment on https://github.com/dataquest-dev/DSpace/issues/973
@kosarko Thank you for mentioning the PR. I checked it, and I think it looks better than using AWS SDK v1. However, there is still some work that needs to be done, and it also needs to be tested with CESNET.
What I noticed:
1. The _put method no longer creates a temporary file, so some changes will be required to support the synchronization feature that stores the file locally.
2. The _put method also does not compute the checksum of the uploaded file locally. We want to compute it locally and compare it with the checksum generated by S3, so further updates are necessary.
3. The _about method currently needs to download the entire file to compute the checksum. We also need an update to _put so that it stores the checksum (computed using the correct algorithm) in the object metadata on S3. After this update, the _about method should be able to retrieve the checksum from metadata without downloading the whole file. This will need to be tested with CESNET (a rough sketch follows this list).
4. These changes could also be merged upstream.
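A rough sketch of the direction point 3 describes, assuming AWS SDK v2 is available; the metadata keys (`checksum`, `checksum-algorithm`) are illustrative names, not what the PR actually uses:

```java
import java.nio.file.Path;
import java.util.Map;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectResponse;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ChecksumMetadataSketch {

    // _put side: store the locally computed MD5 as user metadata
    // (sent as x-amz-meta-* headers) alongside the object.
    static void putWithChecksum(S3Client s3, String bucket, String key,
                                Path file, String localMd5Hex) {
        PutObjectRequest request = PutObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .metadata(Map.of("checksum", localMd5Hex,
                                 "checksum-algorithm", "MD5"))
                .build();
        s3.putObject(request, RequestBody.fromFile(file));
    }

    // _about side: read the checksum back with a HEAD request,
    // without downloading the object body.
    static String checksumFromMetadata(S3Client s3, String bucket, String key) {
        HeadObjectResponse head = s3.headObject(
                HeadObjectRequest.builder().bucket(bucket).key(key).build());
        return head.metadata().get("checksum");
    }
}
```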
> 2. The _put method also does not compute the checksum of the uploaded file locally. We want to compute it locally and compare it with the checksum generated by S3, so further updates are necessary.
The InputStream (which wraps the file) is wrapped in a DigestInputStream, which computes the MD5. AFAIK this is the same process that computes the MD5 for the files we store locally (the difference is that in one case the InputStream is poured onto local disk and in the other case into S3).
If we are talking about the ETags (the checksums of individual parts, and the checksum of checksums), the SDK computes and verifies these (if I'm not wrong). We do not store them.
Am I missing something?
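To make that concrete, a minimal sketch of the pattern I mean (not the actual code from the PR); the same wrapping would work whether the bytes end up on local disk or in S3, which is why the resulting MD5 is comparable to what we store for local assetstore files:

```java
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class DigestPutSketch {

    // Streams the bitstream into S3 while computing its MD5 on the fly.
    static String putAndDigest(S3Client s3, String bucket, String key,
                               InputStream in, long contentLength) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (DigestInputStream digestIn = new DigestInputStream(in, md5)) {
            s3.putObject(
                    PutObjectRequest.builder().bucket(bucket).key(key).build(),
                    // the content length has to be known up front for a streaming RequestBody
                    RequestBody.fromInputStream(digestIn, contentLength));
        }
        // hex-encode the digest; this is the value the user can also compute locally
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```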
> 3. The _about method currently needs to download the entire file to compute the checksum. We also need an update to _put so that it stores the checksum (computed using the correct algorithm) in the object metadata on S3. After this update, the _about method should be able to retrieve the checksum from metadata without downloading the whole file. This will need to be tested with CESNET.
Say I don't trust the storage and want to validate that the file is what I think it is: what do I do? If _about doesn't download (or stream, since nothing is actually stored locally) the file, then I can't use the checksum checker, right?
I might have forgotten some context from our previous discussions...
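For reference, this is the kind of streaming validation I have in mind for the checksum checker: nothing gets written to disk, but the MD5 of whatever the storage actually returns is compared against the expected value. A sketch assuming SDK v2, not actual DSpace code:

```java
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

public class StreamingVerifySketch {

    // Reads the object through a DigestInputStream and discards the bytes,
    // then compares the resulting MD5 against the expected (stored) checksum.
    static boolean verify(S3Client s3, String bucket, String key,
                          String expectedMd5Hex) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(bucket).key(key).build();
        try (InputStream raw = s3.getObject(request);
             DigestInputStream in = new DigestInputStream(raw, md5)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // discard; only the digest matters
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString().equalsIgnoreCase(expectedMd5Hex);
    }
}
```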