
manager.Uploader.Upload silently ignores PutObjectInput.ContentMD5

Open eriksw opened this issue 4 years ago • 14 comments

Describe the bug

When uploading files with manager.Uploader, if an incorrect ContentMD5 is set on the PutObjectInput, the behavior depends on the size of the file being uploaded:

  • Uploads of small files (single-part) correctly fail.
  • Uploads of larger files (multi-part) silently and incorrectly succeed.

When PutObjectInput.ContentMD5 does not match the actual Body being uploaded, manager.Uploader's behavior should be consistent regardless of the Body size.

Silently failing to effect my intention as a user (that the object will only appear in the bucket if its overall md5 matches what I set in ContentMD5) is completely unacceptable.

Version of AWS SDK for Go?

github.com/aws/aws-sdk-go-v2 v0.31.0
github.com/aws/aws-sdk-go-v2/config v0.4.0
github.com/aws/aws-sdk-go-v2/feature/s3/manager v0.2.0
github.com/aws/aws-sdk-go-v2/service/s3 v0.31.0

Version of Go (go version)?

go version go1.15.6 darwin/amd64

To Reproduce (observed behavior)

package main

import (
	"context"
	"crypto/md5"
	"encoding/base64"
	"flag"
	"io"
	"io/ioutil"
	"log"
	"math/rand"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/s3/types"
)

var (
	uploader *manager.Uploader
	bucket   string
)

func main() {
	flag.StringVar(&bucket, "bucket", "", "s3 bucket to upload to")
	flag.Parse()

	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		panic(err)
	}
	uploader = manager.NewUploader(s3.NewFromConfig(cfg))

	log.Printf("bucket=%q", bucket)

	log.Println("testing with small upload:")
	withTestFile(manager.DefaultUploadPartSize/2, testUpload)
	log.Println("testing with large upload:")
	withTestFile(manager.DefaultUploadPartSize*4, testUpload)
}

func testUpload(file *os.File, goodDigest string) {
	ctx := context.Background()

	if _, err := file.Seek(0, io.SeekStart); err != nil {
		panic(err)
	}
	goodInput := &s3.PutObjectInput{
		Bucket:               &bucket,
		Key:                  aws.String("test-file"),
		ContentMD5:           &goodDigest,
		ServerSideEncryption: types.ServerSideEncryptionAes256,
		Body:                 file,
	}
	if _, err := uploader.Upload(ctx, goodInput); err != nil {
		log.Printf("Upload with correct ContentMD5 failed (should not happen): %+v", err)
		return
	}

	badDigest := base64.StdEncoding.EncodeToString(make([]byte, 16))

	if _, err := file.Seek(0, io.SeekStart); err != nil {
		panic(err)
	}
	badInput := &s3.PutObjectInput{
		Bucket:               &bucket,
		Key:                  aws.String("test-file"),
		ContentMD5:           &badDigest,
		ServerSideEncryption: types.ServerSideEncryptionAes256,
		Body:                 file,
	}
	if _, err := uploader.Upload(ctx, badInput); err != nil {
		log.Printf("Upload with incorrect ContentMD5 failed (this should happen): %+v", err)
	} else {
		log.Println("Upload with incorrect ContentMD5 succeeded (THIS SHOULD NOT HAPPEN!)")
	}
}

func withTestFile(desiredSize int64, fun func(file *os.File, md5Digest string)) {
	file, err := ioutil.TempFile("", "")
	if err != nil {
		panic(err)
	}
	defer func() {
		name := file.Name()
		if err := file.Close(); err != nil {
			log.Printf("failed to close %q: %s", name, err)
		}
		if err := os.Remove(name); err != nil {
			if !os.IsNotExist(err) {
				log.Printf("failed to clean up %q", name)
			}
		}
	}()

	h := md5.New()
	var written int64
	const bs = 1024
	buf := make([]byte, bs)
	for written < desiredSize {
		_, _ = rand.Read(buf)
		if _, err := h.Write(buf); err != nil {
			panic(err)
		}
		if _, err := file.Write(buf); err != nil {
			panic(err)
		}
		written += bs
	}

	encoded := base64.StdEncoding.EncodeToString(h.Sum(nil))
	fun(file, encoded)
}
$ go run . -bucket REDACTED
2021/01/13 15:00:12 bucket="REDACTED"
2021/01/13 15:00:12 testing with small upload:
2021/01/13 15:00:15 Upload with incorrect ContentMD5 failed (this should happen): operation error S3: PutObject, https response error StatusCode: 400, RequestID: 77DC5560C1178BED, HostID: y69V71ouG4sdHU8HkutkxrJDPu1FsQoNGLCaTkuSDXpehH5ijldLskcMG6SAJMz01rqiWJwD0aE=, api error BadDigest: The Content-MD5 you specified did not match what we received.
2021/01/13 15:00:15 testing with large upload:
2021/01/13 15:00:33 Upload with incorrect ContentMD5 succeeded (THIS SHOULD NOT HAPPEN!)

Expected behavior

The behavior of the ContentMD5 option when using Uploader.Upload does not vary depending on the size of the body.

Any of the following would be fine; the current behavior is not:

  • The upload attempts with the incorrect ContentMD5 fail and the ones with the correct ContentMD5 succeed, regardless of Body size.
  • All upload attempts with a non-nil ContentMD5 fail immediately with an error indicating that the option is not supported by Uploader.
  • Uploader.Upload takes a different input that does not have a ContentMD5 member at all.

Additional context

Reported in aws-sdk-go back in 2019, with no fix: https://github.com/aws/aws-sdk-go/issues/2500

Acknowledged by @jasdel here: https://github.com/aws/aws-sdk-go-v2/issues/343#issuecomment-548099242

eriksw avatar Jan 13 '21 23:01 eriksw

This bug also means that it is not possible to use an Uploader to upload large objects to buckets with default object lock enabled, as it ends up calling UploadPart with no ContentMD5 field. This fails with a message like:

InvalidRequest: Content-MD5 HTTP header is required for Put Part requests with Object Lock parameters

nhinds avatar Jun 28 '21 01:06 nhinds

I've also recently run into this issue using the manager.Uploader to upload to buckets that have object lock enabled.

Content-MD5 HTTP header is required for Put Object requests with Object Lock parameters

Are there any plans to address this issue or provide a workaround? It seems this was not an issue in v1 thanks to computeBodyHashes, which no longer appears to be an option.

Any help would be appreciated.

jfmyers9 avatar Sep 14 '21 21:09 jfmyers9

Thanks for reporting this issue. The SDK's S3 upload manager ignores the provided Content-MD5 input value when the content to upload is larger than the upload manager's part size. The upload manager uses multi-part upload in that case, and the Content-MD5 provided is ignored since it would not be valid for the individual parts.

This is a missing edge case in the upload transfer manager's inputs and behavior. The best workaround at the moment is to not use the transfer manager for uploads, or to raise the transfer manager's upload part size high enough that multi-part upload is not performed. If the transfer manager does not use multi-part upload and instead performs a single object upload, the Content-MD5 input value should be used in the request.

We're looking at ways to redesign the transfer manager to address this and other issues. For now, the transfer manager should not be used when Content-MD5 is required for the object and multipart upload would be performed.

The v2 SDK also supports streaming input for PutObject, and the SDK no longer requires the input to be an io.ReadSeeker.

jasdel avatar Dec 29 '21 17:12 jasdel

I consider it a serious bug that the MD5 is ignored.

It's reasonable that the current implementation of the manager doesn't support specifying digests for uploads, but it's not okay that it silently ignores the digest under some circumstances.

I believe the field should either be removed or cause an error in the case where the manager is going to split it into parts and ignore it. Its presence in the input struct implies that the upload will not succeed if this doesn't match, and it's a foot-nuke that this likely assumption isn't true.

eriksw avatar Dec 29 '21 17:12 eriksw

At a minimum, better documentation is needed for that field, stating that the MD5 value will not be used if the object being uploaded is larger than the upload manager's part size (i.e. multi-part upload will be performed). The v2 SDK's transfer manager re-uses the PutObject API call's input type, PutObjectInput. ContentMD5 is valid for a PutObject request, but not for a multi-part upload; it is still honored when multi-part upload is not used.

While investigating this, it looks like the v1 SDK had a customization papering over this issue by always computing the UploadPart's ContentMD5 parameter. The v1 SDK still ignored the ContentMD5 input parameter for multi-part uploads, but would add its own ContentMD5 under the hood.

An updated transfer manager should handle this more obviously. Adding documentation to the transfer manager might help, but I think a better fix is closer to your suggestion: the transfer manager needs a more deliberate way of handling a provided ContentMD5 depending on whether a single-part or multipart upload is performed.

jasdel avatar Dec 29 '21 18:12 jasdel

The v2 SDK's transfer manager re-uses the PutObject API call's input type, PutObjectInput

It's a shame that this wasn't considered before the major version was tagged. I would hope that the next major version makes an effort to avoid re-using structs in contexts where there's the possibility of situations like this (a field where the intent won't always be honored).


As far as a better upload manager goes, what would be fantastic is an interface where the issue of splitting is inherently brought to the caller's awareness, such as: if you want multipart uploading AND to set the digest, then you need to provide a slice of digests. Otherwise, explicitly choose one or the other. (Same for SHA-256s.)

eriksw avatar Dec 29 '21 19:12 eriksw

Thanks for reporting this issue and providing feedback. The latest version of the SDK now includes more descriptive documentation of the Uploader's handling of ContentMD5.

The SDK now supports checking object integrity, which allows you to configure the SDK to automatically compute checksums of the payload being uploaded, using the algorithm of your choice, via the ChecksumAlgorithm input parameter on the PutObject and UploadPart operations. For downloading objects, the GetObject operation has a new input parameter, ChecksumMode, that allows you to enable the SDK to validate the checksum of the object downloaded from Amazon S3.

jasdel avatar Feb 25 '22 17:02 jasdel

@jasdel Thanks for the update! In our case, we have an API that allows users to essentially upload to S3, and we'd like to allow them to specify a checksum for integrity checking (not just our service to S3, but end-to-end including the upload to our service).

If I understand correctly, this is still not reasonably possible—for users to calculate the checksum on their end, they'd need knowledge of the part size we configured, correct?

Is there any way to get a normal checksum of the entire object after the upload rather than the "checksum of checksums", even if it's after the upload finished?

Airblader avatar Mar 01 '22 13:03 Airblader

The V2 SDK can now automatically compute the checksums for individual parts as they are uploaded to S3. When the upload is complete, a "checksum of checksums" will be included in the UploadOutput. Unlike Content-MD5, the new algorithm checksums can be computed locally as the object is uploaded, but yes, this does require knowledge of the part sizes being used for the upload.

Due to how the checksums are handled by the service, multi-part objects will not have a "total object" checksum, only a checksum of the individual part checksums.

jasdel avatar Mar 23 '22 20:03 jasdel

Thank you for the confirmation, @jasdel. What about this?

Is there any way to get a normal checksum of the entire object after the upload rather than the "checksum of checksums", even if it's after the upload finished?

(Of course with the addition of "without having to fetch the entire object and compute it locally").

Airblader avatar Mar 24 '22 07:03 Airblader

Is there any way to get a normal checksum of the entire object after the upload rather than the "checksum of checksums", even if it's after the upload finished?

No, not directly in the SDK, especially with the multipart uploader. Using PutObject will upload the content as a single object, but you'll lose out on concurrent part uploads and will run into the maximum single-object size limitation.

jasdel avatar Mar 24 '22 17:03 jasdel

I don't think this solves the use case: Specifically, the purpose of setting a content hash that is already known is that I need a guarantee that the upload will fail if the contents in the bucket don't wind up matching that.

If there's a MD5 field in the input to the Upload function, all I care about is that there is no way you could ever wind up with a file in the bucket that if I downloaded it would have something other than that MD5.

Given the realities of multipart upload, doesn't that imply there should not be an MD5 field at all? Shouldn't there instead be a field for me to specify the hash-of-hashes, or whatever it is, instead?

eriksw avatar Mar 24 '22 17:03 eriksw

I don't think this solves the use case

Yes, I feel the same way. We essentially still have no way to enforce data consistency without having to expose configuration internals and duplicating the algorithm on how to calculate the checksum.

Airblader avatar Mar 24 '22 17:03 Airblader

Please don't forget about the "This bug also means that it is not possible to use an Uploader to upload large objects to buckets with default object lock" aspect of this. Or should someone open a separate issue for that?

vangent avatar May 10 '23 17:05 vangent