s3cmd icon indicating copy to clipboard operation
s3cmd copied to clipboard

Add config to avoid checking MD5 on get and put

Open gaul opened this issue 8 years ago • 7 comments

Overload existing --no-check-md5. Object stores like S3Proxy cannot return an MD5 ETag using some providers:

  • Atmos returns nothing
  • Azure returns an opaque ETag
  • B2 returns a SHA1

gaul avatar Oct 30 '17 00:10 gaul

Thanks for the PR. Looking at all the difference regarding the "etag" handling, maybe we can do something more future proo with a little change to your PR?

Let's remove the relation to no-check-md5, and create a new config key like: etag_type, etag_mode, ... Its default value could be: etag_mode=md5 (Having the same behavior as the existing one of s3cmd)

And in your case, we can use a value like: etag_mode=none (or etag_mode=) (or etag_mode=other) (Having a behavior that ignores etag, that is what you want with your pr i think).

With that, in the future, we could have something like: etag_mode=sha1 to treat sha1 etags like B2 and so.

fviard avatar Oct 30 '17 11:10 fviard

Disabling ETag is really the only sane choice. Some background, S3Proxy builds on top of Apache jclouds, a cross-cloud library, and jclouds returns a String from putBlob calls. This correlates with ETag for most providers, but jclouds returns the object name for Atmos and the SHA-1 for B2. One could argue these should return nothing since they cannot be meaningfully interpreted or used for things like conditional gets. Azure actually returns its ETag, but it is not an MD5. Thus there really is no precedent for non-MD5 ETags and adding extra ETag modes to s3cmd supports a very small population.

gaul avatar Oct 30 '17 18:10 gaul

Most providers implementing the Amazon s3 protocol, are respecting the aws s3 behavior. The issue is more with providers doing gateway with other protocols that not necessarily have the data. But these are not the main ones. So, in my opinion, that is a lot better to try to support as much have them, maybe by having custom behaviors. (And probably one setting in the case where etag is not supported or we don't know the mode)

I say "md5", but we can say "md5s3". Even if they allows themselves to change in the future, the algo is pretty deterministic and never changed so far. There is even some way to check the md5 for multipart files if you have a "fixed" block size.

Anyway, why the etag is useful?

  • First, like in the present case, that can validate that the received/stored file is not corrupted.
  • Using that makes "sync" a lot more efficient, because as there is no checksum or custom header in the default set of headers returned with a list, it means that we have to issue individual HEAD requests for each file to know his checksum to determine if it has to be uploaded or not. With the ETag, we can do that only for multipart files, and a smart server could be able to take advantage of that to have the multi part files also supported.

fviard avatar Oct 30 '17 18:10 fviard

HTTPS should provide sufficient protection against corruption. The Azure ETag is opaque and includes some kind of timestamp it seems. The other two providers do not actually return an ETag header and the S3Proxy/jclouds behavior is bogus and likely should not return any header. Thus there is no work to do here except adding an ignore flag as this pull request does.

gaul avatar Oct 30 '17 18:10 gaul

In my opinion, in such a case, https is not necessarily sufficient. If we assume that the etag has a format that can be used, like the one with aws:

  1. Nothing prevents to use a simple http connection (and not an https), that will be especially the case with simple s3 clones used on local networks.
  2. https integrity check is not enough. Most services like aws, azure, ... have an https gateway as a frontend on the internet, and then the request is forwarded to different servers/services on the internal network. Normally, what is put in the etag by most services is dependent on the really stored file.

Regarding the patch, in my opinion it is a new config just to deal with specific lines that conflict with servers that have a different behavior. But nothing that would be really clear for an user or future proof. Like, if you consider that the etag is not reliable for some services, why still letting it be used for "sync" file changes calculus, and so.

So, that's why i think that adding something more generic stating what is the supposed meaning of the ETag for a specific service could fix your issue and even have an added value. (One of its value can be "None", "unknown", "garbage" thus giving the hint to s3cmd that it should not be used anywhere, even allowing to indicate to the user why his "sync" could be slower than with another service)

fviard avatar Nov 01 '17 16:11 fviard

Checking ETag is not too useful given that most interesting uploads use multipart which also does not have a meaningful value. Most S3 tools which check ETag have some way to disable it, for example, aws/aws-sdk-java#560. If you will not accept this pull request as-is, please close it.

gaul avatar Nov 01 '17 17:11 gaul

ETag check is ignored automatically if you add <unique-value>-N at the end @gaul @fviard

harshavardhana avatar Aug 07 '18 16:08 harshavardhana