
Upload is slower for very large files because of md5 calculation

Open onilton opened this issue 11 years ago • 2 comments

For large file uploads, s4cmd takes too long, mainly because of the file_hash (md5) calculation.

This is the line responsible for it:

mpu = bucket.initiate_multipart_upload(s3url.path, metadata = {'md5': self.file_hash(source), 'privilege': self.get_file_privilege(source)})

https://github.com/bloomreach/s4cmd/blob/master/s4cmd.py#L1043

I've read in the AWS documentation that this md5 metadata is not necessary; I suspect this md5 calculation exists mainly because the local md5 of a file differs from the ETag S3 generates for multipart uploads.

It is used in sync_check: ('md5' in remoteKey.metadata and remoteKey.metadata['md5'] == localmd5)

It is useful mostly for calls that use the sync option.

But sometimes, like in my case, the user of s4cmd may not care about sync, since they may be using the -f (force) option.

I thought about two possible solutions:

  • An argument to simply disable this file_hash execution, like "--disable-multipart-meta-md5"
  • The second one is more ambitious, and I don't know if it is really possible: don't calculate the md5 locally, but use S3's generated ETag for the comparison. (We would need to reproduce S3's ETag calculation.)

S3 calculates the ETag for multipart uploads as the md5 of the concatenated md5 digests of the individual parts. The algorithm is described in these links:

http://stackoverflow.com/questions/6591047/etag-definition-changed-in-amazon-s3
http://permalink.gmane.org/gmane.comp.file-systems.s3.s3tools/583
http://stackoverflow.com/questions/12186993/what-is-the-algorithm-to-compute-the-amazon-s3-etag-for-a-file-larger-than-5gb
https://forums.aws.amazon.com/thread.jspa?messageID=203510
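As a sketch of the algorithm described in those links (the function name and structure are mine, not from s4cmd, and it assumes all parts except possibly the last have the same size):

```python
import hashlib

def multipart_etag(path, split_size):
    """Recompute the ETag S3 assigns to a multipart upload:
    the md5 of the concatenated binary md5 digests of each part,
    suffixed with -<number_of_parts>."""
    part_digests = []
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(split_size)
            if not chunk:
                break
            part_digests.append(hashlib.md5(chunk).digest())
    combined = hashlib.md5(b''.join(part_digests))
    return '%s-%d' % (combined.hexdigest(), len(part_digests))
```

So given the right split_size, the local file's multipart ETag can be reproduced without storing any extra md5 metadata.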

The only thing missing to recalculate S3's ETag for a multipart upload is the SPLIT_SIZE.

But S3's ETag for the file has a suffix -NUMBER_OF_PARTS:

s3cmd --recursive --list-md5 ls s3://YOURBUCKET/YOURDIR/ 
2015-02-13 05:02 9791602688   f938b15b2edef7d2c23542814bdcb6af-187  s3://FILEPATH

I guess with this info (file size and number of parts), we could calculate the ETag the same way Amazon S3 does.

In the example above:

parts = 187
file_size = to_mb(9791602688.0)  # 9338.0 MB

SPLIT_SIZE = ((file_size - (file_size % parts)) / parts) + 1
SPLIT_SIZE = 50  # MB
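The derivation above can be written as a small runnable sketch (the function name is mine; it assumes 1 MB = 1048576 bytes and equal-sized parts except possibly the last):

```python
def derive_split_size_mb(file_size_bytes, parts):
    """Derive the per-part size (in MB) of a multipart upload from the
    total object size and the part count taken from the ETag's -N suffix."""
    size_mb = file_size_bytes / (1024.0 * 1024.0)  # bytes -> MB
    # (size - remainder) / parts is the floor of size/parts;
    # the actual part size is one MB above that floor
    return int((size_mb - (size_mb % parts)) / parts) + 1

# For the listing above: 9791602688 bytes split into 187 parts
print(derive_split_size_mb(9791602688, 187))  # -> 50
```

With the split size recovered this way, the local multipart ETag could be recomputed and compared against the one S3 reports, with no extra md5 metadata needed at upload time.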

onilton avatar Feb 13 '15 22:02 onilton

Thanks for the information. I couldn't find information about the new ETag format a year ago.

We can definitely fix the check for multipart uploads. But the core issue, as you pointed out, is the md5 check for large files. Bypassing the md5 check is one option: by setting --sync-check to false, we ignore the md5 checks altogether (except for the sync command). It is a hard trade-off, since md5 checks are supposed to save time by skipping transfers, but in the case of very large files the cost seems unavoidable.

chouhanyang avatar Feb 17 '15 16:02 chouhanyang

My use case is dumping backups to S3. I have relied on this flag in s3cmd:

  --no-check-md5        Do not check MD5 sums when comparing files for [sync].
                        Only size will be compared. May significantly speed up
                        transfer but may also miss some changed files.

Ideal might be to compare timestamps only, but this could be weird on S3. Even reading the LOCAL files to compute md5 is crazy expensive. :/

dannyman avatar Aug 27 '16 00:08 dannyman