File integrity check

Maybe it would be useful if we could expose means of assessing the integrity of a file. I can think of the following scenarios:

  1. Check if a file is correctly uploaded to S3. For this, first calculate the file's checksum using a hash function (e.g., MD5, SHA256), then compare it with the S3-generated checksum. If the two hash values match, then the file was uploaded correctly; otherwise, retry uploading the file.

  2. Get a file's checksum without downloading the file (e.g., see this). A good scenario for this would be checking whether a local copy matches the S3 version; if not, then a collaborator has changed the file, hence update the local copy.

It seems boto exposes the MD5 checksum of an object; however, I guess the challenge would be to (1) figure out if/how other providers expose checksums without downloading a file/object; and (2) determine whether all the providers use a common hash function for checksums.
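
For reference, here is roughly what that looks like against S3 today, using boto3 directly rather than a cloudbridge API. The bucket, key, and file names are placeholders, and the ETag-equals-MD5 assumption only holds for objects that were not uploaded in multiple parts:

    import hashlib

    import boto3

    s3 = boto3.client("s3")

    # Fetch only the object's metadata -- the object body is not downloaded.
    head = s3.head_object(Bucket="my-bucket", Key="data/file.bin")
    remote_md5 = head["ETag"].strip('"')  # S3 wraps the ETag in quotes

    # Compute the MD5 of the local copy in fixed-size blocks to bound memory use.
    md5 = hashlib.md5()
    with open("file.bin", "rb") as f:
        for block in iter(lambda: f.read(8 * 1024 * 1024), b""):
            md5.update(block)
    local_md5 = md5.hexdigest()

    # A mismatch means the local copy is stale (or the object was uploaded in parts).
    print("match" if local_md5 == remote_md5 else "mismatch")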

VJalili, Sep 19 '17 06:09

I think this would be a really nice feature to have. There are several hurdles to overcome.

  1. The first one is the issue that you highlighted: do all providers expose the same checksum? They do appear to if the file is not chunked, as far as I can see. The ETag exposes the MD5 sum. However, for chunked files, it's a mixed bag: aws, azure, gce, openstack

  2. The checksum for large, chunked objects is dependent on the chunk size, as each chunk is hashed separately. This means that the user must provide the original chunk size that was used, or no meaningful comparison can happen. This is probably not too big a deal if all objects were uploaded using cloudbridge, since we can use a fixed CHUNK_SIZE per provider. If not, we'd have to have the user provide it as a parameter. One good thing is that all the providers do appear to have a mechanism for getting the hashes of individual chunks without downloading the file, at least at first glance. (A sketch of reproducing this kind of chunked scheme for a local file follows this list.)

  3. The first and second issues imply that cloudbridge must also provide a function for hashing the local file, since the hashing will be provider specific. Something like

    # Get the hash of a remote bucket object
    remote_hash = BucketObject.get_hash()

    # Compute the hash of a local file
    local_hash = BucketObject.compute_hash(local_file_path, chunk_size=some_provider_default)

    # Compare the two
    matches = (remote_hash == local_hash)
  4. We can avoid having to provide a local compute_hash function if we assume that the user must cache the original hash somewhere, rather than recomputing it. This would probably be desirable initially, since it would reduce complexity considerably. However, that won't help in scenario 1 you highlighted - checking whether an object was uploaded correctly.
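
To make points 2 and 3 concrete, here is a rough, hypothetical compute_hash for local files that mimics the AWS-style chunked scheme (MD5 each chunk, MD5 the concatenated digests, append the chunk count). The function name, signature, and default chunk size are made up for illustration, and other providers would need their own variants:

    import hashlib

    def compute_hash(local_file_path, chunk_size=32 * 1024 * 1024):
        # Hash each chunk separately, mirroring how chunked uploads are hashed.
        chunk_digests = []
        with open(local_file_path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                chunk_digests.append(hashlib.md5(chunk).digest())
        if len(chunk_digests) <= 1:
            # Zero or one chunk: a single-part upload, whose checksum is
            # simply the MD5 of the file contents.
            return chunk_digests[0].hex() if chunk_digests else hashlib.md5(b"").hexdigest()
        # Multiple chunks: MD5 of the concatenated per-chunk digests,
        # suffixed with the chunk count (e.g. "...-13").
        combined = hashlib.md5(b"".join(chunk_digests)).hexdigest()
        return "{}-{}".format(combined, len(chunk_digests))

A comparison like remote_hash == local_hash is then only meaningful when chunk_size matches what was used at upload time, which is exactly why a fixed per-provider CHUNK_SIZE (or a user-supplied value) matters.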

nuwang, Sep 19 '17 07:09

  1. The first one is the issue that you highlighted: do all providers expose the same checksum? They do appear to if the file is not chunked, as far as I can see. The ETag exposes the MD5 sum. However, for chunked files, it's a mixed bag: aws, azure, gce, openstack

This is gonna be tricky :(

  2. The checksum for large, chunked objects is dependent on the chunk size, as each chunk is hashed separately. This means that the user must provide the original chunk size that was used, or no meaningful comparison can happen. This is probably not too big a deal if all objects were uploaded using cloudbridge, since we can use a fixed CHUNK_SIZE per provider. If not, we'd have to have the user provide it as a parameter. One good thing is that all the providers do appear to have a mechanism for getting the hashes of individual chunks without downloading the file, at least at first glance.

Let me highlight two points: (a) I think we had better expose one checksum regardless of the file size; I guess we get one checksum for an object from S3 no matter what the size is. (b) It is a very strong assumption that the file was uploaded by cloudbridge. I'm assuming that cloudbridge can download any object from S3 no matter how the object was uploaded. If this assumption is correct, then we should try not to invalidate it for some functionalities (e.g., getting a checksum).

  3. The first and second issues imply that cloudbridge must also provide a function for hashing the local file, since the hashing will be provider specific.

I would say it would be simpler and more logical to just expose the hash key we're getting for an object from the provider, and avoid being in the business of creating such hash keys (not even at the level of a simple interface to hashlib), at least for the time being.

  4. We can avoid having to provide a local compute_hash function if we assume that the user must cache the original hash somewhere, rather than recomputing it. This would probably be desirable initially, since it would reduce complexity considerably. However, that won't help in scenario 1 you highlighted - checking whether an object was uploaded correctly.

Similar to my previous comment, let's avoid caching checksums of uploaded/downloaded files at the cloudbridge level. I guess cloudbridge does not cache/store any attributes of uploaded/downloaded files.

VJalili, Sep 19 '17 08:09

(a) I think we had better expose one checksum regardless of the file size; I guess we get one checksum for an object from S3 no matter what the size is.

Even for chunked objects, I think a single checksum is exposed by the providers (that single checksum appears to be calculated by in turn hashing the checksums of the individual chunks, but we only need to care about that if we have to reproduce the checksum ourselves, I guess).
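
For S3 at least, the two cases can be told apart from the checksum itself, since a composite ETag carries a "-<part count>" suffix. A small sketch (S3-specific, not a cloudbridge API; the example ETags are arbitrary):

    import re

    PLAIN_MD5_ETAG = re.compile(r"^[0-9a-f]{32}$")
    MULTIPART_ETAG = re.compile(r"^[0-9a-f]{32}-(\d+)$")

    def classify_etag(etag):
        etag = etag.strip('"')  # S3 returns ETags wrapped in quotes
        if PLAIN_MD5_ETAG.match(etag):
            return "plain-md5", 1
        match = MULTIPART_ETAG.match(etag)
        if match:
            return "multipart", int(match.group(1))
        return "unknown", 0

    print(classify_etag('"9bb58f26192e4ba00f01e2e7b136bbd8"'))    # ('plain-md5', 1)
    print(classify_etag('"d41d8cd98f00b204e9800998ecf8427e-12"')) # ('multipart', 12)

Even then, the part count alone does not reveal the chunk size, so reproducing a composite checksum locally still needs that extra piece of information.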

I'm assuming that cloudbridge can download any object from S3 no matter how the object was uploaded. If this assumption is correct, then we should try not to invalidate it for some functionalities (e.g., getting a checksum).

Yes, that's correct, and I agree.

Similar to my previous comment, let's avoid caching checksums of uploaded/downloaded files at the cloudbridge level. I guess cloudbridge does not cache/store any attributes of uploaded/downloaded files.

I agree that checksums should not be cached by cloudbridge, nor can they be. Therefore, if a cloudbridge client/user wants to compare a local file with a remote file, they must either maintain a cached checksum themselves, or we must provide a mechanism for computing that checksum, as far as I can see.

For the scenario where the user wants to validate that an upload has been performed with no data errors, that could be handled separately, say by providing a verify=True/False parameter in the upload function. Each provider can then do its own verification.

nuwang, Sep 19 '17 09:09

For the scenario where the user wants to validate that an upload has been performed with no data errors, that could be handled separately, say by providing a verify=True/False parameter in the upload function. Each provider can then do its own verification.

It would be awesome if we could implement "upload verification" internally by leveraging checksums: we compute a checksum of the file before uploading, then compare it with the provider-reported checksum; if they match, the upload function returns true, otherwise it returns false. Accordingly, the user does not need to know what mechanism we're using to verify the integrity of the upload, but can trust the verification :)
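
For concreteness, here is roughly what that flow could look like for S3, using boto3 directly and a simple non-multipart put. The function name, signature, and retry loop are made up for illustration and are not part of cloudbridge or boto3:

    import hashlib

    import boto3

    def upload_verified(s3, local_path, bucket, key, retries=3):
        with open(local_path, "rb") as f:
            data = f.read()  # fine for small files; large files would be streamed
        local_md5 = hashlib.md5(data).hexdigest()
        for _ in range(retries):
            response = s3.put_object(Bucket=bucket, Key=key, Body=data)
            remote_md5 = response["ETag"].strip('"')
            if remote_md5 == local_md5:
                return True   # checksums match: upload verified
        return False          # still mismatching after exhausting retries

    s3 = boto3.client("s3")
    # upload_verified(s3, "file.bin", "my-bucket", "data/file.bin")

A chunked/multipart upload would need the composite-checksum comparison discussed earlier, and for AWS specifically boto may already be verifying each part anyway (see the next comment).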

VJalili, Sep 19 '17 16:09

It looks like boto is verifying each chunk already, so we may not need to do anything for AWS. Swift does the same and supports a checksum=True/False parameter to control it, but it's on by default (relevant line of code). I'm guessing AWS and GCE are likely to be the same.

nuwang, Sep 20 '17 07:09

That solves the integrity verification of the uploaded file. So, we just need to inform the user (in the documentation, maybe) that upload verification is checksum-based. However, we will still need to expose the checksum of an uploaded file.

VJalili, Sep 20 '17 13:09