AWS backend - etag differences in copied files lead to cache misses
In at least one example (I have the workflow & Broad-accessible inputs if someone needs them) the following leads to a cache miss:
- Workflow A, Task A: output is multipart-uploaded to S3 and gets a multipart-style ETag to match
- Workflow A, Task B: consumes this file as an input
- Workflow B, Task A: cache hit; copies that file, but the copy receives a plain MD5 ETag
- Workflow B, Task B: cache miss; the two ETags no longer match (the sketch below shows why the two formats can't agree)
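For anyone following along, here is a minimal sketch (mine, not Cromwell code) of how a multipart-style ETag is conventionally derived: MD5 each part, MD5 the concatenation of those digests, and append the part count. The function name is hypothetical and the 8 MiB part size is an assumption matching the AWS CLI default; a copy made with a different part size, or as a single-part upload, produces a different string, which is the mismatch described above.

```scala
import java.io.FileInputStream
import java.security.MessageDigest
import scala.collection.mutable.ListBuffer

// Sketch only: reproduce the "md5-of-part-md5s" + "-<partCount>" shape of a multipart ETag.
// The 8 MiB part size is an assumption (AWS CLI default), not something Cromwell guarantees.
def multipartStyleEtag(path: String, partSize: Int = 8 * 1024 * 1024): String = {
  val in = new FileInputStream(path)
  val buf = new Array[Byte](partSize)
  val partDigests = ListBuffer.empty[Array[Byte]]
  try {
    var n = in.readNBytes(buf, 0, partSize)
    while (n > 0) {
      partDigests += MessageDigest.getInstance("MD5").digest(java.util.Arrays.copyOf(buf, n))
      n = in.readNBytes(buf, 0, partSize)
    }
  } finally in.close()

  def hex(bytes: Array[Byte]): String = bytes.map(b => f"${b & 0xff}%02x").mkString

  partDigests.toList match {
    case Nil           => hex(MessageDigest.getInstance("MD5").digest(Array.empty[Byte])) // empty object
    case single :: Nil => hex(single) // single part: plain MD5, no "-N" suffix
    case parts =>
      // Multipart: MD5 of the concatenated part digests, suffixed with the part count.
      hex(MessageDigest.getInstance("MD5").digest(parts.flatten.toArray)) + s"-${parts.size}"
  }
}
```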
It seems highly likely that a solution to this would also solve #4805
From the docs, emphasis mine:
> After a successful complete request, the parts no longer exist. Your complete multipart upload request must include the upload ID and a list of both part numbers and corresponding ETag values. Amazon S3 response includes an ETag that uniquely identifies the combined object data. **This ETag will not necessarily be an MD5 hash of the object data.**
Thinking out loud: Do we have to recreate the original multipart upload to get the same ETag deterministically? Or otherwise how can we get the combined ETag deterministically?
If not, we can use UploadPartCopyRequest to accomplish this here.
- Example multipart copy using the old SDK
- Migration guide from 1.1 to 2.0 (to interpret the above in 2.0)
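For reference, a hedged sketch of what a multipart copy via UploadPartCopy might look like against the 2.x SDK. This is not Cromwell's implementation; the helper name, bucket/key names, and the 5 GiB part size are placeholders, and the sourceBucket/sourceKey/destinationBucket/destinationKey builder methods are the newer spellings (older 2.x releases use copySource plus bucket/key instead).

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model._
import scala.jdk.CollectionConverters._

// Hypothetical helper: copy src -> dst in fixed-size parts so objects larger than 5 GiB
// (the single-request CopyObject limit) can still be copied server-side.
def multipartCopy(s3: S3Client,
                  srcBucket: String, srcKey: String,
                  dstBucket: String, dstKey: String,
                  partSize: Long = 5L * 1024 * 1024 * 1024): Unit = {

  val size: Long = s3.headObject(
    HeadObjectRequest.builder().bucket(srcBucket).key(srcKey).build()
  ).contentLength()

  val uploadId = s3.createMultipartUpload(
    CreateMultipartUploadRequest.builder().bucket(dstBucket).key(dstKey).build()
  ).uploadId()

  // One UploadPartCopy per byte range of the source object.
  val completedParts = (0L until size by partSize).zipWithIndex.map { case (start, idx) =>
    val end = math.min(start + partSize, size) - 1
    val resp = s3.uploadPartCopy(
      UploadPartCopyRequest.builder()
        .sourceBucket(srcBucket).sourceKey(srcKey) // older 2.x SDKs: .copySource(s"$srcBucket/$srcKey")
        .destinationBucket(dstBucket).destinationKey(dstKey)
        .uploadId(uploadId)
        .partNumber(idx + 1)
        .copySourceRange(s"bytes=$start-$end")
        .build()
    )
    CompletedPart.builder().partNumber(idx + 1).eTag(resp.copyPartResult().eTag()).build()
  }

  s3.completeMultipartUpload(
    CompleteMultipartUploadRequest.builder()
      .bucket(dstBucket).key(dstKey).uploadId(uploadId)
      .multipartUpload(CompletedMultipartUpload.builder().parts(completedParts.asJava).build())
      .build()
  )
}
```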
We've implemented a fix on the Cromwell side of things to work with the AWS CLI's default settings for S3.
Those defaults are good for objects up to about 84 GB (the 8 MiB default part size × the 10,000-part limit); the old limit was 5 GiB.
Above that, the AWS CLI has to raise the part size beyond the default 8 MiB, and it's not clear what logic it uses to do so.
What we want to do from here forward
The proxy should use larger part sizes all the time, ideally the maximum part size of 5 GiB. With the AWS limit of 10,000 parts, that gives a new ceiling of roughly 49 TiB (about 54 TB).
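To spell out the arithmetic behind those figures (the 10,000-part and 5 GiB-per-part values are the documented S3 limits; the byte math is mine):

```scala
// Documented S3 limits: at most 10,000 parts per upload, each part at most 5 GiB.
val maxParts       = 10000L
val cliDefaultPart = 8L * 1024 * 1024        // AWS CLI default part size: 8 MiB
val maxAllowedPart = 5L * 1024 * 1024 * 1024 // largest allowed part: 5 GiB

val cliCeiling = maxParts * cliDefaultPart   // ~83.9 GB (~78 GiB) with default settings
val maxCeiling = maxParts * maxAllowedPart   // ~53.7 TB (~48.8 TiB) with 5 GiB parts
```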
We attempted to alter the proxy to use this part size and to set the multipart threshold to 5 GiB, but it broke the proxy. To find out what happened we need to investigate the proxy logs.
I think two things here were premature:
a) Me saying "we've implemented" as the branch was not merged.
b) Marking this ticket as "Done"
As this happened 6 months ago I'm a little hazy on the details, but as I noted above the code we wrote "broke the proxy", and apparently it wasn't clear why, since I also noted that an investigation was in order.
I will do my best to find that branch and surface it here.
I found this branch that we ultimately reverted due to it breaking the proxy.
Hi, is this implemented in the latest version of Cromwell? I am getting the following error for files > 5 GB with the latest version:
2019-10-31 04:31:17,243 cromwell-system-akka.dispatchers.engine-dispatcher-32 WARN - 85d92e7d-3017-4e8d-adac-551ebcd50165-EngineJobExecutionActor-jgi_meta.bbcms:NA:1 [UUID(85d92e7d)]: Failed copying cache results for job BackendJobDescriptorKey_CommandCallNode_jgi_meta.bbcms:-1:1 (EnhancedCromwellIoException: [Attempted 1 time(s)] - S3Exception: The specified copy source is larger than the maximum allowable size for a copy source: 5368709120 (Service: S3, Status Code: 400, Request ID: 1272B7BFF87110E8)), invalidating cache entry.
Is there any progress on this? It effectively disables call-caching when running on AWS Batch, and it would be great to get a fix for this in upcoming versions.
There hasn't been, unfortunately. This is sort of a desperate thought, but maybe you can try to see if you can cache via file path instead of file hash (https://cromwell.readthedocs.io/en/develop/Configuring/#local-filesystem-options). I don't know if it's configured to work for AWS, but "hashing-strategy" could be set to path -- though it may only ever work for the local filesystem and not S3.
# Possible values: file, path, path+modtime
# "file" will compute an md5 hash of the file content.
# "path" will compute an md5 hash of the file path. This strategy will only be effective if the duplication-strategy (above) is set to "soft-link",
# in order to allow for the original file path to be hashed.
# "path+modtime" will compute an md5 hash of the file path and the last modified time. The same conditions as for "path" apply here.
# Default: file
hashing-strategy: "file"
Cromwell-59 still has this problem. A `cacheCopy` of any file >= 5 MB always gets a different ETag, so call-caching is effectively disabled.
The original file's ETag has no `-` in it, i.e. it was not multipart-uploaded:
$ aws s3api head-object --bucket encode-processing --key test-copy-etag/ENCFF641SFZ.trim.srt.nodup.no_chrM_MT.tn5.tagAlign.gz
{
"AcceptRanges": "bytes",
"LastModified": "Wed, 26 May 2021 18:26:37 GMT",
"ContentLength": 164184869,
"ETag": "\"0502111c7c676115303cca9931c2769b\"",
"ContentType": "binary/octet-stream",
"ServerSideEncryption": "AES256",
"Metadata": {}
}
Tried to copy it to a temp location (mimicking `cacheCopy`):
$ aws s3 cp s3://encode-processing/caper_out_v052521/atac/d249d56c-fcdb-4916-a830-2c191920d877/call-bam2ta/shard-0/glob-199637d3015dccbe277f621a18be9eb4/ENCFF641SFZ.trim.srt.nodup.no_chrM_MT.tn5.tagAlign.gz s3://encode-processing/test-copy-etag/
copy: s3://encode-processing/caper_out_v052521/atac/d249d56c-fcdb-4916-a830-2c191920d877/call-bam2ta/shard-0/glob-199637d3015dccbe277f621a18be9eb4/ENCFF641SFZ.trim.srt.nodup.no_chrM_MT.tn5.tagAlign.gz to s3://encode-processing/test-copy-etag/ENCFF641SFZ.trim.srt.nodup.no_chrM_MT.tn5.tagAlign.gz
$ aws s3api head-object --bucket encode-processing --key test-copy-etag/ENCFF641SFZ.trim.srt.nodup.no_chrM_MT.tn5.tagAlign.gz
{
"AcceptRanges": "bytes",
"LastModified": "Wed, 26 May 2021 18:27:41 GMT",
"ContentLength": 164184869,
"ETag": "\"a11d42b4abf4ef9d5de40183f25c520b-20\"",
"ContentType": "binary/octet-stream",
"ServerSideEncryption": "AES256",
"Metadata": {}
}
Got a different ETag, the multipart-style one with the -20 suffix shown in the snapshot above, which no longer matches the original.
If the `copy` method is used for `s3.caching.duplication-strategy`, then call-caching is effectively disabled for all files > 5 MB.
There is another bug in the AWS backend's `s3.caching.duplication-strategy` (https://github.com/broadinstitute/cromwell/issues/6327): the duplication strategy is always fixed to `copy`, so call-caching always fails on AWS for files > 5 MB.
Old question, but on AWS, try the AWS-tailored fork: https://github.com/henriqueribeiro/cromwell/blob/develop_aws/supportedBackends/aws/src/main/scala/cromwell/backend/impl/aws/README.md
This issue is handled there by correcting the multipart size in Cromwell to match the AWS defaults.
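As I understand it, the idea is to make the copier use the same part size and multipart threshold the AWS CLI uses by default, so the copy's multipart ETag comes out identical to the original upload's. A sketch of that idea only (mine, not the fork's actual code; the defaults are the ones documented for the AWS CLI):

```scala
// Mirror the AWS CLI defaults so a multipart copy reproduces the original multipart ETag.
val cliDefaultChunkSize = 8L * 1024 * 1024 // multipart_chunksize default: 8 MB
val cliDefaultThreshold = 8L * 1024 * 1024 // multipart_threshold default: 8 MB

/** Returns Some(partSize) when the copy should be multipart, None for a plain CopyObject. */
def copyPartSize(objectSize: Long): Option[Long] =
  if (objectSize >= cliDefaultThreshold) Some(cliDefaultChunkSize) else None
```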