AWS backend - etag differences in copied files lead to cache misses
In at least one example (I have the workflow & Broad-accessible inputs if someone needs them) the following leads to a cache miss:
- Workflow A, Task A: output is multipart-uploaded to S3 and gets a multipart-style ETag to match
- Workflow A, Task B: consumes this file as an input
- Workflow B, Task A: cache hit; copies that file, but the copy receives a plain MD5 ETag
- Workflow B, Task B: cache miss; the two ETags no longer match (the sketch below shows why the two formats can't agree)
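For anyone following along, here is a minimal sketch (mine, not Cromwell code) of how a multipart-style ETag is conventionally derived: MD5 each part, MD5 the concatenation of those digests, and append the part count. The function name is hypothetical and the 8 MiB part size is an assumption matching the AWS CLI default; a copy made with a different part size, or as a single-part upload, produces a different string, which is the mismatch described above.

```scala
import java.io.FileInputStream
import java.security.MessageDigest
import scala.collection.mutable.ListBuffer

// Sketch only: reproduce the "md5-of-part-md5s" + "-<partCount>" shape of a multipart ETag.
// The 8 MiB part size is an assumption (AWS CLI default), not something Cromwell guarantees.
def multipartStyleEtag(path: String, partSize: Int = 8 * 1024 * 1024): String = {
  val in = new FileInputStream(path)
  val buf = new Array[Byte](partSize)
  val partDigests = ListBuffer.empty[Array[Byte]]
  try {
    var n = in.readNBytes(buf, 0, partSize)
    while (n > 0) {
      partDigests += MessageDigest.getInstance("MD5").digest(java.util.Arrays.copyOf(buf, n))
      n = in.readNBytes(buf, 0, partSize)
    }
  } finally in.close()

  def hex(bytes: Array[Byte]): String = bytes.map(b => f"${b & 0xff}%02x").mkString

  partDigests.toList match {
    case Nil           => hex(MessageDigest.getInstance("MD5").digest(Array.empty[Byte])) // empty object
    case single :: Nil => hex(single) // single part: plain MD5, no "-N" suffix
    case parts =>
      // Multipart: MD5 of the concatenated part digests, suffixed with the part count.
      hex(MessageDigest.getInstance("MD5").digest(parts.flatten.toArray)) + s"-${parts.size}"
  }
}
```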
It seems highly likely that a solution to this would also solve #4805
From the docs, emphasis mine:
> After a successful complete request, the parts no longer exist. Your complete multipart upload request must include the upload ID and a list of both part numbers and corresponding ETag values. Amazon S3 response includes an ETag that uniquely identifies the combined object data. **This ETag will not necessarily be an MD5 hash of the object data.**
Thinking out loud: Do we have to recreate the original multipart upload to get the same ETag deterministically? Or otherwise how can we get the combined ETag deterministically?
If not, we can use UploadPartCopyRequest to accomplish this here.
- Example multipart copy using the old SDK
- Migration guide from 1.1 to 2.0 (to interpret the above in 2.0)
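For reference, a hedged sketch of what a multipart copy via UploadPartCopy might look like against the 2.x SDK. This is not Cromwell's implementation; the helper name, bucket/key names, and the 5 GiB part size are placeholders, and the sourceBucket/sourceKey/destinationBucket/destinationKey builder methods are the newer spellings (older 2.x releases use copySource plus bucket/key instead).

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model._
import scala.jdk.CollectionConverters._

// Hypothetical helper: copy src -> dst in fixed-size parts so objects larger than 5 GiB
// (the single-request CopyObject limit) can still be copied server-side.
def multipartCopy(s3: S3Client,
                  srcBucket: String, srcKey: String,
                  dstBucket: String, dstKey: String,
                  partSize: Long = 5L * 1024 * 1024 * 1024): Unit = {

  val size: Long = s3.headObject(
    HeadObjectRequest.builder().bucket(srcBucket).key(srcKey).build()
  ).contentLength()

  val uploadId = s3.createMultipartUpload(
    CreateMultipartUploadRequest.builder().bucket(dstBucket).key(dstKey).build()
  ).uploadId()

  // One UploadPartCopy per byte range of the source object.
  val completedParts = (0L until size by partSize).zipWithIndex.map { case (start, idx) =>
    val end = math.min(start + partSize, size) - 1
    val resp = s3.uploadPartCopy(
      UploadPartCopyRequest.builder()
        .sourceBucket(srcBucket).sourceKey(srcKey) // older 2.x SDKs: .copySource(s"$srcBucket/$srcKey")
        .destinationBucket(dstBucket).destinationKey(dstKey)
        .uploadId(uploadId)
        .partNumber(idx + 1)
        .copySourceRange(s"bytes=$start-$end")
        .build()
    )
    CompletedPart.builder().partNumber(idx + 1).eTag(resp.copyPartResult().eTag()).build()
  }

  s3.completeMultipartUpload(
    CompleteMultipartUploadRequest.builder()
      .bucket(dstBucket).key(dstKey).uploadId(uploadId)
      .multipartUpload(CompletedMultipartUpload.builder().parts(completedParts.asJava).build())
      .build()
  )
}
```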
We've implemented a fix on the Cromwell side of things to work with the AWS CLI's default settings for S3.
Those defaults are good for objects up to about 84 GB (the 8 MiB default part size × the 10,000-part limit); the old limit was 5 GiB.
Above that, the AWS CLI has to raise the part size beyond the default 8 MiB, and it's not clear what logic it uses to do so.
What we want to do from here forward
The proxy should use larger part sizes all the time, ideally the maximum part size of 5 GiB. With the AWS limit of 10,000 parts, that gives a new ceiling of roughly 49 TiB (about 54 TB).
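To spell out the arithmetic behind those figures (the 10,000-part and 5 GiB-per-part values are the documented S3 limits; the byte math is mine):

```scala
// Documented S3 limits: at most 10,000 parts per upload, each part at most 5 GiB.
val maxParts       = 10000L
val cliDefaultPart = 8L * 1024 * 1024        // AWS CLI default part size: 8 MiB
val maxAllowedPart = 5L * 1024 * 1024 * 1024 // largest allowed part: 5 GiB

val cliCeiling = maxParts * cliDefaultPart   // ~83.9 GB (~78 GiB) with default settings
val maxCeiling = maxParts * maxAllowedPart   // ~53.7 TB (~48.8 TiB) with 5 GiB parts
```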
We attempted to alter the proxy to use this part size and to set the multipart threshold to 5 GiB, but it broke the proxy. To find out what happened we need to investigate the proxy logs.
I think two things here were premature:
a) Me saying "we've implemented" as the branch was not merged.
b) Marking this ticket as "Done"
As this happened 6 months ago I'm a little hazy on the details, but as I noted above the code we wrote "broke the proxy", and apparently it wasn't clear why, since I also noted that an investigation was in order.
I will do my best to find that branch and surface it here.
I found this branch that we ultimately reverted due to it breaking the proxy.
Hi, is this implemented in the latest version of Cromwell? I am getting the following error for files > 5 GB with the latest version:
2019-10-31 04:31:17,243 cromwell-system-akka.dispatchers.engine-dispatcher-32 WARN - 85d92e7d-3017-4e8d-adac-551ebcd50165-EngineJobExecutionActor-jgi_meta.bbcms:NA:1 [UUID(85d92e7d)]: Failed copying cache results for job BackendJobDescriptorKey_CommandCallNode_jgi_meta.bbcms:-1:1 (EnhancedCromwellIoException: [Attempted 1 time(s)] - S3Exception: The specified copy source is larger than the maximum allowable size for a copy source: 5368709120 (Service: S3, Status Code: 400, Request ID: 1272B7BFF87110E8)), invalidating cache entry.
Is there any progress on this? It effectively disables call-caching when running on AWS Batch, and it would be great to get a fix for this in upcoming versions.
There hasn't been, unfortunately. This is sort of a desperate thought, but maybe you can try to see if you can cache via file path instead of file hash (https://cromwell.readthedocs.io/en/develop/Configuring/#local-filesystem-options). I don't know if it's configured to work for AWS, but "hashing-strategy" could be set to path -- though it may only ever work for the local filesystem and not S3.
# Possible values: file, path, path+modtime
# "file" will compute an md5 hash of the file content.
# "path" will compute an md5 hash of the file path. This strategy will only be effective if the duplication-strategy (above) is set to "soft-link",
# in order to allow for the original file path to be hashed.
# "path+modtime" will compute an md5 hash of the file path and the last modified time. The same conditions as for "path" apply here.
# Default: file
hashing-strategy: "file"
Cromwell-59 still has this problem. A `cacheCopy` of any file >= 5 MB always gets a different ETag, so call-caching is effectively disabled.
The original file's ETag has no `-` in it, i.e. it was not multipart-uploaded:
$ aws s3api head-object --bucket encode-processing --key test-copy-etag/ENCFF641SFZ.trim.srt.nodup.no_chrM_MT.tn5.tagAlign.gz
{
"AcceptRanges": "bytes",
"LastModified": "Wed, 26 May 2021 18:26:37 GMT",
"ContentLength": 164184869,
"ETag": "\"0502111c7c676115303cca9931c2769b\"",
"ContentType": "binary/octet-stream",
"ServerSideEncryption": "AES256",
"Metadata": {}
}
Tried to copy it to a temp location (mimicking `cacheCopy`):
$ aws s3 cp s3://encode-processing/caper_out_v052521/atac/d249d56c-fcdb-4916-a830-2c191920d877/call-bam2ta/shard-0/glob-199637d3015dccbe277f621a18be9eb4/ENCFF641SFZ.trim.srt.nodup.no_chrM_MT.tn5.tagAlign.gz s3://encode-processing/test-copy-etag/
copy: s3://encode-processing/caper_out_v052521/atac/d249d56c-fcdb-4916-a830-2c191920d877/call-bam2ta/shard-0/glob-199637d3015dccbe277f621a18be9eb4/ENCFF641SFZ.trim.srt.nodup.no_chrM_MT.tn5.tagAlign.gz to s3://encode-processing/test-copy-etag/ENCFF641SFZ.trim.srt.nodup.no_chrM_MT.tn5.tagAlign.gz
$ aws s3api head-object --bucket encode-processing --key test-copy-etag/ENCFF641SFZ.trim.srt.nodup.no_chrM_MT.tn5.tagAlign.gz
{
"AcceptRanges": "bytes",
"LastModified": "Wed, 26 May 2021 18:27:41 GMT",
"ContentLength": 164184869,
"ETag": "\"a11d42b4abf4ef9d5de40183f25c520b-20\"",
"ContentType": "binary/octet-stream",
"ServerSideEncryption": "AES256",
"Metadata": {}
}
Got a different ETag, the multipart-style one with the -20 suffix shown in the snapshot above, which no longer matches the original.
If the `copy` method is used for `s3.caching.duplication-strategy`, then call-caching is effectively disabled for all files > 5 MB.
There is another bug in the AWS backend's `s3.caching.duplication-strategy` (https://github.com/broadinstitute/cromwell/issues/6327): the duplication strategy is always fixed to `copy`, so call-caching always fails on AWS for files > 5 MB.
Old question, but on AWS, try the AWS-tailored fork: https://github.com/henriqueribeiro/cromwell/blob/develop_aws/supportedBackends/aws/src/main/scala/cromwell/backend/impl/aws/README.md
This issue is handled there by correcting the multipart size in Cromwell to match the AWS defaults.
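As I understand it, the idea is to make the copier use the same part size and multipart threshold the AWS CLI uses by default, so the copy's multipart ETag comes out identical to the original upload's. A sketch of that idea only (mine, not the fork's actual code; the defaults are the ones documented for the AWS CLI):

```scala
// Mirror the AWS CLI defaults so a multipart copy reproduces the original multipart ETag.
val cliDefaultChunkSize = 8L * 1024 * 1024 // multipart_chunksize default: 8 MB
val cliDefaultThreshold = 8L * 1024 * 1024 // multipart_threshold default: 8 MB

/** Returns Some(partSize) when the copy should be multipart, None for a plain CopyObject. */
def copyPartSize(objectSize: Long): Option[Long] =
  if (objectSize >= cliDefaultThreshold) Some(cliDefaultChunkSize) else None
```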