Determining input equivalence with File checksums and Directory listings
A shortcoming of the `File` and `Directory` data types is the inability to guarantee the equivalence of inputs, which breaks reproducibility and makes the implementation of call caching (which we call "job reuse" on DNAnexus) difficult. Say I have the following directory in my cloud storage:

```
foo
|_bar
|_baz
```
I run a job with a `Directory`-type input and provide `foo` as the value. Now I add a new file `blorf` to the `foo` directory, and I replace `baz` with a new file of the same name but with different contents. I run the job again. Given just the directory name, how does my implementation know whether the contents of that directory have changed since the first time I ran the job?
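For illustration, here is a minimal sketch of one way an engine could answer that question: walk the directory in a deterministic order, hash each entry's relative path and contents, and combine the results into a single fingerprint that changes whenever anything is added, removed, renamed, or modified. The function name and the choice of SHA-256 here are mine, purely for illustration, not part of the proposal.

```python
import hashlib
from pathlib import Path

def directory_fingerprint(root: str) -> str:
    """Digest of a directory tree: changes if any entry is added,
    removed, renamed, or has its contents modified."""
    digest = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):  # sorted => deterministic
        digest.update(path.relative_to(root).as_posix().encode())
        if path.is_file():
            digest.update(path.read_bytes())
    return digest.hexdigest()

# Adding `blorf` or rewriting `baz` changes the fingerprint,
# so the engine knows it cannot reuse the cached result.
```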
I propose to formally define two alternative JSON formats for the `File` and `Directory` types in the standard input/output formats. Rather than reinvent the wheel, we will just borrow the terminology from CWL.
```json
{
  "file1": "/path/to/file",
  "dir1": "/path/to/dir",
  "dir2": {
    "location": "/path/to/dir",
    "listing": [
      {
        "type": "File",
        "basename": "foo.txt",
        "checksum": "sha256:ABC123"
      },
      {
        "type": "Directory",
        "basename": "bar",
        "listing": [
          {
            "type": "File",
            "basename": "baz.txt",
            "checksum": "md5:WTFBBQ42"
          }
        ]
      }
    ]
  },
  "dir3": {
    "basename": "fakedir",
    "listing": [
      {
        "type": "File",
        "location": "/path/to/foo.txt",
        "basename": "bar.txt",
        "checksum": "sha256:ABC123"
      }
    ]
  }
}
```
In the simple form, a `File`/`Directory` value is just a string, typically a local path or a URI. The object forms have the following fields:
- `File`
  - `type`: Always `"File"`; optional at the top level but required within directory listings.
  - `location`: The file location; this is equivalent to the value in the simple form. May be absent if the file is within a listing, as long as `basename` is specified.
  - `basename`: The name of the file relative to the containing directory. If the basename differs from the actual file name at the given location, the file must be localized with the given basename.
  - `checksum`: A checksum of the file using one of the approved algorithms. If specified, the checksum must be verified during localization.
- `Directory`
  - `type`: Always `"Directory"`; optional at the top level but required within directory listings.
  - `location`: The directory location; this is equivalent to the value in the simple form. May be absent if the directory is within a listing, as long as `basename` is specified.
  - `basename`: The name of the directory relative to the containing directory. If the basename differs from the actual directory name at the given location, the directory must be localized with the given basename. If `location` is not specified, then `basename` and `listing` are required, and all files/directories in the listing must have a location that is an absolute path or URI.
  - `listing`: An array of files/subdirectories within the directory. May be nested to any degree.
Importantly, none of these fields will be exposed within WDL, so the runtime definition of `File`/`Directory` won't change.
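To make the checksum rule concrete, here is a hedged sketch of how an engine might verify a `File` object's checksum during localization. The `verify_checksum` helper and its error handling are hypothetical; the proposal only mandates that verification happen, not how.

```python
import hashlib

def verify_checksum(local_path: str, checksum: str) -> None:
    """Check a localized file against an 'algorithm:hexdigest' string,
    e.g. 'sha256:ABC123'. Hypothetical helper, for illustration only."""
    algorithm, _, expected = checksum.partition(":")
    hasher = hashlib.new(algorithm)  # e.g. "sha256", "md5"
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            hasher.update(chunk)
    if hasher.hexdigest() != expected.lower():
        raise ValueError(
            f"checksum mismatch for {local_path}: "
            f"expected {expected}, got {hasher.hexdigest()}"
        )
```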
Draft implementation: https://github.com/openwdl/wdl/tree/472-directory-listing
This is a great idea @jdidion and is something I have been wondering about myself. One quick question: would you consider exposing some of those properties within WDL proper? A `File` might only ever be assigned with a URL, but once defined, should it have attributes (checksum, basename, even size)?
Well, remember that we already have a `size` function. If we want to expose any other attributes (which I don't think is necessary), we should also do so with functions.
You are correct (my late-night brain jumped a bit too far down the OOP path).
If we are going for checksums, might I jump in and steer us away from SHA-256? That is a cryptographic hash. Cryptographic hashes are designed to be slow, so as to resist brute-force attacks. That is overkill for a file; we just want a cyclic-redundancy-style check to make sure the file is the same. In bioinformatics we use 50 GB+ files quite often, so it makes quite a difference whether a fast or slow hash is used.
May I suggest using XXHash? It is extremely fast. I have already implemented XXHash in Cromwell, and since Java and Python bindings are available, it is really no strain on engine developers.
QED (benchmarks with hyperfine):

```
$ du -h big2.fastq.gz
657M    big2.fastq.gz

Benchmark #1: md5sum big2.fastq.gz
  Time (mean ± σ):     939.8 ms ±  12.8 ms    [User: 891.8 ms, System: 48.0 ms]
  Range (min … max):   919.2 ms … 958.1 ms    10 runs

Benchmark #1: sha1sum big2.fastq.gz
  Time (mean ± σ):     925.2 ms ±  10.5 ms    [User: 878.7 ms, System: 46.4 ms]
  Range (min … max):   903.9 ms … 941.8 ms    10 runs

Benchmark #1: sha256sum big2.fastq.gz
  Time (mean ± σ):      2.322 s ±  0.024 s    [User: 2.273 s, System: 0.049 s]
  Range (min … max):    2.295 s …  2.365 s    10 runs

Benchmark #1: sha384sum big2.fastq.gz
  Time (mean ± σ):      1.596 s ±  0.008 s    [User: 1.552 s, System: 0.044 s]
  Range (min … max):    1.587 s …  1.614 s    10 runs

Benchmark #1: sha512sum big2.fastq.gz
  Time (mean ± σ):      1.611 s ±  0.014 s    [User: 1.573 s, System: 0.038 s]
  Range (min … max):    1.582 s …  1.632 s    10 runs

Benchmark #1: xxh32sum big2.fastq.gz
  Time (mean ± σ):     138.5 ms ±  11.1 ms    [User: 91.1 ms, System: 47.4 ms]
  Range (min … max):   127.2 ms … 150.0 ms    10 runs

Benchmark #1: xxh64sum big2.fastq.gz
  Time (mean ± σ):      99.1 ms ±   9.4 ms    [User: 47.8 ms, System: 51.3 ms]
  Range (min … max):    84.6 ms … 106.4 ms    10 runs

Benchmark #1: xxh128sum big2.fastq.gz
  Time (mean ± σ):      84.9 ms ±   8.6 ms    [User: 37.2 ms, System: 47.7 ms]
  Range (min … max):    72.6 ms …  91.5 ms    10 runs
```
xxh128sum is roughly 11x faster than md5sum and 27x faster than sha256sum here. I think sha512sum being a lot faster than sha256sum is due to hardware optimizations (SHA-512 operates on 64-bit words).
Also, 64-bit and 128-bit hashes can be represented as 16-char and 32-char hex strings, which are much easier to handle than 64-char hex strings. (I prefer the 16-char ones; much easier to check for typos/copy-paste errors!)
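As a taste of the Python bindings mentioned above, here is a sketch using the `xxhash` package to hash a large file in streaming fashion. The helper name, chunk size, and the `xxh64:` checksum prefix are my own illustrative choices, not anything the proposal specifies.

```python
import xxhash

def xxh64_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a (potentially 50 GB+) file through XXH64 in 1 MiB chunks,
    so memory use stays constant regardless of file size."""
    hasher = xxhash.xxh64()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()  # 16-char hex string

# e.g. {"location": "/path/to/file",
#       "checksum": f"xxh64:{xxh64_file('/path/to/file')}"}
```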
Hi all, just a comment here: I'd consider generalizing/weakening this to say that the `File`/`Directory` representation may be a JSON object with a `location` key and whatever else the engine may wish to include or interpret. Going beyond that may be too prescriptive of implementation details. For example, miniwdl's call cache just uses the filesystem mtimes instead of digests.
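For comparison, a minimal sketch of the mtime-style check described above: record each input's modification time when the call result is cached, and reuse the result only if every input still matches. This is my own illustration, not miniwdl's actual code.

```python
import os

def inputs_unchanged(recorded_mtimes: dict[str, int]) -> bool:
    """True if every input file still exists with the mtime (in ns)
    recorded when the call was cached. Illustration only."""
    return all(
        os.path.exists(path) and os.stat(path).st_mtime_ns == mtime
        for path, mtime in recorded_mtimes.items()
    )
```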
@mlin Good point. Checksum shouldn't be required, but we can make a suggestion to use checksums or some other means of determining file equality. And to Ruben's point, we can suggest checksum algorithms but not require that any specific algorithm be used.