
Determining input equivalence with File checksums and Directory listings

Open jdidion opened this issue 2 years ago • 6 comments

A shortcoming of the File and Directory data types is the inability to guarantee the equivalence of inputs, which breaks reproducibility and makes the implementation of call caching (which we call "job reuse" on DNAnexus) difficult. Say I have the following directory in my cloud storage:

foo
|_bar
|_baz

I run a job with a Directory-type input and provide foo as the value. Now I add a new file blorf to the foo directory, and I replace baz with a new file of the same name but with different contents. I run the job again. Given just the directory name, how does my implementation know that the contents of that directory have remained unchanged since the first time I ran the job?
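One way an engine could detect such changes is to compute a digest over the whole directory tree. A minimal sketch (illustrative only, not part of this proposal):

```python
import hashlib
from pathlib import Path

def directory_digest(root: str) -> str:
    """Hash every file's relative path and contents, so that adding,
    removing, renaming, or modifying any file changes the digest."""
    h = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            h.update(str(path.relative_to(root)).encode())
            h.update(path.read_bytes())
    return h.hexdigest()
```

An engine could record this digest at submission time and compare it against the digest from a previous run to decide whether the cached result is reusable.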

I propose to formally define two alternative JSON formats for the File and Directory types in the standard input/output formats. Rather than reinvent the wheel, we will just borrow the terminology from CWL.

{
  "file1": "/path/to/file",
  "dir1": "/path/to/dir",
  "dir2": {
    "location": "/path/to/dir",
    "listing": [
      {
        "type": "File",
        "basename": "foo.txt",
        "checksum": "sha256:ABC123"
      },
      {
        "type": "Directory",
        "basename": "bar"
        "listing": [
          {
            "type": "File",
            "basename": "baz.txt",
            "checksum": "md5:WTFBBQ42"
          }
        ]
      }
    ]
  },
  "dir3": {
    "basename": "fakedir",
    "listing": [
      {
        "type": "File",
        "location": "/path/to/foo.txt",
        "basename": "bar.txt",
        "checksum": "sha256:ABC123"
      }
    ]
  }
}

In the simple form, a File/Directory value is just a string - typically a local path or a URI. The object forms have the following fields:

  • File
    • type: Always "File"; optional at the top-level but required within directory listings
    • location: The file location - this is equivalent to the value in the simple form. May be absent if the file is within a listing as long as basename is specified.
    • basename: The name of the file relative to the containing directory. If the basename differs from the actual file name at the given location, the file must be localized with the given basename.
    • checksum: A checksum of the file using one of the approved algorithms. If specified, the checksum must be verified during localization.
  • Directory
    • type: Always "Directory"; optional at the top-level but required within directory listings
    • location: The directory location - this is equivalent to the value in the simple form. May be absent if the directory is within a listing as long as basename is specified.
    • basename: The name of the directory relative to the containing directory. If the basename differs from the actual directory name at the given location, the directory must be localized with the given basename. If location is not specified, then basename and listing are required, and all files/directories in the listing must have a location that is an absolute path or URI.
    • listing: An array of files/subdirectories within the directory. May be nested to any degree.

Importantly, none of these fields will be exposed within WDL, so the runtime definition of File/Directory won't change.
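To make the scheme concrete, here is a hypothetical sketch of how an engine might generate the object form for a local directory. The field names follow the CWL-style scheme described above; the function itself and the choice of SHA-256 are illustrative assumptions, not part of the proposal:

```python
import hashlib
from pathlib import Path

def describe(path: Path) -> dict:
    """Build a File/Directory object in the proposed listing format."""
    if path.is_file():
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        return {
            "type": "File",
            "basename": path.name,
            "checksum": f"sha256:{digest}",
        }
    return {
        "type": "Directory",
        "basename": path.name,
        # Recurse into subdirectories; listings may be nested to any depth.
        "listing": [describe(child) for child in sorted(path.iterdir())],
    }
```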

Draft implementation: https://github.com/openwdl/wdl/tree/472-directory-listing

jdidion avatar Aug 10 '21 23:08 jdidion

This is a great idea @jdidion and is something I have been wondering about myself. One quick question I have here: would you consider exposing some of those properties within WDL proper? A File might only ever be assigned with a URL, but once defined, should it have attributes (checksum, basename, even size)?

patmagee avatar Aug 11 '21 03:08 patmagee

Well, remember that we already have a size function. If we want to expose any other attributes (which I don't think is necessary), we should also do so with functions.

jdidion avatar Aug 11 '21 15:08 jdidion

You are correct (my late-night brain jumped a bit too far down the OOP path)

patmagee avatar Aug 11 '21 16:08 patmagee

If we are going for checksums, might I jump in and argue against using SHA256? That is a cryptographic hash, and cryptographic hashes are designed to be slow so as to resist brute-force attacks. That is overkill for a file: we just want a cyclic redundancy check to make sure the file is the same. In bioinformatics we use 50 GB+ files quite often, so it makes quite a difference whether a fast or slow hash is used.

May I suggest using XXHash? It is extremely fast. I already implemented XXHash in Cromwell. Since Java and Python bindings are available, it is really no strain on engine developers.

QED (benchmarks with hyperfine):

$ du -h big2.fastq.gz
657M	big2.fastq.gz

Benchmark #1: md5sum big2.fastq.gz
  Time (mean ± σ):     939.8 ms ±  12.8 ms    [User: 891.8 ms, System: 48.0 ms]
  Range (min … max):   919.2 ms … 958.1 ms    10 runs

Benchmark #1: sha1sum big2.fastq.gz
  Time (mean ± σ):     925.2 ms ±  10.5 ms    [User: 878.7 ms, System: 46.4 ms]
  Range (min … max):   903.9 ms … 941.8 ms    10 runs
 
Benchmark #1: sha256sum big2.fastq.gz
  Time (mean ± σ):      2.322 s ±  0.024 s    [User: 2.273 s, System: 0.049 s]
  Range (min … max):    2.295 s …  2.365 s    10 runs

Benchmark #1: sha384sum big2.fastq.gz
  Time (mean ± σ):      1.596 s ±  0.008 s    [User: 1.552 s, System: 0.044 s]
  Range (min … max):    1.587 s …  1.614 s    10 runs

Benchmark #1: sha512sum big2.fastq.gz
  Time (mean ± σ):      1.611 s ±  0.014 s    [User: 1.573 s, System: 0.038 s]
  Range (min … max):    1.582 s …  1.632 s    10 runs

Benchmark #1: xxh32sum big2.fastq.gz
  Time (mean ± σ):     138.5 ms ±  11.1 ms    [User: 91.1 ms, System: 47.4 ms]
  Range (min … max):   127.2 ms … 150.0 ms    10 runs

Benchmark #1: xxh64sum big2.fastq.gz
  Time (mean ± σ):      99.1 ms ±   9.4 ms    [User: 47.8 ms, System: 51.3 ms]
  Range (min … max):    84.6 ms … 106.4 ms    10 runs

Benchmark #1: xxh128sum big2.fastq.gz
  Time (mean ± σ):      84.9 ms ±   8.6 ms    [User: 37.2 ms, System: 47.7 ms]
  Range (min … max):    72.6 ms …  91.5 ms    10 runs

xxh128sum is roughly 11x faster than md5sum and 27x faster than sha256sum. I think the reason sha512sum is a lot faster than sha256sum is hardware optimization (SHA-512 uses 64-bit word operations, which suit 64-bit CPUs).

Also 64-bit and 128-bit hashes can be represented as 16-char and 32-char hex-strings which is much easier to type than 64-char hex-strings. (I prefer the 16-char ones, much easier to check for typos/copy-paste errors!).
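Whichever algorithms end up approved, verifying a checksum string of the form "algorithm:hexdigest" during localization could look like the following sketch. It uses Python's hashlib, which covers md5/sha*; an XXHash verifier would need a third-party binding such as the `xxhash` package (assumption, not required by anything in this thread):

```python
import hashlib

def verify_checksum(path: str, checksum: str) -> bool:
    """Check a file against an "algorithm:hexdigest" checksum string.

    Streams the file in 1 MiB chunks so large files are not read
    into memory at once.
    """
    algorithm, _, expected = checksum.partition(":")
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected.lower()
```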

rhpvorderman avatar Oct 08 '21 10:10 rhpvorderman

Hi all, just a comment here, I'd consider generalizing/weakening this to say that the File/Directory representation may be a JSON object with a location key and whatever else the engine may wish to include or interpret. Going beyond that may be too prescriptive of implementation details. For example, miniwdl's call cache just uses the filesystem mtimes instead of digests.

mlin avatar Mar 22 '23 20:03 mlin

@mlin Good point. Checksum shouldn't be required. But we can make a suggestion to use checksums or some other means of determining file equality. And to Ruben's point, we can suggest checksum algorithms, but not require that any specific algorithm be used.

jdidion avatar Mar 29 '23 21:03 jdidion