wdl icon indicating copy to clipboard operation
wdl copied to clipboard

File Bundles and Secondary / Accessory Files

Open illusional opened this issue 5 years ago • 21 comments

Dear WDL team!

I was hoping to source the community opinion on file bundles, or CWL's secondary files. This issue was spawned from: https://github.com/broadinstitute/cromwell/issues/2269 and also within the WDL forum where there was general support for accessory files.

My main input is that I'd like the associated files to be attached within a explicit type, and not annotated on a specific task. Even though it might be easy to determine at runtime whether the list of index files is different, I think it would be clearer to have a bundle type (that might sit alongside the same importing rules as structs).

CWL annotates each input with a list of secondaryFiles that uses a simple query syntax:

  1. If string begins with one or more caret ^ characters, for each caret, remove the last file extension from the path (the last period . and all following characters). If there are no file extensions, the path is unchanged.
  2. Append the remainder of the string to the end of the file path.

(They also allow for an expression that should resolve to a filename or an array of files)

My suggestions

To take with a grain salt, I've got a few suggestions on how this might be expression in WLD to get the ball rolling. I'll use a modified indexed BAM, that has three files ($base.bam, $base.bam.bai, $base.txt) to show how the examples hold up:

  1. Create a bundle that has an implicit base file type (like CWL), and references secondary files using this base file with a secondary files selector syntax, probably the same as CWL for consistency and as far as I know it does the job. If coerced into a File, it should just resolve to the base file (in the following case, a Bam).
bundle NamedModifiedIndexBam {
  bai = ".bai"
  txt = "^.txt"
}

OR (just two suggestions for the same thing, not that WDL should accept both)

bundle AnonymousModifiedIndexBam = [".bai", "^.txt"]
  1. Create a more explicit bundle that has a basename, and must explicitly reference each associated file. It might be a good idea to have a base property, that would be the resolver when passing to the command line, or potential coersion into a file. Benefits are it doesn't require the query language and makes it clearer.
bundle ExplicitModifiedIndexBam {
  bam = ".bam"	# or base = ".bam"
  bai = ".bam.bai"
  txt = ".txt"
}

Then you could reference them in the same way you do primitives or structs:

task my_task {
  ModifiedIndexBam bamFile
  NamedModifiedIndexBam namedBamFile
  ExplicitModifiedIndexBam explicitBamFile

  command {
    echo ${bamFile}             # :: $base.bam
    echo ${namedBamFile.base}   # :: $base.bam			
    echo $(explicitBamFile.bai) # :: $base.bam.bai
  }

  # example of outputs 
  output {
    ModifiedIndexBam bamOut = glob("output.bam")
    NamedModifiedIndexBam namedOut = glob("output.bam")
    ExplicitModifiedIndexBam explicitOut = glob("output")
  }
}

illusional avatar Feb 18 '19 00:02 illusional

I can definitely see some cases where something like this would be useful, indexes being one obvious example. I'm wondering whether it would not be an option to simply allow defining defaults in a struct, rather then introducing an entire new type? I'm geussing there was some reason this isn't allowed though, considering it is expressly not permitted.

Struct indexedBamFile {
    File bam
    File index = bam + ".bai"
    File txt = sub(bam, "\\.bam$", ".txt")
}

Or just to throw out another idea, adding an Implicit (or Inferred) block:

Struct indexedBamFile {
    File bam
    Implicit {
        File index = bam + ".bai"
        File txt = sub(bam, "\\.bam$", ".txt")
    }
}

Either way, I feel that the second option you provide here would be nicer. Giving ^ some special meaning seems like it might get rather confusing, expescially considering it already has a special meaning in regex.

DavyCats avatar Feb 18 '19 08:02 DavyCats

My first thought was to also latch on to the Struct concept somehow, although at the moment it doesn't map cleanly as @DavyCats points out.

We could potentially describe a mechanism for (de)localizing entire Struct objects (I don't think the spec currently allows for that) and then as @DavyCats describes, some scheme for pattern matching.

geoffjentry avatar Feb 18 '19 22:02 geoffjentry

I am all in favor of making it easier to keep track of accessory files but weary of introducing any new keywords/syntax. @DavyCats's first suggestion is what I would prefer most, seeing as it's simple, intuitive, customizable, and doesn't introduce new keywords/syntax. This functionality would have innumerable uses for other data types as well.

mwalker174 avatar Feb 18 '19 23:02 mwalker174

I like the Implicit keyword as I think it's immediately clear that you don't have to explicitly handle them - however it does mean new syntax. With either of those suggestions, is it enough to to just state the whole variable (of type indexedBamFile), or would you need to use the dot syntax.

eg:

task test {
  IndexedBamFile myBamBaiBundle
  command {
    baseCommand \
      --bam ${myBamBaiBundle} \ # or 
      --bam2 ${myBamBaiBundle.bam}
  }
}

On a similar note, can you construct structs in the output? The documentation doesn't have a clear example of output construction, for example can you do this?

struct PersonWithFile {
    String name
    File file
}

task test {
    ⋮
    output {
        PersonWithFile person1 = { "name": "Michael", file="/path/to/file" }
        PersonWithFile person2 = { "name": "Miguel", file=glob("*.txt")[0] }
    }
}

illusional avatar Feb 19 '19 00:02 illusional

I like @DavyCats first idea. It is explicit and therefore readable. It is also very easy to intuit what it means. And it looks nice. In conclusion, it is an example of good design.

I like the Implicit keyword as I think it's immediately clear that you don't have to explicitly handle them - however it does mean new syntax.

I hate implicit things. As you have to know/guess what they mean, which makes it harder to read the code. And most of the work is in maintaining your pipelines (AKA reading them), not so much in writing them. This Implicit keyword is different however. It is explicit in its meaning that stuff is derived implicitly. So calling it Implicit is a bit of a misnomer. But anyway, it adds an extra keyword and as shown by @DavyCats an extra keyword is not necessary for having this functionality. So I am against adding this keyword.

EDIT: On second thought. Inferred (as @DavyCats suggests) in structs would be okay with me. As I can imagine that it will be hard to parse the WDL without such a keyword in place. We also have to be a bit gentle to the cromwell developers.

rhpvorderman avatar Feb 20 '19 08:02 rhpvorderman

I'd personally avoid Implicit, but I think that's my bias as a Scala developer as implicit is a very hot button issue in the scala world.

I like the Inferred suggestion from @rhpvorderman

geoffjentry avatar Feb 20 '19 18:02 geoffjentry

Hey all, just stirring the conversation again, it seems most like the Inferred suggestion from @DavyCats, I think that means something like:

struct IndexedBamFile {
    File bam                                    # file.bam
    Inferred {
        File index = bam + ".bai"               # file.bam.bai
        File txt = sub(bam, "\\.bam$", ".txt")  # file.txt
    }
}

Do you envision you can drop this straight into a command or have to explicitly reference the base, something like:

  IndexedBam bamFile
  command {
    echo ${bamFile}
    # or
    echo ${bamFile.bam}
  }

And how would you construct this, like globbing the output:

task test {
    ⋮
    output {
        IndexedBam bamFile = { "bam": "myBamFile.bam" }
        # or
        IndexedBam bamFile2 = IndexedBam(bam="myBamFile.bam")

        # can you specifically address the index files, or should this cause an error
        IndexedBam bamFileExp = IndexedBam(bam="myBamFile.bam", txt="myBamFile.txt")
    }
}

Edit: I read the Governance#RFC document. After the community draws towards a consensus, can I request a Shepherd, or I'm able to write up a PR and then can I request a core member to review it?

illusional avatar Mar 28 '19 22:03 illusional

Since this would be an extension of Struct, I image you would still have to explicitly reference the base. Structs can hold multiple (non-inferred) values after all:

struct Income {
    Float revenue
    Float cost
    
    Inferred {
        Float profit = revenue - cost
    }    
}

And you would contruct it the same way as you would a struct: Icome(revenue=1.1,cost=1.2). I don't think there would be a problem with letting people explicitly set the inferred values, but I wouldn't mind either way.

DavyCats avatar Mar 29 '19 08:03 DavyCats

I concur with the general consensus of the inferred attribute, and I would stay away from any explicit concept of bundles. the struct framework is flexible enough to offer this type of extended functionality IMO. The one thing that may be a bit confusing, is the syntax of the Inferred block gives the implied feeling that the inferred attribute is not something you can explicitly set. But this just might be my own thoughts.

To me the inferred block almost feels like a default value. (in which case we could kill two birds with one stone, as its inevitable there will be a request for default values)

struct Income {
    Float revenue
    Float cost
    Float profit

    default {
        profit = revenue - cost
    }    
}

patmagee avatar Apr 01 '19 19:04 patmagee

Hey @patmagee, I thought I'd reply here instead of your comment (Struct literal Revisit: #297).

I think bootstrapping off the struct concept makes sense, but I'm not sure how you could construct the struct in the output, eg: return a BamBai pair, how this works with globs and coercible types, etc.

illusional avatar Apr 01 '19 21:04 illusional

@illusional at the moment that functionality is not supported by the language. As your question revels theres alot of considerations that we must sort out for it that to be possible

patmagee avatar Apr 01 '19 23:04 patmagee

A request from a CWL perspective: please make sure that any choice here is explicit that the primarily file and its secondaries must be co-localized into the same directory.

This would clear up a common challenge around colocation assumptions when doing WDL2CWL conversion/transpiling

mr-c avatar Dec 15 '21 10:12 mr-c

Looking over this now, I think this would be a really useful addition that for novice users would reduce a lot of confusion for genomics work when things like bam indexes don't colocalize with their bams (or occasionally are copied first and thus the various tools reject them b/c their creation time is before the time of the bam), or reference files aren't explicitly named beyond the fasta file but tasks will fail if they are missing (such as the package commonly used for bwa mem alignment references).

I think the concept of using a struct makes the most sense to me, BUT adding an "Inferred" section to the Struct definition makes more sense to me than providing defaults for variables in the Struct that were already defined. I like Inferred over Implicit as well, to keep that section more clear for users.

vortexing avatar Jan 03 '22 18:01 vortexing

After thinking about this a lot more, I feel like the Inferred or default syntax is not the right direction and is just syntactic sugar on concepts we already use in wdl.

I think what we have is realistically two separate issues that are intrinsically related but distinct at the same time

  1. How do we reduce boilerplate to prevent a user of wdl from having to always define every single secondary file
  2. How do we ensure file colocalization of distinct files

While 1) implies 2) in many cases, 2 does not imply 1. So I think separating these problems into two separate issues would help us solve a more general case that has plagued wdl users for a long time (ie colocalization).

Secondary Files

When it comes to secondary files, I think structs (with a minor tweak) provide the most flexible and extensible approach.

What I propose is that we allow expressions and assignments within structs. Values without an expression and assignment are required, while values with an assignment can be omitted when. This is a concept we already extensively use with the input blocks of tasks and workflows so we are not introducing anything new.

Additionally, I would allow the struct literals to be callable with any combination of "optional" values, or only the required values.

Finally I would introduce a new function resolve(File, String |Array[String]) which, given a file and one or more extensions will attempt to resolve to a file with the given extension relative to the passed in file.

This approach drastically reduces boilerplate, but allows complete flexibility when needed. Additionally it is not relying on any new concepts or syntax that does not already exist in the language.

For example


struct Sample {
  File bam
  File index = resolve(bam,"bai")
  File checksum = resolve(bam,["md5","sha"]
}


Sample foo = Sample { bam: "foo.bam" }
Sample bar = Sample { bam: "bar.bam", checksum: "checksums/bar.md5" }

Colocalization

And now for the other problem. I think we need a more general mechanism ton co-localize files. This has been a pain point for a long time. I have not though as throughly through this, but what about a syntax like the following

task align {
  input {
    group {
      BwaBundle bwa_bundle
      File other file
    }
    group {
     Sample sample
    }
  }
}

This would use a keyword, (here called group) to indicate to an engine that files should be co-localized. This allows the engine to still be flexible with where those files actually end up, and it extends beyond just structs.

It is also important to remember directory's are coming (likely on 1.2) and we will be working on a directory listing syntax as well which may accomplish some of what this would achieve

patmagee avatar Mar 11 '23 03:03 patmagee

A different approach we can take is ditching the "magic" of co-localization entirely and make this an explicit thing using the Directory type introduced in the development version and slotted for inclusion in 1.2.

I think there may have been an assumption that a Directory maps onto a literal Directory in a cloud or object store, but If we interpret it a bit more loosely as a virtual directory we can really empower a user.

What if we introduce either an engine function called create_directory or figure out the syntax of a Directory literal (I am more in favour of this). This could allow the ultimate flexible option for users which extends beyond just structs and tasks. Doubling down on this approach we can also define type coercions (or explicit to a directory as well, this would allow a function to have a type signature of: create_directory(Array[File|Directory]). we could extend this to add options: create_directory(Array[File|Directory],DirectoryOptions)

# Reserved options struct
DirectoryOptions options = DirectoryOptions {
  Boolean? keys_as_names
  Boolean? rewrite_collisions
  Boolean? flatten
}

create_directory(Map[String,X],DirectoryOptions)
create_directory(Array[X],DirectoryOptions)
create_directory(Pair[X,X],DirectoryOptions)
create_directory(Struct,DirectoryOptions)
create_directory(File,DirectoryOptions)
create_directory(Directory,DirectoryOptions)
## Maps
Map[String,File]  map, = { "foo": "file.tar.gz", "bar": "file.idx"}

create_directory(map) =>
| - "file.tar.gz"
| - "file.idx"

create_directory(map,DirectoryOptions { keys_as_names : true } )
| - foo
| - bar

Map[String,Map[String,File]] map2 = { "foo": { "bar": "biz.txt" } }
create_directory(map2)
| - foo
   | - biz.txt

# Arrays
Array[File] array = ["foo.txt", "bar.txt"]
create_directory(array)
| - foo.txt
| - bar.txt

# Structs
Sample sample = Sample { bam: "sample.bam" }
create_directory(sample)
| - sample.bam
| - sample.bam.bai
| - sample.bam.md5

struct ExtendedSample {
  File meta_info
  Sample sample
}

ExtendedSample extended = ExtendedSample { meta_info: "meta.txt", sample: sample }
create_directory(extended)
| - "meta.txt"
| - sample
  | - sample.bam
  | - sample.bam.bai
  | - sample.bam.md5

create_directory(extended,DirectoryOptions { flatten : true } )

| - "meta.txt"
| - sample.bam
| - sample.bam.bai
| - sample.bam.md5

we could also consider some sort of directory listing literal as an alternative to the above

File file1,
File file2,
File file3
Array[File] files

Directory {
   file1,
   flatten sample, # flatten files in struct
   sample, # write the struct to a directory called sample
   sample as "sample_dir", # write files in struct to a new die
   file2 as "my_file.txt"
   "sub_folder" {
     file3,
     files
   }
}

patmagee avatar Mar 11 '23 15:03 patmagee

My two cents: secondary files become redundant when the colocalization problem is solved. Sure it is mildly annoying that you have to spell out all the index files in a BWA index, but this is a one time effort. Structs already solve this problem where you can bundle multiple files together in one struct and then use that struct as input.

Colocalization is the real problem. It mainly happens with bam or vcf files that are produced while the producing tasks does not also create the index. Our current solution in BioWDL is to always index in the producing tasks, but that is not always possible. We do have separate index steps but these are required to

  1. Always give back the index AND the original file, as otherwise there might be problems with colocalization
  2. Do weird stuff with soft or hard links in order to get original file and the index in the same directory. Whether this works properly is also implementation specific. (There are some differences between MiniWDL and Cromwell in this regard).

I feel the group keyword as described by @patmagee would solve this problem.


task VariantCall {
    input {
        group {
             File bam
             File bamIndex
        }
    }
}

That solves the real world problem we have now. Making sure bam and bamindex have the correct name is trivial in WDL because the basename and sub things work together well to correctly do this. With regex support.

As for the proposal for leveraging structs and using a resolve function. When working at biowdl we decided that using structs for simple indexes such as a Bam and an index or a VCF and an index is massive overkill. You require the user to create an extra level of nesting in the inputs JSON, which is very annoying, and it is just one file. For BWA and other aligner indexes we still do this. But I have to say that we recently converted to indexing the reference on the fly for users so they don't have to bother with typing everything out anymore. In short, it would be a nice to have but not critical.

group on the other hand is massively critical in that it would fix quite a lot of workarounds that we have now in our tasks, including some hacks to prevent errors on different implementations.

rhpvorderman avatar Mar 24 '23 05:03 rhpvorderman

@rhpvorderman I disagree that having to specify every file is not an issue. It's been one of the most common complaints I've heard about WDL over the years. Yes the colocalization is important, but there is also a desire to not need to specify everything.

geoffjentry avatar Mar 24 '23 13:03 geoffjentry

@geoffjentry Even if the colocalization makes indexing on the fly so easy that it becomes trivial to add a few indexing tasks to the workflow? In that case there is no need for the users to specify anything anymore. Or am I looking too much through my "developer of workflows" glasses now?

rhpvorderman avatar Mar 24 '23 13:03 rhpvorderman

@rhpvorderman I think it's more an over focus on the canonical "bam and it's index" example. Which yes is the most common example for a reason but not the only situation of "these files all go together". As an off the cuff example that happens to be top of mind, think of all the outputs from a DRAGEN build_hashtable (tho TBH that'd probably be better served as a Directory, but hey)

geoffjentry avatar Mar 24 '23 13:03 geoffjentry

@geoffjentry or the BWA index of course. I think leveraging structs and adding a function or other syntactic sugar like @patmagee proposed is the best option for that.

I am not too fond of this extra resolve function though

How about allowing defaults in structs to refer other members, something as simple as

struct BwaIndex = {
     File fasta
     File amb = fasta + ".amb"
     File ann = fasta + ".ann" 
     File bwt = fasta + ".bwt"
     File pac = fasta + ".pac"
     File sa = fasta + ".sa"
     File? alt = fasta + ".alt"  # This is the tricky one. How should the engine react when it is not there? Implicitly set it to None?
}

EDIT: I see @DavyCats already proposed this here: https://github.com/openwdl/wdl/issues/289#issuecomment-464640032

rhpvorderman avatar Mar 24 '23 13:03 rhpvorderman

Even with structs being used to bundle related files and co-localization being solved (somehow) there is a problem where a reader of the code or a linter cannot tell if File amb = fasta + ".amb" refers to some input which is expected to be there but never explicitly referenced or if it is an actual unused variable that should be cleaned up. For that reason I think there does need to be some mechanism to mark a variable to indicate that I really do mean to have this here even though it's not obvious why.

The hoops we have to jump through because genomics applications don't directly reference all of their inputs 🙄

markjschreiber avatar Feb 07 '24 18:02 markjschreiber