miniwdl
Index files may not localize alongside data files when basenames collide
The task is as follows: (original file here: https://github.com/biowdl/tasks/blob/3a01f3b414de6be3e686e6e1db68d80bdab52b73/gatk.wdl#L441)
task CombineGVCFs {
    input {
        Array[File]+ gvcfFiles
        Array[File]+ gvcfFilesIndex
        Array[File] intervals = []
        String outputPath
        File referenceFasta
        File referenceFastaDict
        File referenceFastaFai
        String javaXmx = "4G"
        String memory = "5G"
        Int timeMinutes = 1 + ceil(size(gvcfFiles, "G") * 8)
        String dockerImage = "quay.io/biocontainers/gatk4:4.1.8.0--py38h37ae868_0"
    }
    command {
        set -e
        mkdir -p "$(dirname ~{outputPath})"
        gatk --java-options '-Xmx~{javaXmx} -XX:ParallelGCThreads=1' \
        CombineGVCFs \
        -R ~{referenceFasta} \
        -O ~{outputPath} \
        -V ~{sep=' -V ' gvcfFiles} \
        ~{true='-L' false='' length(intervals) > 0} ~{sep=' -L ' intervals}
    }
...
GATK crashes, complaining that the variant files have no proper indexes. That is not true: the indexes do exist. So I checked the filesystem that was generated:
/mnt/miniwdl_task_container/work/_miniwdl_inputs
/mnt/miniwdl_task_container/work/_miniwdl_inputs/11
/mnt/miniwdl_task_container/work/_miniwdl_inputs/11/scatter-0.bed.g.vcf.gz
/mnt/miniwdl_task_container/work/_miniwdl_inputs/13
/mnt/miniwdl_task_container/work/_miniwdl_inputs/13/scatter-0.bed.g.vcf.gz.tbi
/mnt/miniwdl_task_container/work/_miniwdl_inputs/12
/mnt/miniwdl_task_container/work/_miniwdl_inputs/12/scatter-1.bed.g.vcf.gz
/mnt/miniwdl_task_container/work/_miniwdl_inputs/0
/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/scatter-2.bed.g.vcf.gz
/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/scatter-1.bed.g.vcf.gz
/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/scatter-1.bed.g.vcf.gz.tbi
/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/reference.dict
/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/scatter-2.bed.g.vcf.gz.tbi
/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/reference.fasta
/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/reference.fasta.fai
/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/scatter-0.bed.g.vcf.gz.tbi
/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/scatter-0.bed.g.vcf.gz
/mnt/miniwdl_task_container/work/_miniwdl_inputs/14
/mnt/miniwdl_task_container/work/_miniwdl_inputs/14/scatter-2.bed.g.vcf.gz
/mnt/miniwdl_task_container/work/_miniwdl_inputs/9
/mnt/miniwdl_task_container/work/_miniwdl_inputs/9/scatter-2.bed.g.vcf.gz.tbi
/mnt/miniwdl_task_container/work/_miniwdl_inputs/15
/mnt/miniwdl_task_container/work/_miniwdl_inputs/15/scatter-1.bed.g.vcf.gz.tbi
So interestingly, the fasta, fasta index and dictionary are correctly put in the same folder. However, the scatter VCFs and their indexes are placed on the filesystem in no logical structure. Please note that all our BioWDL tasks always create VCFs with their indexes in the same directory. Cromwell then also localizes them in the same directory.
The task that does the variant calling also creates the index (so they should always be co-localized). This happens in a subworkflow. This subworkflow has both the VCF files and the indexes as outputs. Checking the debug logs I find these paths:
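A quick way to confirm this from a file listing is to check that every index has its data file next to it. The following is only a diagnostic sketch: the helper name `orphaned_indexes` is made up for illustration, and it assumes the index path is simply the data path plus `.tbi`. The sample paths are taken from the listing above.

```python
def orphaned_indexes(paths, index_suffix=".tbi"):
    """Return index paths whose data file is not in the same directory.

    Assumes (for illustration) that an index is named <data file> + suffix.
    """
    present = set(paths)
    orphans = []
    for path in paths:
        if path.endswith(index_suffix):
            # Strip the suffix to get the path the data file should have.
            data_file = path[: -len(index_suffix)]
            if data_file not in present:
                orphans.append(path)
    return sorted(orphans)


root = "/mnt/miniwdl_task_container/work/_miniwdl_inputs"
listing = [
    f"{root}/11/scatter-0.bed.g.vcf.gz",
    f"{root}/13/scatter-0.bed.g.vcf.gz.tbi",
    f"{root}/0/scatter-0.bed.g.vcf.gz",
    f"{root}/0/scatter-0.bed.g.vcf.gz.tbi",
]
print(orphaned_indexes(listing))
```

For this subset of the listing, only the index under `13/` is reported, since its VCF was placed under `11/`.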
/home/rhpvorderman/PycharmProjects/germline-DNA/20220808_113222_Germline/call-singleSampleCalling-1/out/vcfScatters/2/scatter-2.bed.g.vcf.gz
/home/rhpvorderman/PycharmProjects/germline-DNA/20220808_113222_Germline/call-singleSampleCalling-1/out/vcfIndexScatters/2/scatter-2.bed.g.vcf.gz.tbi
/home/rhpvorderman/PycharmProjects/germline-DNA/20220808_113222_Germline/call-singleSampleCalling-1/out/vcfScatters/0/scatter-0.bed.g.vcf.gz
/home/rhpvorderman/PycharmProjects/germline-DNA/20220808_113222_Germline/call-singleSampleCalling-1/out/vcfIndexScatters/0/scatter-0.bed.g.vcf.gz.tbi
So miniwdl localizes the files in different directories despite them originating from the same directory. This happens in the subworkflow output phase. Once broken, this localization cannot be repaired downstream and leads to the above result.
This could be fixed using structs. But structs are unwieldy and we moved away from them for that reason. (Having the referenceFasta, referenceFastaIndex, and referenceFastaDict inputs is much more understandable than having some "Reference" input that people have to look up since it is actually a struct).
The core issue here seems to be that miniwdl wants its workflow outputs from the same array to reside in the same folder. That is nice for presenting the end result, but it results in localization errors when the workflow is used as a subworkflow.
The issue is also related to all the filename collisions. If, as a workaround, you ensure all the gvcf files have distinct basenames (and matching indexes of course), then they'll all be localized into one directory. Then we can work on making the basename deconfliction mechanism look harder for common path prefixes amongst different inputs. This is what generates all the numbered subfolders under _miniwdl_inputs.
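To see whether a set of inputs would trigger the deconfliction at all, one can group the source paths by basename before submitting the run. This is a rough sketch rather than anything miniwdl provides: `colliding_basenames` is a hypothetical helper, and the two paths below are invented examples of two samples producing identically named scatter files.

```python
import posixpath
from collections import defaultdict


def colliding_basenames(paths):
    """Map each basename shared by more than one distinct path to those paths."""
    by_name = defaultdict(set)
    for path in paths:
        by_name[posixpath.basename(path)].add(path)
    return {name: sorted(ps) for name, ps in by_name.items() if len(ps) > 1}


# Hypothetical inputs: two samples whose scatters share the same file name.
inputs = [
    "/runs/sampleA/scatter-0.bed.g.vcf.gz",
    "/runs/sampleB/scatter-0.bed.g.vcf.gz",
    "/runs/sampleA/reference.fasta",
]
print(colliding_basenames(inputs))
```

An empty result means every basename is unique, which is the condition the workaround aims for (for instance by prefixing each file with its sample name).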
Huh, I didn't see that, but you are right:
2022-08-08 11:33:53.283 wdl.w:Germline.w:call-JointGenotyping.t:call-gatherGvcfs input :: name: "gvcfFilesIndex", value: ["/mnt/miniwdl_task_container/work/_miniwdl_inputs/13/scatter-0.bed.g.vcf.gz.tbi", "/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/scatter-1.bed.g.vcf.gz.tbi", "/mnt/miniwdl_task_container/work/_miniwdl_inputs/9/scatter-2.bed.g.vcf.gz.tbi", "/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/scatter-0.bed.g.vcf.gz.tbi", "/mnt/miniwdl_task_container/work/_miniwdl_inputs/15/scatter-1.bed.g.vcf.gz.tbi", "/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/scatter-2.bed.g.vcf.gz.tbi"]
2022-08-08 11:33:53.284 wdl.w:Germline.w:call-JointGenotyping.t:call-gatherGvcfs input :: name: "gvcfFiles", value: ["/mnt/miniwdl_task_container/work/_miniwdl_inputs/11/scatter-0.bed.g.vcf.gz", "/mnt/miniwdl_task_container/work/_miniwdl_inputs/12/scatter-1.bed.g.vcf.gz", "/mnt/miniwdl_task_container/work/_miniwdl_inputs/14/scatter-2.bed.g.vcf.gz", "/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/scatter-0.bed.g.vcf.gz", "/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/scatter-1.bed.g.vcf.gz", "/mnt/miniwdl_task_container/work/_miniwdl_inputs/0/scatter-2.bed.g.vcf.gz"]
That shouldn't happen. Okay I will debug further to see why the inputs are duplicated. That is very odd.
Solved. So sorry for bothering you with this, turned out to be our own mistake. There was indeed duplication and a simple prefixing of the sample name solved the issue. Another cromwellism bites the dust! Now I just need to take care of #582 and the whole workflow can run from beginning to end. Thanks again!
Actually I think this is a valid issue that miniwdl could handle better (as a lower-priority corner case). There is nothing in WDL requiring the files to have distinct basenames, even if most would agree that's a good idea. In the spec we have these two rules:
- Two input files with the same name must be located separately, to avoid name collision.
- Two input files that originated in the same storage directory must also be localized into the same directory for task execution.
I didn't previously think carefully about what to do when both of these rules apply, as is the case with the index files here.
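One possible way to honor both rules at once is to treat each source directory as an indivisible group and give a group its own numbered subfolder whenever any of its basenames is already taken. The sketch below only illustrates that idea; it is not miniwdl's actual algorithm, and the function name `localize`, the `/inputs` root, and the example paths are all assumptions made for the example.

```python
import posixpath


def localize(paths, root="/inputs"):
    """Assign container paths so that files from one source directory stay
    together and no two files in a container directory share a basename."""
    # Rule 2: group the inputs by the directory they originated in.
    groups = {}
    for path in paths:
        groups.setdefault(posixpath.dirname(path), []).append(path)

    used_names = []  # used_names[i] = basenames already placed in subfolder i
    mapping = {}
    for members in groups.values():
        names = {posixpath.basename(m) for m in members}
        # Rule 1: reuse the first subfolder with no clashing basename,
        # otherwise open a new one for the whole group.
        for idx, used in enumerate(used_names):
            if used.isdisjoint(names):
                break
        else:
            used_names.append(set())
            idx = len(used_names) - 1
            used = used_names[idx]
        used |= names
        for m in members:
            mapping[m] = posixpath.join(root, str(idx), posixpath.basename(m))
    return mapping


# Hypothetical inputs: each sample keeps its VCF and index in one directory.
example = [
    "/runs/sampleA/scatter-0.g.vcf.gz",
    "/runs/sampleA/scatter-0.g.vcf.gz.tbi",
    "/runs/sampleB/scatter-0.g.vcf.gz",
    "/runs/sampleB/scatter-0.g.vcf.gz.tbi",
    "/ref/reference.fasta",
    "/ref/reference.fasta.fai",
]
for src, dst in localize(example).items():
    print(src, "->", dst)
```

Note that this only helps when the data file and its index still share a source directory. In the debug log above they come from separate `vcfScatters` and `vcfIndexScatters` directories, so a real fix would need some additional pairing heuristic.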
I didn't realize I was able to rely on the spec here. Thanks. BioWDL workflows are extensively modularized so we can reuse quite a bit of code. As a result we tend to run into limitations of the spec and execution engines more often. When we started with Cromwell in 2018 we also found a lot of bugs and incomplete implementations there. So finding some things here in miniwdl was not unexpected. It's still quite impressive how quickly these issues are solved!
EDIT: No pressure, the workaround was adding one word and a few symbols so we can adapt our workflows easily.