modules
modules copied to clipboard
BWA_MEM and BWAMEM2_MEM pass too many indexs
Description of the bug
When using these modules with aws-igenomes the mem fails, because the below lines pick up multiple indexes
INDEX=`find -L ./ -name "*.amb" | sed 's/.amb//'`
This line was introduced by @drpatelh
Some more background
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/genome.fa.amb
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/genome.fa.ann
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/genome.fa.bwt
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/genome.fa.pac
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/genome.fa.sa
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.amb
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.ann
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.bwt
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.pac
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.rbwt
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.rpac
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.rsa
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.sa
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa.amb
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa.ann
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa.bwt
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa.pac
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa.sa
Note the version0.6.0 and version0.5.x
Steps to reproduce
Steps to reproduce the behaviour:
I don't really have good steps to reproduce inside of modules right now.
The way they do it in the GATK4 pipelines is to pass the reference.fasta file name as a string. We could pass the value as an additional entry, or have that in the meta map.
cf: https://github.com/gatk-workflows/gatk4-data-processing/blob/944403438198d9981adc4d244a3e8b70c4bf774a/processing-for-variant-discovery-gatk4.wdl#L319
Ok. Deleted the last comment because it was getting messy but I see the problem now...unless we default to find the first instance of the index in that structure? Is that possible? Realistically, nesting indices like that is the actual problem here and should be avoided in my opinion...is this only a problem for some genomes?
I'm sure we can hack the find line., but wouldn't it be easier and less hacky to have an extra value?
We could but how would having an extra fasta file being staged here help? We would still have the same issues with paths no? I have to say the biggest advantage of passing around and using find is that you can provide the entire index in a directory (tarred or untarred) and the pipeline will deal with it appropriately without having to worry about genome fasta prefixes? e.g. here
don't want to stage an extra fasta file, just the name
Yup, but that name would still be found 3 times in the example structure above?
No, because I would only pass BWAIndex/genome.fa.
To be honest I would only pass genome.fa, and use {} to cover all extension from the indices... but that's another debate...
Ok. Then I think we will still need some sort of hack in order to use tarred directories containing indices? That is where the find is the most useful because it finds the appropriate files for you. I agree we need to do something about this though.
I'm guessing the UNTAR_STAR_INDEX process (or any similar) could give out the indices, that we can collect to use as only one channel, and the value for the genome.fa or whatever string it should be.
So it could work
I think it could be handled in the prepare genome workflow. We could also just throw a cut in there to take the first genome it finds.
What do you think of something like that: https://github.com/MaxUlysse/nf-core_sarek/commit/35956ab3be9e09fea2142f7428a03ce8b4d30198
I think that could work.
To be honest though I don't know why the INDEX was introduced, so if someone could link that discussion I might be more help.
I think that could work.
To be honest though I don't know why the
INDEXwas introduced, so if someone could link that discussion I might be more help.
I think it was @drpatelh
Yep, I did it without permission because it made sense at the time and it solved quite a few problems with passing the index around and standardising the use of compressed directories too. To be fair, the only edge case is the one you pointed out @Emiller88 where you can have different indices nested in the same directory - I suspect this scenario is going to be extremely rare and if so then using the index in the top-level directory is a good shout or the user just re-organises their indices 🤷🏽
As we have discussed at length already, passing the fasta file in addition to the index is not a satisfactory solution because users can also call their custom indices whatever they want which means using the base name of the fasta isn't sufficient. This is why using a find made sense to deal with whatever people chose to call the index files.
I will have another play with this whilst try to re-prep the rnaseq pipeline for a release.
I think we can have a bwaindex.first.basename() or something like that to get the actual fasta name, and add that to the map in case of user specified indices
Ok. So if you are able to get a working prototype together where we don't have to pass the fasta file to the alignment process and where we can use .tar.gz directories for the index then that would be great @MaxUlysse
I'll look into it.
OK, so I have https://github.com/nf-core/modules/pull/1222
I think it should work with the .tar.gz extraction if we change the output from "dir" to "dir/*" as I did in the index processes
cc #627