modules icon indicating copy to clipboard operation
modules copied to clipboard

BWA_MEM and BWAMEM2_MEM pass too many indexs

Open edmundmiller opened this issue 4 years ago • 19 comments

Description of the bug

When using these modules with aws-igenomes the mem fails, because the below lines pick up multiple indexes

INDEX=`find -L ./ -name "*.amb" | sed 's/.amb//'`

This line was introduced by @drpatelh

Some more background

s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/genome.fa.amb
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/genome.fa.ann
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/genome.fa.bwt
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/genome.fa.pac
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/genome.fa.sa
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.amb
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.ann
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.bwt
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.pac
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.rbwt
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.rpac
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.rsa
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.sa
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa.amb
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa.ann
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa.bwt
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa.pac
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa.sa

Note the version0.6.0 and version0.5.x

Steps to reproduce

Steps to reproduce the behaviour:

I don't really have good steps to reproduce inside of modules right now.

edmundmiller avatar Mar 08 '21 17:03 edmundmiller

The way they do it in the GATK4 pipelines is to pass the reference.fasta file name as a string. We could pass the value as an additional entry, or have that in the meta map.

cf: https://github.com/gatk-workflows/gatk4-data-processing/blob/944403438198d9981adc4d244a3e8b70c4bf774a/processing-for-variant-discovery-gatk4.wdl#L319

maxulysse avatar Mar 09 '21 08:03 maxulysse

Ok. Deleted the last comment because it was getting messy but I see the problem now...unless we default to find the first instance of the index in that structure? Is that possible? Realistically, nesting indices like that is the actual problem here and should be avoided in my opinion...is this only a problem for some genomes?

drpatelh avatar Mar 09 '21 09:03 drpatelh

I'm sure we can hack the find line., but wouldn't it be easier and less hacky to have an extra value?

maxulysse avatar Mar 09 '21 09:03 maxulysse

We could but how would having an extra fasta file being staged here help? We would still have the same issues with paths no? I have to say the biggest advantage of passing around and using find is that you can provide the entire index in a directory (tarred or untarred) and the pipeline will deal with it appropriately without having to worry about genome fasta prefixes? e.g. here

drpatelh avatar Mar 09 '21 10:03 drpatelh

don't want to stage an extra fasta file, just the name

maxulysse avatar Mar 09 '21 10:03 maxulysse

Yup, but that name would still be found 3 times in the example structure above?

drpatelh avatar Mar 09 '21 10:03 drpatelh

No, because I would only pass BWAIndex/genome.fa.

To be honest I would only pass genome.fa, and use {} to cover all extension from the indices... but that's another debate...

maxulysse avatar Mar 09 '21 10:03 maxulysse

Ok. Then I think we will still need some sort of hack in order to use tarred directories containing indices? That is where the find is the most useful because it finds the appropriate files for you. I agree we need to do something about this though.

drpatelh avatar Mar 09 '21 11:03 drpatelh

I'm guessing the UNTAR_STAR_INDEX process (or any similar) could give out the indices, that we can collect to use as only one channel, and the value for the genome.fa or whatever string it should be. So it could work

maxulysse avatar Mar 09 '21 12:03 maxulysse

I think it could be handled in the prepare genome workflow. We could also just throw a cut in there to take the first genome it finds.

edmundmiller avatar Mar 09 '21 14:03 edmundmiller

What do you think of something like that: https://github.com/MaxUlysse/nf-core_sarek/commit/35956ab3be9e09fea2142f7428a03ce8b4d30198

maxulysse avatar Mar 15 '21 16:03 maxulysse

I think that could work.

To be honest though I don't know why the INDEX was introduced, so if someone could link that discussion I might be more help.

edmundmiller avatar Mar 16 '21 20:03 edmundmiller

I think that could work.

To be honest though I don't know why the INDEX was introduced, so if someone could link that discussion I might be more help.

I think it was @drpatelh

maxulysse avatar Mar 17 '21 09:03 maxulysse

Yep, I did it without permission because it made sense at the time and it solved quite a few problems with passing the index around and standardising the use of compressed directories too. To be fair, the only edge case is the one you pointed out @Emiller88 where you can have different indices nested in the same directory - I suspect this scenario is going to be extremely rare and if so then using the index in the top-level directory is a good shout or the user just re-organises their indices 🤷🏽

As we have discussed at length already, passing the fasta file in addition to the index is not a satisfactory solution because users can also call their custom indices whatever they want which means using the base name of the fasta isn't sufficient. This is why using a find made sense to deal with whatever people chose to call the index files.

I will have another play with this whilst try to re-prep the rnaseq pipeline for a release.

drpatelh avatar Mar 17 '21 09:03 drpatelh

I think we can have a bwaindex.first.basename() or something like that to get the actual fasta name, and add that to the map in case of user specified indices

maxulysse avatar Mar 17 '21 09:03 maxulysse

Ok. So if you are able to get a working prototype together where we don't have to pass the fasta file to the alignment process and where we can use .tar.gz directories for the index then that would be great @MaxUlysse

drpatelh avatar Mar 17 '21 09:03 drpatelh

I'll look into it.

maxulysse avatar Mar 17 '21 10:03 maxulysse

OK, so I have https://github.com/nf-core/modules/pull/1222 I think it should work with the .tar.gz extraction if we change the output from "dir" to "dir/*" as I did in the index processes

maxulysse avatar Jan 20 '22 15:01 maxulysse

cc #627

maxulysse avatar Apr 19 '22 14:04 maxulysse