eager
eager copied to clipboard
Allow pipeline to map to multiple references in one command
Is your feature request related to a problem? Please describe
Currently have to run the eager pipeline twice if I want to map the same data to the human reference genome and then to the rCRS mitochondrial reference genome.
Describe the solution you'd like
The ability to specify Human ref genome and mito ref genome, and run both mappings in the same command. (Not competitive mapping, just simultaneously).
Describe alternatives you've considered
Currently, I'm just running eager twice, but no other parameters or input information change so this is redundant.
Something similar has in fact been asked a VERY long time ago: https://github.com/nf-core/eager/issues/21
(issue 21... holy crap...)
So definitely I something I will consider for eager3, but this will require a bit of re-configuring (however I've now got experience on this now in a different pipeline so should be possible :+1: ).
Out of curiousity, why would you not want to do competitve mapping in this case? Wouldn't you have a risk of mis-calling variants due to NUMTs?
Actually I didn't really consider competitive mapping, although it may also work well but I haven't tried it myself. At this stage I'm simply performing two mappings and being stringent with variant calling to compensate for potential NUMTs.
Well, the full human reference genome by default includes the chrMT/MT entry, so theoretically you are already mapping to both in a competitative mapping fashion already.
So I guess my question is why:
At this stage I'm simply performing two mappings and being stringent with variant calling to compensate for potential NUMTs.
is not sufficient? What is the benefit of mapping separately (to beclear, probably nothing wrong with the approach, I don't work with eukaryotes myself [anymore])
What we see when mapping to the whole reference genome is that for whatever reason, bwa preferentially maps reads to NUMTs instead of the MT, so there are always gaps at certain sites in the mitochondrial genome coverage, no matter the sequencing depth, so it's become common practice just to map separately to rCRS mito reference separately.
Huh, ok good to know!
I will think about this indeed then when converting the pipeline to DSL2 then!
Great, thanks!! :)
Dev note: I think we will have to have the option of CSV and/or a straight fasta. The CSV would have to have paths to optional indices
The other problem with competitive mapping comes later when you want to do some variant calling: because multi mapping read get a very low MAPQ score, they are typically ignored by genotypers. That's part of the reason why I've implemented https://github.com/maxibor/adnamap
On Sat, Sep 10, 2022, 20:12 James A. Fellows Yates @.***> wrote:
Dev note: I think we will have to have the option of CSV and/or a straight fasta. The CSV would have to have paths to optional indices
— Reply to this email directly, view it on GitHub https://github.com/nf-core/eager/issues/878#issuecomment-1242779513, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACB3ENI27AOF3ZCWY4HEJ5TV5TFPJANCNFSM5TFJAMDA . You are receiving this because you are subscribed to this thread.Message ID: @.***>
that's also something I partly considered: the competitive mapping is used for very specific contexts (e.g. strain discrimination during pathogen screening), and like you said sort of breaks some assumptions with standard aligners.
It an still be done if you manually combine your FASTAs and use e.g. bedtools with the FASTA headers of each genome as the 'feature' to extract read counts, which hsould be sufficient.
That's why I prefer just allowing independent mapping as implemented in adnamap
Added for eager3 in https://github.com/nf-core/eager/pull/952