eager icon indicating copy to clipboard operation
eager copied to clipboard

Allow pipeline to map to multiple references in one command

Open roberta-davidson opened this issue 2 years ago • 9 comments

Is your feature request related to a problem? Please describe

Currently have to run the eager pipeline twice if I want to map the same data to the human reference genome and then to the rCRS mitochondrial reference genome.

Describe the solution you'd like

The ability to specify Human ref genome and mito ref genome, and run both mappings in the same command. (Not competitive mapping, just simultaneously).

Describe alternatives you've considered

Currently, I'm just running eager twice, but no other parameters or input information change so this is redundant.

roberta-davidson avatar Apr 12 '22 04:04 roberta-davidson

Something similar has in fact been asked a VERY long time ago: https://github.com/nf-core/eager/issues/21

(issue 21... holy crap...)

So definitely I something I will consider for eager3, but this will require a bit of re-configuring (however I've now got experience on this now in a different pipeline so should be possible :+1: ).

Out of curiousity, why would you not want to do competitve mapping in this case? Wouldn't you have a risk of mis-calling variants due to NUMTs?

jfy133 avatar Apr 12 '22 07:04 jfy133

Actually I didn't really consider competitive mapping, although it may also work well but I haven't tried it myself. At this stage I'm simply performing two mappings and being stringent with variant calling to compensate for potential NUMTs.

roberta-davidson avatar Apr 13 '22 02:04 roberta-davidson

Well, the full human reference genome by default includes the chrMT/MT entry, so theoretically you are already mapping to both in a competitative mapping fashion already.

So I guess my question is why:

At this stage I'm simply performing two mappings and being stringent with variant calling to compensate for potential NUMTs.

is not sufficient? What is the benefit of mapping separately (to beclear, probably nothing wrong with the approach, I don't work with eukaryotes myself [anymore])

jfy133 avatar Apr 13 '22 05:04 jfy133

What we see when mapping to the whole reference genome is that for whatever reason, bwa preferentially maps reads to NUMTs instead of the MT, so there are always gaps at certain sites in the mitochondrial genome coverage, no matter the sequencing depth, so it's become common practice just to map separately to rCRS mito reference separately.

roberta-davidson avatar Apr 13 '22 06:04 roberta-davidson

Huh, ok good to know!

I will think about this indeed then when converting the pipeline to DSL2 then!

jfy133 avatar Apr 13 '22 06:04 jfy133

Great, thanks!! :)

roberta-davidson avatar Apr 13 '22 06:04 roberta-davidson

Dev note: I think we will have to have the option of CSV and/or a straight fasta. The CSV would have to have paths to optional indices

jfy133 avatar Sep 10 '22 18:09 jfy133

The other problem with competitive mapping comes later when you want to do some variant calling: because multi mapping read get a very low MAPQ score, they are typically ignored by genotypers. That's part of the reason why I've implemented https://github.com/maxibor/adnamap

On Sat, Sep 10, 2022, 20:12 James A. Fellows Yates @.***> wrote:

Dev note: I think we will have to have the option of CSV and/or a straight fasta. The CSV would have to have paths to optional indices

— Reply to this email directly, view it on GitHub https://github.com/nf-core/eager/issues/878#issuecomment-1242779513, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACB3ENI27AOF3ZCWY4HEJ5TV5TFPJANCNFSM5TFJAMDA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

maxibor avatar Sep 11 '22 18:09 maxibor

that's also something I partly considered: the competitive mapping is used for very specific contexts (e.g. strain discrimination during pathogen screening), and like you said sort of breaks some assumptions with standard aligners.

It an still be done if you manually combine your FASTAs and use e.g. bedtools with the FASTA headers of each genome as the 'feature' to extract read counts, which hsould be sufficient.

That's why I prefer just allowing independent mapping as implemented in adnamap

jfy133 avatar Sep 12 '22 06:09 jfy133

Added for eager3 in https://github.com/nf-core/eager/pull/952

jfy133 avatar May 26 '23 08:05 jfy133