salmon
Gene fusions
Wicked fast indeed! Are there any plans to extend salmon to also detect gene fusion events? There isn't a fast and accurate way to do that yet, only approaches requiring full alignments. Most often a base-perfect breakpoint isn't required; an estimate within a hash length is fine. We are heavy users of bcbio and are also running the full STAR alignment just for gene fusions, which really sucks. Any ideas would be much appreciated.
Hi @schelhorn,
Yes; we are actively looking at fusion prediction based on quasi-mapping. The initial results are promising, but we're still working on improving and refining the method. I'll be sure to let you know when we have something that is ready to test :).
Best, Rob
Excellent. May I point out that tools such as Oncofuse https://github.com/mikessh/oncofuse/ and Pegasus https://github.com/RabadanLab/Pegasus have a particular, additional value since they provide functional annotation of fusion events identified by other approaches? Also, these resources may prove helpful wrt validation data: https://github.com/chapmanb/bcbio-nextgen/issues/210 and http://m.genome.cshlp.org/content/early/2015/11/10/gr.186114.114 Adding @roryk here for highlighting this feature request in bcbio.
Awesome; thanks for the pointers! We'll definitely take a look at these.
Hello @rob-p, may I ask whether there is any news concerning gene fusion detection in Salmon?
Hi @schelhorn,
Yes, we have built a pipeline atop salmon and quasi-mapping. At this point, what we see is that it is very fast with high sensitivity. Our main focus has been on improving the specificity, which is currently better than some, but not all, methods. I realize, of course, that false positives are a very difficult (and key) problem in this domain, so I'd really like to make sure they are well-handled.
Great; would you like help testing the pipeline, and integrating it into bcbio? We could help with both :)
Also, do you know if the Salmon pseudo-BAM is suitable for fusion calling by standard (alignment-based) fusion calling tools, i.e., does the BAM include information on mate pairs mapped across transcripts, or reads spanning breakpoints?
Hi @schelhorn,
Sorry for the uncharacteristically slow response on this. We're going full steam ahead for the RECOMB deadline, so I've been less responsive than usual. Anyway, I've invited you to the repository for the fusion project (it's currently private). Feel free to poke around, but it's probably not useful until we can send you a short writeup describing the current pipeline (since things are still very "alpha"). Regarding calling fusions from the SAM output of Salmon, one can't do this directly because there are, by default, no encompassing reads (i.e. individual reads split between transcripts) and, to improve abundance estimation, salmon is conservative with its use of spanning reads. However, we can get at this information from quasi-mapping, so I can definitely consider adding some flags to provide this info (this is the type of thing we output in the fusion pipeline currently, and then we have to postprocess it).
Excellent; thank you. We'll have a look and see what we can contribute.
Hello @rob-p, could you please invite @tetianakh to the repo as well? She'll do the development on our end. Thanks!
Hi @schelhorn,
Sure, I'll add her now. We'll send you a small write-up about the state of the codebase and how to run the current pipeline next week (once my student is back from the current CSHL meeting with all of the cool kids ;P).
Sweet!
Hi Rob,
Could I get in on this? We have a couple of projects needing to call fusions on a large number of samples, and it would be great to have something speedy to iterate on.
FYI, I also asked in the kallisto project: https://github.com/pachterlab/kallisto/issues/122
Hi @rob-p, I haven't received an invitation to the private repo. Could you please invite me? Thanks!
Hi @tetianakh, I've re-sent the invitation. If you don't get it, please send me an e-mail, and I'll reply with the link to join directly.
Thanks, I've received it now.
Great :). I'll have @hiraksarkar write up a brief overview of the current state of the codebase (including which branch contains the latest stuff) this week. We can either share that information in the issues over at that repo, or we can e-mail you the write-up @schelhorn, @tetianakh and @roryk. Let me know if one method is preferable to the other.
Great; directly in the repo is preferred.
This sounds cool. Have you looked at submitting your method for the DREAM RNA-Seq analysis challenge ( https://synapse.org/SMC_RNA ) ?
And any status updates? I'd be interested to test drive a quasi-mapping-based fusion caller!
One fast way using pseudo-alignments should be Kallisto+[Manta|Pizzly], but I haven't tried that myself. We decided to go with full transcriptome alignments instead and integrated EricScript into bcbio. We'd still be interested in something more modern, though.
If you have a downstream fusion pipeline that uses transcriptome mappings, you have been able to get those from the -z=<output.sam> option for a while. The real challenge is how to properly control the false-positive rate. That's the main thing special-purpose downstream software must solve.
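To make the "downstream pipeline on transcriptome mappings" idea concrete, here is a minimal sketch (not part of salmon or of the fusion pipeline discussed in this thread) that scans SAM-format text for read pairs whose mates map to transcripts of different genes. The function name `discordant_gene_pairs`, the `tx2gene` mapping, and the toy records are all hypothetical. Whether a given SAM file contains such pairs at all depends on how it was produced, as noted earlier in the thread, and raw counts like these do nothing to control false positives.

```python
from collections import Counter


def discordant_gene_pairs(sam_lines, tx2gene):
    """Count read pairs whose two mates map to transcripts of different genes.

    sam_lines: iterable of SAM text lines (header lines are skipped).
    tx2gene:   dict mapping transcript name -> gene name (assumed to be
               supplied by the user, e.g. derived from an annotation).
    """
    seen = set()          # (read name, gene pair) already counted
    counts = Counter()    # (gene_a, gene_b) -> number of supporting pairs
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        qname, flag, rname, rnext = fields[0], int(fields[1]), fields[2], fields[6]
        # Require a paired read with both the read and its mate mapped.
        if not flag & 0x1 or flag & 0x4 or flag & 0x8:
            continue
        # RNEXT "=" means the mate is on the same reference; "*" means unknown.
        if rname == "*" or rnext in ("=", "*"):
            continue
        gene_a, gene_b = tx2gene.get(rname), tx2gene.get(rnext)
        if gene_a is None or gene_b is None or gene_a == gene_b:
            continue  # unknown transcript, or two isoforms of one gene
        pair = tuple(sorted((gene_a, gene_b)))
        if (qname, pair) in seen:
            continue  # the mate's record reports the same pair
        seen.add((qname, pair))
        counts[pair] += 1
    return counts


# Toy input (made up for illustration only).
example_tx2gene = {"txA1": "geneA", "txA2": "geneA", "txB1": "geneB"}
example_sam = [
    "@HD\tVN:1.0",
    "r1\t65\ttxA1\t100\t255\t4M\ttxB1\t200\t0\tACGT\tIIII",
    "r1\t129\ttxB1\t200\t255\t4M\ttxA1\t100\t0\tACGT\tIIII",
    "r2\t99\ttxA1\t10\t255\t4M\t=\t150\t144\tACGT\tIIII",
    "r3\t65\ttxA1\t10\t255\t4M\ttxA2\t150\t0\tACGT\tIIII",
]
example_counts = discordant_gene_pairs(example_sam, example_tx2gene)
```

On the toy input only read pair `r1` is counted: `r2` has both mates on the same transcript, and `r3` links two transcripts of the same gene. A real caller would still need filtering on top of this (multi-mapping reads, paralogs, read-through transcripts), which is exactly the false-positive problem described above.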
Thanks for the tips; I'll experiment.
Hi @rob-p, We are working towards creating a fusion-calling pipeline based on Salmon/Pizzly. It would be helpful to see the current state of the repository and try to replicate some of the experiments we have done with it. We seem to be hitting good specificity but falling a bit short on sensitivity. Thanks, Prateek
Hello @rob-p! I was wondering if there have been any updates on the fusion/detection of spanning reads problem. I'm about to embark on a project to process many bacterial transcriptomes from many different genomes/species and plan to use salmon. I would love to be able to detect polycistronic transcripts through the identification of spanning reads.