Merging bams for including novel splice sites makes a really big bam :(

Open aleighbrown opened this issue 4 years ago • 11 comments

Hi,

I have about 30 samples which I want to do a differential splicing analysis on.

The instructions for including unannotated splice sites suggest that I first merge all my bams into a single bam, and then pass that bam to the index call. However, if I merge all my aligned bams, we're talking about a bam of roughly 200G, which is just going to be deleted eventually anyhow.

Is there a fast way around this problem you could recommend, e.g. like build multiple indexes on each bam or something hacky like this?

If the bam files are being used just to find unannotated splice junctions, could one use the SJ.out.tab files output by STAR, for example, instead of the bam files themselves?

Or could one use these SJ.out.tab files to construct a fake gtf that includes the novel splice sites?
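As a rough illustration of the SJ.out.tab idea, here is a minimal Python sketch that pools unannotated junctions across several STAR SJ.out.tab files without ever merging bams. It relies on STAR's documented column layout (chromosome, intron start, intron end, strand, motif, annotated flag, uniquely mapped reads, ...). The function name `pool_novel_junctions` and the `min_total_unique` cutoff are hypothetical, and nothing here feeds the result into Whippet, which does not accept this input as far as this thread indicates.

```python
from collections import defaultdict


def pool_novel_junctions(sj_paths, min_total_unique=10):
    """Pool unannotated junctions from several STAR SJ.out.tab files.

    Returns {(chrom, start, end, strand): pooled uniquely mapped reads}
    for junctions whose pooled count reaches min_total_unique
    (a hypothetical cutoff chosen for illustration).
    """
    totals = defaultdict(int)
    for path in sj_paths:
        with open(path) as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 7:
                    continue  # skip blank or malformed lines
                chrom, start, end, strand, _motif, annotated, unique = fields[:7]
                if annotated != "0":
                    continue  # column 6: 0 means absent from the annotation given to STAR
                totals[(chrom, int(start), int(end), strand)] += int(unique)
    return {junction: n for junction, n in totals.items() if n >= min_total_unique}
```

Since each SJ.out.tab file is small compared with its bam, pooling 30 of them this way should be cheap; the open question from the thread is how to get the pooled junctions into an index.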

aleighbrown avatar Aug 12 '19 14:08 aleighbrown

Hi @aleighbrown, yes, this is a caveat of the current system. In theory, Whippet could probably use the SJ.out.tab file in addition to a GTF file (which is required because it has txStart and txEnd positions and also the known full isoforms) to add the novel splice sites, but this would take a bit of implementation...

timbitz avatar Aug 22 '19 18:08 timbitz

Just to quickly add to this-- with such a big bam file you're going to want to increase the --bam-min-reads flag to the indexer to something more appropriate (the default is 1!), otherwise the index created is going to contain one-off cryptic splice sites and perhaps even alignment errors as well.
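For illustration, a small Python sketch of an index call with a stricter read cutoff. Only `--bam-min-reads` is confirmed by this thread; the script path `bin/whippet-index.jl`, the `--fasta`, `--gtf`, and `--bam` flags, the file names, and the cutoff of 10 are assumptions that should be checked against the indexer's `--help` output before use.

```python
def whippet_index_command(fasta, gtf, bam, bam_min_reads=10,
                          script="bin/whippet-index.jl"):
    """Build the argument list for an index run with a stricter read cutoff.

    The script path and every flag except --bam-min-reads are assumptions;
    the default of 10 is illustrative, not a recommendation.
    """
    if bam_min_reads < 1:
        raise ValueError("bam_min_reads must be at least 1")
    return [
        "julia", script,
        "--fasta", fasta,
        "--gtf", gtf,
        "--bam", bam,
        "--bam-min-reads", str(bam_min_reads),
    ]


# Print the command for inspection rather than running it.
print(" ".join(whippet_index_command("genome.fa.gz", "annotation.gtf.gz", "merged.bam")))
```

The returned list can be handed to `subprocess.run` once the flags have been verified.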

timbitz avatar Aug 22 '19 18:08 timbitz

The other possibility here would be using Cufflinks or Stringtie to make a merged gtf that includes the novel transcripts found in the bams, and then building a Whippet index off of that. Do you have any thoughts on whether that would work?

aleighbrown avatar Aug 23 '19 05:08 aleighbrown

I don't know-- That would be crossing into uncharted territory. I suppose you could try and test it by simulating reads from a GTF file that is downsampled by removing exons or alternative splice sites?
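One way to sketch the "downsampled GTF" half of that test in Python: hold out one internal exon from a random subset of multi-exon transcripts, then check whether the held-out exons are recovered after building from the reduced annotation. The function name, the `fraction` and `seed` parameters, and the choice to remove only internal exons are all assumptions made for illustration; read simulation itself is not covered.

```python
import random
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def hold_out_internal_exons(gtf_lines, fraction=0.1, seed=0):
    """Drop one internal exon from a random subset of multi-exon transcripts.

    Returns (kept_lines, removed_lines). Only 'exon' records are touched;
    CDS or UTR records overlapping a removed exon are left in place, so a
    real test would likely need to handle those as well.
    """
    rng = random.Random(seed)
    exons = defaultdict(list)  # transcript_id -> [(start, line_index)]
    for i, line in enumerate(gtf_lines):
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            exons[match.group(1)].append((int(fields[3]), i))

    drop = set()
    for tx in sorted(exons):
        records = sorted(exons[tx])  # order exons by genomic start
        # Only transcripts with at least 3 exons have an internal exon to remove.
        if len(records) >= 3 and rng.random() < fraction:
            _start, index = rng.choice(records[1:-1])
            drop.add(index)

    kept = [line for i, line in enumerate(gtf_lines) if i not in drop]
    removed = [gtf_lines[i] for i in sorted(drop)]
    return kept, removed
```

The `removed` list serves as the truth set: reads simulated from the full GTF should support those exons, and the question is whether an index built from `kept` plus the novel-transcript annotation recovers them.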

timbitz avatar Aug 29 '19 18:08 timbitz

Hi @aleighbrown and @timbitz, I recently need to find novel junctions in almost 30,000 samples. Do you have any good advice for me, please?

itszhengan avatar Nov 11 '19 14:11 itszhengan

@itszhengan, currently Whippet is not really designed to compare de novo splicing quantifications across large cohorts of samples like 30K. And if you were going to do that, I wouldn't build a single comprehensive index of all de novo splicing from a single merged bam file-- I would probably build one for each sample, and then compare the node structures or de novo junctions across the samples somehow (but this is not completely straightforward as-is with overlapping nodes of CE/RI type, for example).

The only purpose of having a single index (as opposed to many) is to perform quantitative differential splicing analysis between two sets of samples-- but if the goal is only to identify/study de novo splicing, then I don't see how comparing splicing quantifications between two sets of samples, where one has an alternative splicing event and the other does not, really makes sense-- a qualitative comparison seems sufficient. The example in the documentation is more for analyzing various healthy tissues (or poorly annotated species), where the annotation is lacking and one would want to build a single comprehensive index to enable better quantitative comparisons.
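A minimal Python sketch of what such a qualitative comparison could look like, assuming each sample's de novo junctions have already been extracted into a set of `(chrom, start, end, strand)` tuples. The function names and the input structure are hypothetical, and the sketch does not address the overlapping CE/RI node problem mentioned above.

```python
from collections import Counter


def junction_recurrence(junctions_by_sample):
    """Count in how many samples each junction was detected.

    junctions_by_sample: {sample_name: iterable of (chrom, start, end, strand)}
    """
    counts = Counter()
    for junctions in junctions_by_sample.values():
        counts.update(set(junctions))  # set() so each sample counts at most once
    return counts


def private_to_group(junctions_by_sample, group_a, group_b):
    """Junctions seen in at least one sample of group_a and in none of group_b."""
    seen_a = set().union(*(set(junctions_by_sample[s]) for s in group_a))
    seen_b = set().union(*(set(junctions_by_sample[s]) for s in group_b))
    return seen_a - seen_b
```

Recurrence counts give a simple presence/absence view across a cohort, and `private_to_group` lists junctions found in one set of samples but not the other, with no quantification involved.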

I am still planning to make additions to Whippet specifically to analyze de novo splicing patterns across large cohorts of samples... but I haven't started yet, and am not sure when this will be available after I do.

timbitz avatar Nov 12 '19 22:11 timbitz

@timbitz Thank you for your reply.

itszhengan avatar Nov 13 '19 17:11 itszhengan

For what it's worth, I've had some success using MAJIQ on our data set, but that's only 30+ samples, with the caveat that its outputs are more difficult to interpret than Whippet's... you could also try http://yeolab.github.io/outrigger/

aleighbrown avatar Nov 15 '19 11:11 aleighbrown

Thank you for this information!

Zheng An Administrative Assistant China-Japan Union Hospital of Jilin University

itszhengan avatar Nov 18 '19 01:11 itszhengan

> For what it's worth, I've had some success using MAJIQ on our data set, but that's only 30 + samples, caveat that their outputs are more difficult to interpret than Whippets... you could also try http://yeolab.github.io/outrigger/

Hi @aleighbrown, have you tried using Cufflinks or Stringtie to make a merged gtf that includes the novel transcripts found in the bams? I saw a gtf-rebuilding step similar to what you described in this paper from 5 years ago (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-015-0168-9), but I still don't know the justification or the potential impact. Several annotation-free tools, such as leafcutter, have been proposed because of the lack of a reference annotation file. However, those are not based on AS events or don't consider intron retention, so I have to use the traditional annotation-based approach for AS event analysis. It seems that Whippet is the only recent method that fits my needs. Do you have any suggestions?

itszhengan avatar Nov 26 '19 17:11 itszhengan

Might want to stop clogging this GitHub issue on this point since it's not Whippet's issue per se; feel free to email or Twitter DM me about it.

aleighbrown avatar Nov 26 '19 17:11 aleighbrown