Whippet.jl icon indicating copy to clipboard operation
Whippet.jl copied to clipboard

Splice junctions strandedness

Open PeterVenhuizen opened this issue 3 years ago • 1 comments

I've run into an issue with the likely misclassification of the strandedness of certain splice junctions. I'm working with a strand-specific RNA-Seq library for A. thaliana and generated a BAM supplemented Whippet index using the STAR alignments of said library.

A. thaliana has various cases of overlapping genes on opposing strands, as can be seen in the IGV screenshot below. For this particular case, the STAR alignment identified what I presume to be the correct junctions for the AT4G36050 (on the - strand) and the AT4G36052 (on the + strand) genes. When I look for the specific junction (chr4 17052585 17052653) in the STAR SJ.out.tab, STAR reports it as a junction on the positive strand (which seems to be correct, seeing as it corresponds to the AT4G36052 gene on the plus strand). However, the same junction in the Whippet .jnc.gz output file is reported to be on the minus strand. Subsequently, an RI event using this junction is annotated by Whippet for AT4G36050 and for these particular samples gives a very strong dPSI of >0.6. However, I do believe this is an error, given the origin of the splice junction.

whippet_strandedness_issue

I would like to know how Whippet handles overlapping genes when building the (supplemented) index and when quantifying. Is there any way for me to parse out those events, or potentially fix their quantification?

I've used julia version 0.6.4 and whippet v0.11.1 for my analysis. The --bam-both-novel flag was used when generating the index.

Thanks

PeterVenhuizen avatar Jul 06 '20 08:07 PeterVenhuizen

Hi @PeterVenhuizen, if you look in src/bam.jl you'll see it's a pretty simple bit of code that's filtering reads for each gene--

isspliced( rec::BAM.Record ) = ismatch(r"N", BAM.cigar(rec))
strandpos( rec::BAM.Record ) = BAM.flag(rec) & 0x010 == 0

function process_records!( reader::BAM.Reader, seqname::String, range::UnitRange{Int64}, strand::Bool,
                           exons::CoordTree, known, oneknown::Bool,
                           novelacc::Dict{K,V}, noveldon::Dict{K,V} ) where {K,V}

...

         # if is spliced process splice sites
         if isspliced(rec) && strand == strandpos(rec)
            known = process_spliced_record!( novelacc, noveldon, rec, known, oneknown )
         end

So the strand encoded in the fifth bit of the BAM flag (0x10) has to match the strand of the gene, where 0 is + and 1 is - in both. That's not to say there isn't potentially some other bug going on here somewhere, but it would be good if you could give a minimally sized example with reproducible inputs.

On another front, my initial benchmarking (in the paper supplement) suggests that using the --bam-both-novel flag increases the FDR considerably, so I wouldn't recommend this in general. Requiring one of the splice sites to match a known splice-site in the gene is a reasonable approach I think-- also good to note here that the set of "known" splice sites in a gene expands iteratively as new splice junctions are added.

timbitz avatar Jul 24 '20 18:07 timbitz