Use --mix with or without --rf ...?
Hi!
Thanks for the great tool, and continued updates over the years.
I noticed that the manual does not mention the effect of --rf when using --mix.
Does the --rf flag only affect the interpretation of the short read alignments, or also the long read alignments?
Is the official recommendation to not use --rf when using --mix?
If --rf is not intended to be used with --mix, then I wonder if this was affecting the following issues:
https://github.com/gpertea/stringtie/issues/406
https://github.com/gpertea/stringtie/issues/420
I look forward to your response!
--John
For what its worth, I can say the following two commands gave at least slightly different results on the same datasets:
stringtie --rf -p ${THREADS} -l ${BASE} -o stranded.gtf --mix ${ILMNALNS} ${ISOALNS}
stringtie -p ${THREADS} -l ${BASE} -o unstranded.gtf --mix ${ILMNALNS} ${ISOALNS}
wc -l */*gtf
688954 stranded.gtf
687753 unstranded.gtf
for f in *gtf ; do N=$( awk '$3=="transcript"' $f | wc -l ) ; echo -e "${f}\t${N}" ; done
stranded.gtf 81688
unstranded.gtf 81299
I haven't done a deeper comparison than that yet.
I used gffread to extract transcript sequences and borf to predict protein sequences.
I figured if one way of doing it was "wrong" it would be obvious by comparing protein sequence predictions -- one would have far fewer or far shorter proteins.
But the gave similar results in this regard.
Counts - similar number of proteins:
## STRANDED
grep -c stranded.transcripts.fasta
81688
grep -c ">" stranded.transcripts.pep.fasta
70610
## UNSTRANDED
grep -c ">" unstranded.transcripts.fasta
81299
grep -c ">" unstranded.transcripts.pep.fasta
70672
Lengths - very similar length distributions:
## STAT STRANDED UNSTRANDE
N 70610 70672
Sum 31033013 31400637
Max 16313 16313
Min 50 50
Avg 439.5 444.3
StD 566.8 595.8
Med 300 302
MAD 196 197
Q10 71 72
Q25 127 129
Q50 300 302
Q75 562 564
Q90 915 919
N50 684 690
L50 12692 12609
E-size 1170.4 1243.3
And borf considers a similar number of the transcript/protein sequences to be "complete":
awk '$(NF)=="complete"' stranded.transcripts.txt | wc -l
59154
awk '$(NF)=="complete"' unstranded.transcripts.txt | wc -l
59408
And a similar number of the protein seqs start with Met (not an alternative AA):
awk '$(NF-2)=="M"' stranded.transcripts.txt | wc -l
66695
awk '$(NF-2)=="M"' unstranded.transcripts.txt | wc -l
66827
Overall, it seems they performed similarly, but the "unstranded" may have performed slightly better as it has:
- slightly more transcripts
- slightly more proteins
- slightly longer proteins
- slightly more proteins considered "complete"
- slightly more proteins that start with Met
Having said that, I am currently BLASTing the "proteomes" against each other.
Up to 82% of proteins in the unstranded "proteome" have full-length hits in the stranded "proteome". So there there is plenty of agreement - although I'd need to dig a little deeper here to see if its full-length in both directions, at the same loci, etc.
The more concerning result, though, is that there is up to 17-18% of proteins that are either shorter or longer than their best hit, or have dissimilar sequence, or dont have a clear hit or clear hit in the same genomic location, or differ in some other way.
Thus, w/o further inspection, it is actually not clear which to choose as these "edge cases" might be exactly where using --rf or not makes an important difference.
@JohnUrban #467 Do you know when do I use which tag - (--rf vs --fr) based on the STAR output in ReadsPerGene file?
High counts in Column 3 (forward) suggest a stranded forward library (e.g., Illumina's "fr-firststrand" protocol like TruSeq Stranded mRNA). And, high counts in Column 4 (reverse) indicate a stranded reverse library (e.g., dUTP-based protocols like TruSeq Stranded Total RNA).