transcripts with identical coordinates but different abundances

Open zjuezhen opened this issue 1 year ago • 4 comments

Hello,

Thanks so much for developing this tool, I find it quite easy to use.

I'm wondering what causes duplicated transcripts (identical coordinates, different transcript IDs) to receive different quantifications within the same sample. The duplicated transcript comes from the reference GTF (human, GENCODE v37), which contains two transcripts with different transcript_id values but the same genomic coordinates (chr1:154582076-154585043:154582076-154608204:154585217-154585865:154586181-154586363:154588125-154588258:154588551-154588673:154589369-154589462:154589757-154590409:154596805-154596995:154597123-154597267:154597828-154597976:154598402-154598585:154601041-154602626:154607992-154608204:-). Does this mean the reference GTF should be pre-filtered to contain only transcripts with unique genomic coordinates?

The coverage for the region chr1:154582076-154608204 in IGV suggests that the higher quantification (1st screenshot) is correct: igv_snapshot_ACTTGA_chr1_154582076-154608204_neg

The command line call was:

stringtie \
-G ref_annot.gtf \
--ref ref_genome.fa \
-o STRG_transcripts.${sampleID}.gtf \
-A STRG_gene_abundances.${sampleID}.tab \
-B --rf -t -c 1 -f 0.01 -M 0.95 -p 50 -v ${bamdir}/${sampleID}.sortedByCoord.out.bam

Any advice on this is greatly appreciated!

Best regards, Jenny

May 09 '24 20:05 zjuezhen