*_genes.out duplicate genes

Open dodoflyy opened this issue 4 years ago • 3 comments

Hello, thanks for your good tool!
My problem is that some genes are duplicated in the *_genes.out result. I used hisat2 to align the reads.

TPMCalculator -g ${GRCh38GTF} -b ${BamDir}/${Sample}.bam \
-c 150 -k gene_id -t transcript_id -q 60 -p -a

After this run, some genes have multiple records in *_genes.out, for example "ENSG00000135406" and "ENSG00000235538":

# first 7 columns
ENSG00000135406.14#1    chr12   49293251        49293996        746     0       0
ENSG00000135406.14#2    chr12   49295146        49298685        3540    0       0

# ENSG00000235538
ENSG00000235538.3#1     chr6    163671576       163798848       127273  4       0.00366198
ENSG00000235538.3#2     chr6    163927121       164009979       82859   2       0.00281244
ENSG00000235538.3#3     chr6    164228330       164231609       3280    0       0

This is my GTF record:

$ grep "ENSG00000235538" gencode.v35.annotation.gtf | awk -v FS="\t" '$3!="exon" {$NF="";print}'
chr6 HAVANA gene 163671577 164231610 . + . 
chr6 HAVANA transcript 163671577 163760739 . + . 
chr6 HAVANA transcript 163671603 163716199 . + . 
chr6 HAVANA transcript 163671609 163774630 . + . 
chr6 HAVANA transcript 163671613 163774630 . + . 
chr6 HAVANA transcript 163671641 163774628 . + . 
chr6 HAVANA transcript 163671642 163773768 . + . 
chr6 HAVANA transcript 163671643 163760376 . + . 
chr6 HAVANA transcript 163671658 163760376 . + . 
chr6 HAVANA transcript 163671666 163774624 . + . 
chr6 HAVANA transcript 163671678 163774626 . + . 
chr6 HAVANA transcript 163703904 163759841 . + . 
chr6 HAVANA transcript 163748718 163760382 . + . 
chr6 HAVANA transcript 163759267 163798849 . + . 
chr6 HAVANA transcript 163927122 164009980 . + . 
chr6 HAVANA transcript 164228331 164231610 . + .

Is this a bug?

dodoflyy avatar Aug 30 '21 09:08 dodoflyy

Hi, this is the normal way TPMCalculator quantifies RNA-Seq abundance for copies of the same gene. If you look at the output, the third column is the start coordinate of the gene. In your example, each gene copy starts at a different position. TPMCalculator uses #1, #2, #3, ... to identify the copies.

r78v10a07 avatar Aug 30 '21 12:08 r78v10a07
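
To see which genes were split into copies, the `#N` suffix can be stripped and the rows grouped by the base gene ID. The following is a minimal sketch, not part of TPMCalculator: the function name `group_copies` is hypothetical, and the column order (ID, chromosome, start, end) is assumed from the excerpt posted above. Rows whose ID has no numeric `#N` suffix, such as a header line, are skipped.

```python
from collections import defaultdict


def group_copies(lines):
    """Map base gene ID -> list of (copy_number, chrom, start, end).

    Assumes whitespace-separated rows whose first four fields are
    ID, chromosome, start, end, as in the excerpt from this thread.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or truncated line
        gene, sep, copy = fields[0].rpartition("#")
        if not sep or not copy.isdigit():
            continue  # header, comment, or ID without a #N suffix
        groups[gene].append((int(copy), fields[1], int(fields[2]), int(fields[3])))
    return dict(groups)


# Rows copied from the *_genes.out excerpt in this issue.
rows = [
    "# first 7 columns",
    "ENSG00000135406.14#1    chr12   49293251        49293996        746     0       0",
    "ENSG00000135406.14#2    chr12   49295146        49298685        3540    0       0",
    "ENSG00000235538.3#1     chr6    163671576       163798848       127273  4       0.00366198",
    "ENSG00000235538.3#2     chr6    163927121       164009979       82859   2       0.00281244",
    "ENSG00000235538.3#3     chr6    164228330       164231609       3280    0       0",
]

copies = group_copies(rows)
for gene, records in copies.items():
    print(gene, len(records), "copies")
```

For a real file, pass the open file handle instead of `rows`, for example `group_copies(open("sample_genes.out"))` (the file name here is only a placeholder).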

Then how does TPMCalculator identify gene copies? In my GTF records, only the first row has the feature "gene"; the others are "transcript".
And I can't see anything special about the last 2 transcripts in the attributes.

$ grep "ENSG00000235538" gencode.v35.annotation.gtf | awk -v FS="\t" '$3!="exon" {print $9}'
gene_id "ENSG00000235538.3"; gene_type "lncRNA"; gene_name "AL078602.1"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000195978.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000665613.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-201"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000518669.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000671100.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-202"; level 2; tag "basic"; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000522294.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000657614.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-203"; level 2; tag "basic"; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000507375.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000665405.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-204"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000505809.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000669147.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-205"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000506093.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000657157.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-206"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000512883.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000667749.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-207"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000506923.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000664207.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-208"; level 2; tag "basic"; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000509534.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000659903.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-209"; level 2; tag "basic"; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000508554.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000666400.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-210"; level 2; tag "basic"; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000521262.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000452944.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-211"; level 2; transcript_support_level "5"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000043020.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000669856.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-212"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000512193.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000657138.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-213"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000513259.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000659063.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-214"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000527573.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000654484.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-215"; level 2; tag "basic"; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000527575.1";

dodoflyy avatar Aug 31 '21 01:08 dodoflyy

We identify the copies using the genomic coordinates. If the transcripts of the same gene are in different genomic regions and don't overlap, we mark them as copies of the same gene.

r78v10a07 avatar Aug 31 '21 11:08 r78v10a07
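
The coordinate rule described in the reply above can be illustrated by merging overlapping transcript intervals: each merged block that does not overlap any other becomes one "copy". This is only a sketch under stated assumptions and not TPMCalculator's actual code. The function name `cluster_intervals` is hypothetical, intervals are treated as closed, and two intervals are taken to overlap when they share at least one base.

```python
def cluster_intervals(intervals):
    """Merge overlapping closed (start, end) intervals into blocks.

    Assumption: intervals overlap when the next start is <= the
    current block's end. Returns the blocks sorted by start.
    """
    clusters = []
    for start, end in sorted(intervals):
        if clusters and start <= clusters[-1][1]:
            # Overlaps the current block: extend it if needed.
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            # No overlap: start a new block (a new "copy").
            clusters.append([start, end])
    return [tuple(c) for c in clusters]


# Transcript coordinates of ENSG00000235538 from the GTF excerpt in this issue.
transcripts = [
    (163671577, 163760739), (163671603, 163716199), (163671609, 163774630),
    (163671613, 163774630), (163671641, 163774628), (163671642, 163773768),
    (163671643, 163760376), (163671658, 163760376), (163671666, 163774624),
    (163671678, 163774626), (163703904, 163759841), (163748718, 163760382),
    (163759267, 163798849), (163927122, 164009980), (164228331, 164231610),
]

blocks = cluster_intervals(transcripts)
for number, (start, end) in enumerate(blocks, start=1):
    print(f"ENSG00000235538.3#{number}", start, end)
```

With the GTF coordinates above, the first thirteen transcripts merge into one block and the last two stay separate, giving three blocks. These correspond to the #1, #2 and #3 rows in the *_genes.out excerpt, whose start and end values are each one lower than the GTF coordinates.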