TPMCalculator
*_genes.out duplicate genes
Hello, thanks for your good tool!
My problem is that some genes appear more than once in the *_genes.out result. I aligned the reads with hisat2 and ran:
TPMCalculator -g ${GRCh38GTF} -b ${BamDir}/${Sample}.bam \
-c 150 -k gene_id -t transcript_id -q 60 -p -a
After that, some genes have multiple records in *_genes.out, for example "ENSG00000135406" and "ENSG00000235538":
# first 7 columns
ENSG00000135406.14#1 chr12 49293251 49293996 746 0 0
ENSG00000135406.14#2 chr12 49295146 49298685 3540 0 0
# ENSG00000235538
ENSG00000235538.3#1 chr6 163671576 163798848 127273 4 0.00366198
ENSG00000235538.3#2 chr6 163927121 164009979 82859 2 0.00281244
ENSG00000235538.3#3 chr6 164228330 164231609 3280 0 0
This is my GTF record:
$ grep "ENSG00000235538" gencode.v35.annotation.gtf | awk -v FS="\t" '$3!="exon" {$NF="";print}'
chr6 HAVANA gene 163671577 164231610 . + .
chr6 HAVANA transcript 163671577 163760739 . + .
chr6 HAVANA transcript 163671603 163716199 . + .
chr6 HAVANA transcript 163671609 163774630 . + .
chr6 HAVANA transcript 163671613 163774630 . + .
chr6 HAVANA transcript 163671641 163774628 . + .
chr6 HAVANA transcript 163671642 163773768 . + .
chr6 HAVANA transcript 163671643 163760376 . + .
chr6 HAVANA transcript 163671658 163760376 . + .
chr6 HAVANA transcript 163671666 163774624 . + .
chr6 HAVANA transcript 163671678 163774626 . + .
chr6 HAVANA transcript 163703904 163759841 . + .
chr6 HAVANA transcript 163748718 163760382 . + .
chr6 HAVANA transcript 163759267 163798849 . + .
chr6 HAVANA transcript 163927122 164009980 . + .
chr6 HAVANA transcript 164228331 164231610 . + .
Is this a bug?
Hi, this is the normal way TPMCalculator quantifies RNA-Seq abundance for copies of the same gene. If you look at the output, the third column is the start coordinate of the gene. In your example, each gene copy starts at a different position. TPMCalculator appends #1, #2, #3, ... to the gene ID to identify the copies.
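To work with these records downstream, the copy suffix can be split off the ID so that rows belonging to the same gene are grouped together. The following is only a minimal sketch: the helper names `split_copy_id` and `group_copies` are hypothetical, and it assumes the *_genes.out columns are whitespace-separated with the ID, chromosome, start, and end in the first four columns, as in the excerpt above.

```python
from collections import defaultdict


def split_copy_id(record_id):
    """Split 'ENSG00000235538.3#2' into ('ENSG00000235538.3', 2).

    IDs without a numeric '#n' suffix are returned unchanged with copy=None.
    """
    base, sep, copy = record_id.rpartition("#")
    if not sep or not copy.isdigit():
        return record_id, None
    return base, int(copy)


def group_copies(lines):
    """Group output rows by gene ID, ignoring the '#n' copy suffix."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        gene_id, copy = split_copy_id(fields[0])
        # Keep copy number, chromosome, start, and end for each record.
        groups[gene_id].append((copy, fields[1], int(fields[2]), int(fields[3])))
    return dict(groups)


# Rows taken from the output excerpt in the question (first 7 columns).
rows = [
    "ENSG00000135406.14#1 chr12 49293251 49293996 746 0 0",
    "ENSG00000135406.14#2 chr12 49295146 49298685 3540 0 0",
    "ENSG00000235538.3#1 chr6 163671576 163798848 127273 4 0.00366198",
    "ENSG00000235538.3#2 chr6 163927121 164009979 82859 2 0.00281244",
    "ENSG00000235538.3#3 chr6 164228330 164231609 3280 0 0",
]

for gene_id, copies in group_copies(rows).items():
    print(gene_id, len(copies), "copies")
```

Grouping on the base ID makes it easy to see which genes were split into several records and where each copy sits on the chromosome.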
Then how does TPMCalculator identify gene copies? In my GTF records, only the first row has the feature "gene"; all the others are "transcript".
And I can't see anything special about the last two transcripts in their attributes.
$ grep "ENSG00000235538" gencode.v35.annotation.gtf | awk -v FS="\t" '$3!="exon" {print $9}'
gene_id "ENSG00000235538.3"; gene_type "lncRNA"; gene_name "AL078602.1"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000195978.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000665613.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-201"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000518669.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000671100.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-202"; level 2; tag "basic"; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000522294.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000657614.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-203"; level 2; tag "basic"; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000507375.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000665405.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-204"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000505809.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000669147.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-205"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000506093.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000657157.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-206"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000512883.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000667749.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-207"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000506923.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000664207.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-208"; level 2; tag "basic"; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000509534.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000659903.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-209"; level 2; tag "basic"; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000508554.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000666400.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-210"; level 2; tag "basic"; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000521262.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000452944.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-211"; level 2; transcript_support_level "5"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000043020.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000669856.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-212"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000512193.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000657138.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-213"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000513259.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000659063.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-214"; level 2; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000527573.1";
gene_id "ENSG00000235538.3"; transcript_id "ENST00000654484.1"; gene_type "lncRNA"; gene_name "AL078602.1"; transcript_type "lncRNA"; transcript_name "AL078602.1-215"; level 2; tag "basic"; tag "TAGENE"; havana_gene "OTTHUMG00000195978.1"; havana_transcript "OTTHUMT00000527575.1";
We identify the copies using the genomic coordinates. If the transcripts of the same gene lie in different genomic regions and do not overlap, we mark each region as a separate copy of that gene.
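The rule described in this reply can be illustrated by merging overlapping transcript intervals and numbering the resulting regions. This is a rough sketch of the idea only, not TPMCalculator's actual implementation: the function name `cluster_intervals` is hypothetical, and whether the tool merges at the transcript or exon level, or how it treats directly adjacent intervals, is an assumption here. The coordinates are the transcript rows from the GTF excerpt in the question.

```python
def cluster_intervals(intervals):
    """Merge overlapping (start, end) intervals into non-overlapping regions.

    Intervals are sorted by start; an interval that begins at or before the
    end of the current region extends it, otherwise it opens a new region.
    """
    clusters = []
    for start, end in sorted(intervals):
        if clusters and start <= clusters[-1][1]:
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            clusters.append([start, end])
    return [tuple(c) for c in clusters]


# Transcript coordinates of ENSG00000235538.3 from gencode.v35.annotation.gtf
# (1-based, as listed in the question).
transcripts = [
    (163671577, 163760739), (163671603, 163716199), (163671609, 163774630),
    (163671613, 163774630), (163671641, 163774628), (163671642, 163773768),
    (163671643, 163760376), (163671658, 163760376), (163671666, 163774624),
    (163671678, 163774626), (163703904, 163759841), (163748718, 163760382),
    (163759267, 163798849), (163927122, 164009980), (164228331, 164231610),
]

gene_id = "ENSG00000235538.3"
for n, (start, end) in enumerate(cluster_intervals(transcripts), 1):
    # Label each region the way the output labels copies: gene_id#n
    print(f"{gene_id}#{n}\t{start}\t{end}")
```

With these inputs the first thirteen transcripts chain into one region and the last two each form their own, giving three regions. Their boundaries match the three ENSG00000235538.3 rows in the reported *_genes.out, apart from an offset of one in both start and end, which suggests the output uses shifted (possibly 0-based) coordinates.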