TPMCalculator icon indicating copy to clipboard operation
TPMCalculator copied to clipboard

Input files and GeneID Errors

Open quimollo opened this issue 3 years ago • 5 comments

Hi! Thanks for developing such a useful tool! :) I've installed it with Bioconda in a linux server and I'm trying to use it to analyze the STAR BAM files from PE reads, but I got two errors.

When running the following command: nohup /localhome/icmm/kns601/.local/bin/anaconda3/bin/TPMCalculator -g /path/to/gencode.v35.annotation.gtf -d BAMwithMT -p -k gene_id &> nohup_TPMCounts_withMT.out &

I got this in the nohup report: nohup: ignoring input Reading GTF file ... Done in 286.83 seconds Parsing BAM files Processing sample: PT979-2_Aligned.sortedByCoord.out (NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL) Chromosome with name: GL000008.2 does not exist Chromosome with name: GL000008.2 does not exist Chromosome with name: GL000008.2 does not exist Chromosome with name: GL000008.2 does not exist Chromosome with name: GL000008.2 does not exist Chromosome with name: GL000008.2 does not exist ... Chromosome with name: KI270757.1 does not exist Chromosome with name: KI270757.1 does not exist Chromosome with name: KI270757.1 does not exist Chromosome with name: KI270757.1 does not exist Chromosome with name: KI270757.1 does not exist Chromosome with name: KI270757.1 does not exist in 2180.64 seconds. Processed 50949830 reads Processing sample: 2730-006_Aligned.sortedByCoord.outCan't open file BAMwithMT/2730-006_Aligned.sortedByCoord.out.bam

The first problem I had is that its taking the geneIDs as chromosomes, and it prompts the "does not exist" error. It seems that when analyzing the genes from GTF reference that are in the regular gene annotation it doesn't prompt any error, but when arriving to the ones that are outside that annotation gets the geneID as chromosome for some reson.

The other problem is that the following file is not read properly, as the BAMwithMT/2730-006_Aligned.sortedByCoord.out.bam exists in my folder and the program itself has printed the path as I was using the "-d" option.

Can you please help me solving this issues?

Thanks! Quim

quimollo avatar Nov 26 '20 19:11 quimollo

Hi, could you, please share a few reads from your BAM file and the GTF?

r78v10a07 avatar Nov 27 '20 15:11 r78v10a07

Sure!

BAM file: NB501508:524:HGT27BGXB:1:22103:1791:17763 99 chr1 14473 255 75M = 14641 244 GGCTGGGTGGAGCCGTCCCCCCATGGAGCCCAGGCAGACAGAAGTCCCCCCCCCAGCTGTGTGGCCTCAGGCCAG AAAAAEEEEEAEEEEEEEEEEA/EEA/EE/E/E/EAE/EEA/EEE/EEE/EEEEAEE/EEEEEEEEEE/EEEE/E NH:i:1 HI:i:1 AS:i:141 nM:i:4 NB501508:524:HGT27BGXB:4:11502:4299:14935 99 chr1 14626 255 45M1D31M = 14704 154 GTCCTGGCTGTGTCCATGTCAGAGCAATGGCCCAAGTCTGGGTCTGGGGGGAAGGTGTCATGGAGCCCCCTACGAT AAAAAEEEEAEEEEEEEEAEEEAEEEEEE/EEEEEAEE/6/EEEEEAEEEEEEEE/EEEEEEEA/EEEEEEEEEE6 NH:i:1 HI:i:1 AS:i:142 nM:i:2 NB501508:524:HGT27BGXB:2:11302:21613:15229 99 chr1 14637 255 76M = 14685 124 GTCCATGTCAGAGCAATGGCCCAAGTCTGGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTACGATTCCCAGTCGT AAAAAEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEE<EEEAAE/EEEEEEEEEEE NH:i:1 HI:i:1 AS:i:148 nM:i:1 NB501508:524:HGT27BGXB:1:22103:1791:17763 147 chr1 14641 255 76M = 14473 -244 ATGTCAGAGCAATGGCCCAAGTCTGGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTACGATTCCCAGTCGTCCTC EEE/E/EEEE/E/EEEEEEEE/E6EEE/E/EEEEEEEEEEEEE/EA/EE/EEEEEEEEEEEEAEEEEE/EE/AAAA NH:i:1 HI:i:1 AS:i:141 nM:i:4

GTF file: ##description: evidence-based annotation of the human genome (GRCh38), version 35 (Ensembl 101) ##provider: GENCODE ##contact: [email protected] ##format: gtf ##date: 2020-06-03 chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2"; chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1"; chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

quimollo avatar Nov 27 '20 17:11 quimollo

Hi, Don't use the -d option and run one TPMCalculator for BAM file using the -b option. The message: "Chromosome with name: GL000008.2 does not exist" is saying that you have reads aligned to the reference GL000008.2 and that reference is not in the GTF. Are you using the same GTF for the alignment and TMPCalculator?

r78v10a07 avatar Nov 30 '20 13:11 r78v10a07

Hi, yes, I'm using the same GTF file from GRCh38.p13. As I use STAR, I need to create indexes, which are recommended to be produced with the genome primary assembly, the GTF file, and the read length. So I create them with: STAR --runThreadN 10 --runMode genomeGenerate --genomeDir /path/to/GENOMES/GRCh38.p13_relase35/STAR_IDX_75/ --genomeFastaFiles /path/to/GENOMES/GRCh38.p13_relase35/GRCh38.primary_assembly.genome.fa --sjdbGTFfile /path/to/GENOMES/GRCh38.p13_relase35/gencode.v35.annotation.gtf --sjdbOverhang 75

After running single BAM files with the following command (-b flag instead of -d): /path/to/TPMCalculator -g /path/to/GENOMES/GRCh38.p13_relase35/gencode.v35.annotation.gtf -b 2730-006_Aligned.sortedByCoord.out.bam -p -k gene_id

I get the same results, with errors messages which report that the following 41 references do not exist as chromosomes (Chromosome with name: REFERENCE does not exist): GL000008.2 GL000009.2 GL000194.1 GL000205.2 GL000214.1 GL000216.2 GL000218.1 GL000219.1 GL000221.1 GL000224.1 GL000225.1 KI270422.1 KI270438.1 KI270442.1 KI270466.1 KI270467.1 KI270706.1 KI270708.1 KI270710.1 KI270711.1 KI270712.1 KI270719.1 KI270722.1 KI270725.1 KI270727.1 KI270728.1 KI270729.1 KI270730.1 KI270731.1 KI270732.1 KI270733.1 KI270736.1 KI270741.1 KI270742.1 KI270743.1 KI270744.1 KI270745.1 KI270746.1 KI270748.1 KI270749.1 KI270751.1

I've found a reply in Biostars where they say that GL000008.2 is a contig in some part of chr4 and that KI270330.1 is an unplaced scaffold that presumably belongs to the human genome reference. https://www.biostars.org/p/215787/ And the unplaced scaffolds also appear in the genome assembly annotation: http://rest.ensembl.org/info/assembly/human

When running the following command: cat 2730-006_Aligned.sortedByCoord.out_genes.out | grep -o "ENSG[0-9]*.[0-9]*" | uniq | wc -l I get the number 25397, which I presume is the number of genes for which the TPMCalculator has quantified the TPMs, when running it with the -b flag.

But when running different samples using the -b flag I get 24631: cat 2730-009_Aligned.sortedByCoord.out_genes.out | grep -o "ENSG[0-9]*.[0-9]*" | uniq | wc -l

On the other hand when running the -d flag I get the same amount in all the samples: 44985.

The number of quantified genes in the GTF file is 60656: cat /path/to/GENOMES/GRCh38.p13_relase35/gencode.v35.annotation.gtf | grep -o "ENSG[0-9]*.[0-9]*" | uniq | wc -l

Is that the output you would expect on a normal run of TPMCalculator when running -b and -d flags? Does it output only the genes from which it has expression?

Thanks!

quimollo avatar Dec 02 '20 13:12 quimollo

I would recommend removing all contigs from the genome and using only the chromosomes. TPMCalcualtor gene model doesn't load the contigs. Therefore, it is reporting that reads on those contigs don't have reference.

--

Roberto Vera Alvarez, PhD

Profiles:

LinkedIn http://www.linkedin.com/in/robertoveraalvarez, ResearchGate http://www.researchgate.net/profile/Roberto_Vera_Alvarez, Google Citations http://scholar.google.it/citations?user=X9OKjuEAAAAJ, Twitter http://twitter.com/r78v10a07

Skype user: r78v10a07

On Wed, Dec 2, 2020 at 8:49 AM quimollo [email protected] wrote:

The commands above are wrong as I didn't put "" before the "*".

When running the following command: cat 2730-006_Aligned.sortedByCoord.out_genes.out | grep -o "ENSG[0-9].[0-9]" | uniq | wc -l I get the number 25397, which I presume is the number of genes for which the TPMCalculator has quantified the TPMs.

But the number of quantified genes I expect to have is 60656, as is the result of the unique ENSG gene references I get with the following command: cat /path/to/GENOMES/GRCh38.p13_relase35/gencode.v35.annotation.gtf | grep -o "ENSG[0-9].[0-9]" | uniq | wc -l

I also found the references for the "KI" unplaced scaffolds in the genome assembly annotation: http://rest.ensembl.org/info/assembly/human

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ncbi/TPMCalculator/issues/62#issuecomment-737241409, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACB6FOHPCQWQDO4WWT7JIBDSSZAXPANCNFSM4UEDNJYQ .

r78v10a07 avatar Dec 03 '20 14:12 r78v10a07