Are some cDNAs omitted from the kallisto index?

Open jamesboot opened this issue 3 years ago • 12 comments

Hi,

During some comparisons of STAR and kallisto we saw that the number of genes/transcripts that STAR aligns to and outputs is much larger than kallisto. This obviously makes sense and is expected, as STAR is aligning to the whole genome (human) whereas kallisto is only pseudo-aligning to the transcriptome/cDNA reference. However, on further investigation we found that there were approximately 20,000 human genes that STAR aligns to that are in the human cDNA reference - yet kallisto does not pseudo-align to these genes.

We were just wondering why this might be the case - are certain cDNA species omitted when the kallisto index is built? And if so why are some cDNA species omitted?

Thanks for your time in advance!

jamesboot avatar Sep 24 '21 09:09 jamesboot

No cDNA species are omitted and you should not be observing such results (as they contradict the results of numerous benchmarks performed in the literature). Further, it is not true that it would make sense: an RNA-seq library consists of cDNA fragments (which is what kallisto maps against) and does not contain the whole genome. Can you show how you're building the kallisto index and which files you're using? And can you also show the kallisto pseudoalignment command(s) you're running and how you're summarizing transcript results to the gene-level?

Yenaled avatar Sep 24 '21 10:09 Yenaled

For building the kallisto index I'm using the Ensembl Homo_sapiens.GRCh38.cdna.all.fa file. Below is the command:

    kallisto index -i /path/to/index /path/to/Homo_sapiens.GRCh38.cdna.all.fa.gz

Running pseudo-alignment:

    kallisto quant -i /path/to/index/ \
        -o $SUBDIR \
        -b 100 \
        -t 4 \
        $READS1 $READS2

Using tximport to import transcript abundances and summarise to gene level.

jamesboot avatar Sep 24 '21 11:09 jamesboot

Thanks for the additional information. That Ensembl FASTA file does not contain any non-coding RNAs. Those will be missing (there are a little over 20,000 of those genes).
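One way to check what a given FASTA file covers is to tally the biotype tags in its headers. The sketch below assumes Ensembl-style headers that carry a `gene_biotype:<value>` field; the helper name `count_biotypes` is made up for illustration. It counts transcript records per gene biotype, not distinct genes.

```shell
# Tally gene_biotype tags found in FASTA header lines.
# Assumption: headers follow the Ensembl convention "... gene_biotype:<value> ...".
# gzip -cdf passes plain (non-gzipped) input through unchanged.
count_biotypes() {
  gzip -cdf "$1" \
    | grep '^>' \
    | grep -o 'gene_biotype:[^ ]*' \
    | sort | uniq -c | sort -rn
}

# Example call, using the file named earlier in this thread:
# count_biotypes /path/to/Homo_sapiens.GRCh38.cdna.all.fa.gz
```

If biotypes such as lncRNA do not appear in the output, transcripts of those genes are not in the file and so cannot receive any pseudoaligned reads.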

Yenaled avatar Sep 24 '21 12:09 Yenaled

That makes sense, thanks for your help. Is there a cDNA FASTA file you would recommend using?

jamesboot avatar Sep 27 '21 09:09 jamesboot

I recommend using our kb tool to create indices:

https://www.kallistobus.tools/kb_usage/kb_usage.html

Install kb-python and then download the genome primary assembly and GTF file to create a transcriptome index.
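As a rough sketch of that step, assuming kb-python is installed and the primary assembly FASTA and GTF are already downloaded: the wrapper name `build_kb_index` and the output file names (`index.idx`, `t2g.txt`, `cdna.fa`) are illustrative choices, not something prescribed in this thread. Check `kb ref --help` for the options available in your installed version.

```shell
# Build a kallisto index from a genome FASTA plus GTF via "kb ref".
# Arguments: <genome.fa(.gz)> <annotation.gtf(.gz)> <output directory>
# Output file names below are hypothetical; pick whatever suits your layout.
build_kb_index() {
  genome="$1"; gtf="$2"; outdir="$3"
  mkdir -p "$outdir"
  kb ref \
    -i "$outdir/index.idx" \
    -g "$outdir/t2g.txt" \
    -f1 "$outdir/cdna.fa" \
    "$genome" "$gtf"
}

# Example call (paths are placeholders):
# build_kb_index /path/to/genome.primary_assembly.fa.gz /path/to/annotation.gtf.gz /path/to/kb_index
```

Here `-i` names the index to write, `-g` the transcript-to-gene map, and `-f1` the transcript FASTA that kb generates from the genome and GTF.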

Alternatively, you can simply append the non-coding RNA FASTA file to your cDNA FASTA file.
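A minimal sketch of the append approach follows. The helper name `combine_fasta` is made up, and the ncRNA file name in the example is an assumption based on Ensembl's naming pattern, so verify it against the Ensembl FTP listing for your release.

```shell
# Concatenate one or more FASTA files into a single file.
# Gzipped inputs can be concatenated as-is, because back-to-back gzip
# members form a valid gzip stream. Do not mix gzipped and plain inputs.
combine_fasta() {
  out="$1"; shift
  cat "$@" > "$out"
}

# Example (ncRNA file name is an assumption; paths are placeholders):
# combine_fasta /path/to/cdna_ncrna.fa.gz \
#     /path/to/Homo_sapiens.GRCh38.cdna.all.fa.gz \
#     /path/to/Homo_sapiens.GRCh38.ncrna.fa.gz
# kallisto index -i /path/to/index /path/to/cdna_ncrna.fa.gz
```

After rebuilding the index from the combined file, `kallisto quant` needs to be rerun so that reads can be assigned to the added non-coding transcripts.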

Yenaled avatar Sep 27 '21 11:09 Yenaled