Improving speed when running `kb count`

Open reetm09 opened this issue 1 year ago • 3 comments

When you pass multiple FASTQ files to `kb count`, does it process them sequentially, or is there a way to parallelize it? For me, the first step, "kallisto bus" (loading the index and mapping), takes the longest. Is there a way to parallelize this step, or are there any other tips to improve speed?

Thank you!

reetm09, Nov 06 '23 23:11

It should parallelize automatically (rather than reading the files sequentially) if you enable many threads with the `-t` option -- that's one reason that splitting FASTQ files into multiple chunks enables faster processing.
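As a rough sketch of the chunking idea, assuming gzip-compressed input and a POSIX `split`: the helper name `split_fastq`, the chunk size, and the output prefix are hypothetical and not part of kb-python. A FASTQ record is four lines, so each chunk must hold a multiple of four lines, and R1 and R2 need to be split with the same chunk size so the read pairs stay in sync.

```shell
# Split a gzip-compressed FASTQ file into chunks of a fixed number of reads.
# A FASTQ record is 4 lines, so lines per chunk = reads per chunk * 4.
split_fastq() {
  in_fastq="$1"         # e.g. subSample1_R1.fastq.gz
  reads_per_chunk="$2"  # hypothetical chunk size, e.g. 1000000
  out_prefix="$3"       # hypothetical prefix, e.g. chunks/subSample1_R1.
  # Writes uncompressed chunks named <prefix>aa, <prefix>ab, ...
  gzip -dc "$in_fastq" | split -l "$((reads_per_chunk * 4))" - "$out_prefix"
}

# Usage: split both mates with the same chunk size.
# split_fastq subSample1_R1.fastq.gz 1000000 chunks/subSample1_R1.
# split_fastq subSample1_R2.fastq.gz 1000000 chunks/subSample1_R2.
```

The chunks would then be listed on the `kb count` command line as R1/R2 pairs, in the same order for both mates.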

kallisto should be pretty fast unless you're doing single-nucleus RNA-seq or RNA velocity -- with enough threads, it will only take 1-3 seconds to process a million reads.
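To put the 1-3 seconds per million reads figure in context, a back-of-the-envelope estimate could look like the sketch below. `count_reads` and `estimate_seconds` are hypothetical helper names, and the estimate only reflects the figure quoted above for the standard case with enough threads; it does not cover index loading or the slower RNA velocity workflow.

```shell
# Count the reads in a gzip-compressed FASTQ file (4 lines per record).
count_reads() {
  lines=$(gzip -dc "$1" | wc -l)
  echo "$((lines / 4))"
}

# Print a rough mapping-time range in seconds for a given number of reads,
# using the quoted 1-3 seconds per million reads.
estimate_seconds() {
  awk -v n="$1" 'BEGIN { printf "%.1f-%.1f\n", n / 1e6, 3 * n / 1e6 }'
}

# Usage:
# estimate_seconds "$(count_reads subSample1_R1.fastq.gz)"
```

Checking the read count this way is also a quick sanity test when a file's size and its assumed number of reads do not seem to match.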

Also, make sure you're using the current version of kb-python (version 0.27.3) since speed improvements have been made.
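One way to check the installed version against 0.27.3 is sketched below, assuming kb-python was installed with `pip` and that `sort -V` (version sort) is available. `installed_kb_version` and `version_at_least` are hypothetical helper names.

```shell
# Print the installed kb-python version as reported by pip (empty if not installed).
installed_kb_version() {
  pip show kb-python 2>/dev/null | awk '/^Version:/ { print $2 }'
}

# Succeed if version $1 is greater than or equal to version $2.
# Assumes `sort -V` is available (GNU coreutils and recent BSD/macOS sort).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage:
# if ! version_at_least "$(installed_kb_version)" "0.27.3"; then
#   pip install --upgrade kb-python
# fi
```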

Finally, post issues on the kallisto or the kb-python GitHub page -- I'm usually more responsive on those pages.

Yenaled, Nov 07 '23 00:11

Hi,

Thank you so much for your quick response! This is the command I'm running for RNA velocity analysis. It currently takes 30-40 minutes; each of the FASTQ files has 1000 reads, and the index file is ~40GB. Additionally, each of the files here is 119MB. Is this expected?

kb count --h5ad -i index.idx -g t2g.tsv -x 10xv2 --workflow lamanno -c1 cdna.t2g.tsv -c2 introns.t2g.tsv -o subSample1 --filter bustools -t 20 subSample1_R1.fastq.gz subSample1_R2.fastq.gz

Additionally, just to clarify once again, if I specify the following command, it should already be parallelizing?

kb count --h5ad -i index.idx -g t2g.tsv -x 10xv2 --workflow lamanno -c1 cdna.t2g.tsv -c2 introns.t2g.tsv -o subSample --filter bustools -t 20 subSample1_R1.fastq.gz subSample1_R2.fastq.gz subSample2_R1.fastq.gz subSample2_R2.fastq.gz subSample3_R1.fastq.gz subSample3_R2.fastq.gz

Or do I need to do anything additional to split the FASTQ files into multiple chunks? And would the output folder (subSample) here contain the combined .h5ad file?

Thanks so much for your help!

reetm09, Nov 07 '23 00:11

OK, yes, RNA velocity is just slow with kallisto. This will change in our forthcoming release of kb-python (version 0.28; currently on the devel branch), which will be released in the next week or so.

I don't think there's much you can do in terms of speed with the current version of kb-python.

And yes, it will be parallelizing automatically with the command you supplied (and the output will be no different than combining the subsamples into a single fastq file).
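A minimal sketch of building that multi-file argument list, assuming the `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` naming used in the command above. `paired_fastqs` is a hypothetical helper, not part of kb-python; it only keeps each R1 next to its R2.

```shell
# Print "<sample>_R1.fastq.gz <sample>_R2.fastq.gz" for each sample name given,
# so that every R1 file is immediately followed by its R2 mate.
paired_fastqs() {
  for sample in "$@"; do
    printf '%s_R1.fastq.gz %s_R2.fastq.gz ' "$sample" "$sample"
  done
}

# Usage (the unquoted expansion relies on word splitting, so sample names
# must not contain spaces):
# kb count --h5ad -i index.idx -g t2g.tsv -x 10xv2 --workflow lamanno \
#   -c1 cdna.t2g.tsv -c2 introns.t2g.tsv -o subSample --filter bustools -t 20 \
#   $(paired_fastqs subSample1 subSample2 subSample3)
```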

Yenaled, Nov 07 '23 02:11