minimap2 minimap2 large memory consumption inside souporcell pipeline

Hello,

following this issue: https://github.com/wheaton5/souporcell/issues/179

minimap is used in souporcell pipeline like: https://github.com/wheaton5/souporcell/blob/master/souporcell_pipeline.py#L255#L257

I am out of ideas and thus reach out in case you have any suggestions. I could provide the problematic input file in case you could have a look. Thanks a lot in advance!

Apr 26 '23 14:04 clementhelsens

What are the reference and the query sequences?

Apr 26 '23 14:04 lh3

The reference is danio_rerio from ensembl primary assembly and for the query sequences In this google drive you have the reference genome, the input fastq file and the minimap error:

I use version 2.17-r941 running command:

minimap2 -ax splice -t 20 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no genome.fa tmp_13_0.fq

and the minimap error is:

[M::mm_idx_gen::24.271*1.62] collected minimizers
[M::mm_idx_gen::26.108*2.47] sorted minimizers
[M::main::26.109*2.47] loaded/built the index for 993 target sequence(s)
[M::mm_mapopt_update::26.109*2.47] mid_occ = 1000
[M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 993
[M::mm_idx_stat::27.537*2.39] distinct minimizers: 155448698 (92.45% are singletons); average occurrences: 1.508; average spacing: 5.859
[M::worker_pipeline::31.378*3.90] mapped 555556 sequences
Killed

Apr 27 '23 07:04 clementhelsens

Could you try the latest version?

Apr 30 '23 01:04 lh3

Your fastq is coordinate sorted. This may cause various problems with both bwa and minimap2. Please randomize the input bam with samtools collate like this example.

Apr 30 '23 02:04 lh3

hello @lh3 , thanks for the suggestions! Looks like indeed I was using an old version of minimap2. I just tried with 2.26 and it worked just fine, the memory consumption was reasonable and stable!

May 03 '23 06:05 clementhelsens

Thanks for confirmation. Minimap2 is not developed for short-read RNA-seq, mostly because it can't handle paired-end reads. How does souporcell use minimap2? Does it work?

May 04 '23 16:05 lh3

hello @lh3, I can not answer in detail your question, thus let me tag the main developer @wheaton5

May 05 '23 05:05 clementhelsens

I have used minimap2 for short read scRNAseq with custom parameters. It does not deal with paired end reads as scRNAseq cDNA reads only come from one read. The other read has the UMI and cell barcode etc. I dont think this should be of concern to you. I should deal with this within my own project. I have an option for no remapping or for hisat2. Still, thanks for responding. Fyi I was advised by durbin for my phd and I very much look up to you. I dont want to take up your valuable time on this issue that I should resolve myself.

May 05 '23 05:05 wheaton5

@clementhelsens lets work together to either use the skip remapping option or use hisat2 which is optimized for short read rnaseq data

May 05 '23 05:05 wheaton5

Thanks for your response about randomization. This seems like an easy fix on my end. The only reason it is sorted is that the input is a sorted bam but the mapping for rnaseq is very poorly optimized for variant calling. Initially i used minimap to fix this. Later i found hisat2 is better for short read rnaseq (as you said, this is not the intention for minimap2).

May 05 '23 05:05 wheaton5

Sure! thanks @wheaton5, but as I mentioned in the thread, following @lh3 suggestion, using a more recent version of minimap2 solved the very large memory consumption issue, but I don't know about the reproducibility. Let's follow up on souporcell issue for alternative options. I'm happy to test on my end with my use case which is similar to Fig 1-A here

May 05 '23 05:05 clementhelsens

minimap2 minimap2 copied to clipboard

minimap2 large memory consumption inside souporcell pipeline

minimap2
minimap2 copied to clipboard