minimap2
minimap2 copied to clipboard
minimap2 large memory consumption inside souporcell pipeline
Hello,
following this issue: https://github.com/wheaton5/souporcell/issues/179
minimap is used in souporcell pipeline like: https://github.com/wheaton5/souporcell/blob/master/souporcell_pipeline.py#L255#L257
I am out of ideas and thus reach out in case you have any suggestions. I could provide the problematic input file in case you could have a look. Thanks a lot in advance!
What are the reference and the query sequences?
The reference is danio_rerio from ensembl primary assembly and for the query sequences In this google drive you have the reference genome, the input fastq file and the minimap error:
I use version 2.17-r941
running command:
minimap2 -ax splice -t 20 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no genome.fa tmp_13_0.fq
and the minimap error is:
[M::mm_idx_gen::24.271*1.62] collected minimizers
[M::mm_idx_gen::26.108*2.47] sorted minimizers
[M::main::26.109*2.47] loaded/built the index for 993 target sequence(s)
[M::mm_mapopt_update::26.109*2.47] mid_occ = 1000
[M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 993
[M::mm_idx_stat::27.537*2.39] distinct minimizers: 155448698 (92.45% are singletons); average occurrences: 1.508; average spacing: 5.859
[M::worker_pipeline::31.378*3.90] mapped 555556 sequences
Killed
Could you try the latest version?
Your fastq is coordinate sorted. This may cause various problems with both bwa and minimap2. Please randomize the input bam with samtools collate like this example.
hello @lh3 , thanks for the suggestions! Looks like indeed I was using an old version of minimap2. I just tried with 2.26 and it worked just fine, the memory consumption was reasonable and stable!
Thanks for confirmation. Minimap2 is not developed for short-read RNA-seq, mostly because it can't handle paired-end reads. How does souporcell use minimap2? Does it work?
hello @lh3, I can not answer in detail your question, thus let me tag the main developer @wheaton5
I have used minimap2 for short read scRNAseq with custom parameters. It does not deal with paired end reads as scRNAseq cDNA reads only come from one read. The other read has the UMI and cell barcode etc. I dont think this should be of concern to you. I should deal with this within my own project. I have an option for no remapping or for hisat2. Still, thanks for responding. Fyi I was advised by durbin for my phd and I very much look up to you. I dont want to take up your valuable time on this issue that I should resolve myself.
@clementhelsens lets work together to either use the skip remapping option or use hisat2 which is optimized for short read rnaseq data
Thanks for your response about randomization. This seems like an easy fix on my end. The only reason it is sorted is that the input is a sorted bam but the mapping for rnaseq is very poorly optimized for variant calling. Initially i used minimap to fix this. Later i found hisat2 is better for short read rnaseq (as you said, this is not the intention for minimap2).
Sure! thanks @wheaton5, but as I mentioned in the thread, following @lh3 suggestion, using a more recent version of minimap2 solved the very large memory consumption issue, but I don't know about the reproducibility. Let's follow up on souporcell issue for alternative options. I'm happy to test on my end with my use case which is similar to Fig 1-A here