
Vireo taking ages to run

lucygarner opened this issue 2 years ago · 4 comments

Hi,

I have some single-cell RNA-seq data for which I don't have genotype information.

I ran cellSNP-lite on a merged BAM file containing all of the donors to genotype the single cells as follows:

cellsnp-lite -s data.dir/merged.bam -b data.dir/barcodes.tsv -O results.dir/merged.dir -R vcf/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz --genotype --minCOUNT 10 --minMAF 0.1 -p 10

I am now running Vireo as follows:

vireo -c results.dir/merged.dir -N 4 -o results/merged.dir --randSeed=3245 -p 30

However, it has been running for three days and still hasn't finished. I have spoken to others who have used Vireo and they mentioned that it was fast, so I'm not sure if I'm doing something wrong?

This is the log message so far:

[vireo] Loading cell folder ...
[vireo] Demultiplex 41622 cells to 4 donors with 104779 variants.

Many thanks for the help.

Best wishes, Lucy

lucygarner · Mar 17 '22 10:03

This is the log for cellSNP-lite in case that helps.

[I::main] start time: 2022-03-07 10:57:55
[W::check_args] Max depth set to maximum value (2147483647)
[I::main] loading the VCF file for given SNPs ...
[I::main] fetching 7352497 candidate variants ...
[I::main] mode 1a: fetch given SNPs in 41622 single cells.
[I::csp_fetch_core][Thread-2] 2.00% SNPs processed.
[I::csp_fetch_core][Thread-3] 2.00% SNPs processed.
[I::csp_fetch_core][Thread-5] 2.00% SNPs processed.
...
[I::csp_fetch_core][Thread-9] 90.00% SNPs processed.
[I::csp_fetch_core][Thread-9] 92.00% SNPs processed.
[I::csp_fetch_core][Thread-9] 94.00% SNPs processed.
[I::csp_fetch_core][Thread-9] 96.00% SNPs processed.
[I::csp_fetch_core][Thread-9] 98.00% SNPs processed.
[I::main] All Done!
[I::main] end time: 2022-03-08 10:09:17
[I::main] time spent: 83482 seconds.

lucygarner · Mar 17 '22 10:03

Hi Lucy,

Thanks for the issue. Your dataset does indeed look relatively large, and I wonder if memory is the bottleneck. You can check the memory usage with free -h.

If that is the case, you can change your command line to -p 1 so that only one CPU is used.
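For example, a quick memory check followed by a single-CPU run might look like this (a sketch only, re-using the paths from your original command; adjust as needed):

free -h
vireo -c results.dir/merged.dir -N 4 -o results/merged.dir --randSeed=3245 -p 1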

Another option is to set a more stringent cutoff on --minCOUNT, e.g., 30 or 100, in cellSNP-lite; it looks like you already have more than enough variants. This is probably not the quickest way to sort it out, though, as you would need to re-run cellSNP-lite.
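For example, your earlier cellSNP-lite command with only the --minCOUNT cutoff raised (paths assumed from your first post):

cellsnp-lite -s data.dir/merged.bam -b data.dir/barcodes.tsv -O results.dir/merged.dir -R vcf/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz --genotype --minCOUNT 30 --minMAF 0.1 -p 10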

Yuanhua

huangyh09 · Mar 17 '22 10:03

Hi @huangyh09,

Thank you for the quick response. I am running the command on a large compute cluster, but maybe I didn't request enough memory. How much memory would you recommend specifying?

Why do you suggest using only one CPU (-p 1)? Would using more CPUs not make it quicker?

If this does not work, I will try increasing the --minCOUNT threshold for cellSNP.

Best wishes, Lucy

lucygarner · Mar 17 '22 13:03

I see. You could probably start by requesting 50 GB of memory; I would guess it won't need more than 100 GB. Another major factor for memory usage is the number of CPUs, since n copies of the data are kept, one for each sub-process. So you may want to use -p 4 as a safer start instead of 30.
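For example, if your cluster uses Slurm, a submission along these lines should work (a sketch only; adjust the scheduler options to your site, with paths re-used from your earlier command):

sbatch --mem=50G --cpus-per-task=4 --wrap "vireo -c results.dir/merged.dir -N 4 -o results/merged.dir --randSeed=3245 -p 4"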

Yuanhua

huangyh09 · Mar 17 '22 14:03