souporcell
high storage use on HPC
Hi all,
I've been very happy using souporcell, and this forum has been excellent for answering many of my questions. Thanks.
I am running souporcell on the same sample dataset, but with varying values of k, on a high-performance computing (HPC) cluster. There are different ways to do this, but I have chosen to do it with individual job scripts (.sh). Unfortunately, when I run 15-30 of these jobs over the course of a couple of days, my allocated storage of 20TB on the cluster starts to evaporate very quickly.
This problem is odd to me, because the inputs and outputs of souporcell are not large enough on their own to account for several terabytes. My guess is that the intermediate files produced by the souporcell pipeline are the culprit. Does anyone have suggestions to help? I include a template job script I use for nearly all of the jobs below:
```shell
#!/bin/bash
#SBATCH --account=myusername
#SBATCH --cpus-per-task=8
#SBATCH --mem=33G

#### my HPC does not use singularity, so I need to use apptainer instead
module load apptainer

### this is a sample script for 54 clusters, so k=54 here
apptainer exec -B /home/pathtodirectorywithfilesonHPC:/test souporcell_latest.sif \
    souporcell_pipeline.py \
    -i /test/possorted_genome_bam.bam \
    -b /test/barcodes.tsv.gz \
    -f /test/referencegenomeofmystudyorganism.fasta \
    -t 8 \
    -o /test/out_230119_33g_54k_newref \
    -k 54
```
Out of interest, is there a reason you don't use the --skip_remap option? That avoids wasting a great deal of disk space and CPU cycles by not re-running the full minimap2 remapping for each k.
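For what it's worth, a sketch of what that could look like with the template script above. Note that --skip_remap is only sensible together with a common variants VCF; the /test/common_variants.vcf path and the output directory name here are placeholders, and you should check `souporcell_pipeline.py --help` on your version for the exact flag syntax:

```shell
apptainer exec -B /home/pathtodirectorywithfilesonHPC:/test souporcell_latest.sif \
    souporcell_pipeline.py \
    -i /test/possorted_genome_bam.bam \
    -b /test/barcodes.tsv.gz \
    -f /test/referencegenomeofmystudyorganism.fasta \
    --skip_remap True \
    --common_variants /test/common_variants.vcf \
    -t 8 \
    -o /test/out_skip_remap_54k \
    -k 54
```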
If you don't have a common variants file, remapping is highly recommended. Maybe run the jobs in series? And you could delete everything but the files you need between runs.
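For the "delete everything but the files you need" route, here is a minimal sketch of a cleanup step you could run after each job finishes. The keep-list (clusters.tsv, cluster_genotypes.vcf, ambient_rna.txt) and the assumption that everything else in the output directory is disposable are mine — list your own output directory before deleting anything:

```shell
# Remove everything from a finished souporcell run directory except the
# final outputs. Edit the keep-list in the case statement to taste.
prune_run() {
    dir="$1"
    for f in "$dir"/*; do
        case "$(basename "$f")" in
            clusters.tsv|cluster_genotypes.vcf|ambient_rna.txt) ;;  # keep
            *) rm -rf "$f" ;;                                       # drop
        esac
    done
}
```

Called as `prune_run /test/out_230119_33g_54k_newref` at the end of each job script, this frees the space taken by remapping intermediates before the next job starts.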
The SAM files are not zipped (which is fair, since they are deleted after all the processes finish), and this can cause high storage use if you are running demultiplexing on multiple 10x lanes/experiments at a time. That can indeed be alleviated either by skipping the remapping entirely or by demultiplexing only one 10x lane/experiment at a time. I tried to reduce the storage usage by modifying some of the scripts to zip the SAM files and read the zipped SAM files instead.
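The zip-and-read idea in shell terms is just gzip plus streaming decompression; whether a given pipeline step will accept a stream instead of a flat file depends on which script you modify. A toy sketch (the file here is a two-line stand-in, since real intermediate SAMs are many GB):

```shell
# Toy stand-in for an uncompressed intermediate SAM file
printf '@HD\tVN:1.6\nread1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\n' > intermediate.sam

gzip -f intermediate.sam            # intermediate.sam -> intermediate.sam.gz

zcat intermediate.sam.gz | wc -l    # downstream step reads the stream; prints 2
```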
Another alternative is to take a run directory that has already completed everything up to vartrix and run different k values against that incomplete run (you can delete the .done files for the later stages to have those stages rerun). I know this is a bit cumbersome. I plan on implementing automatic multi-k support at some point, but I don't have funding to work on this right now.
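A sketch of what that could look like, reusing one shared run directory. The .done marker names here (clustering.done, consensus.done, troublet.done) are my guesses — list the directory after one full run to see which markers your souporcell version actually writes, and delete only the ones for the stages after vartrix:

```shell
# Host path; inside the container it is bound to /test via -B
HOST_BASE=/home/pathtodirectorywithfilesonHPC/out_shared

for K in 40 47 54; do
    # clear the post-vartrix markers so those stages rerun with the new k
    rm -f "$HOST_BASE"/clustering.done "$HOST_BASE"/consensus.done "$HOST_BASE"/troublet.done

    apptainer exec -B /home/pathtodirectorywithfilesonHPC:/test souporcell_latest.sif \
        souporcell_pipeline.py \
        -i /test/possorted_genome_bam.bam \
        -b /test/barcodes.tsv.gz \
        -f /test/referencegenomeofmystudyorganism.fasta \
        -t 8 -o /test/out_shared -k "$K"

    # stash the per-k results before the next iteration overwrites them
    mkdir -p "$HOST_BASE/k$K"
    cp "$HOST_BASE"/clusters.tsv "$HOST_BASE"/cluster_genotypes.vcf "$HOST_BASE/k$K"/
done
```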