make_lastz_chains
Clarity on pipeline CLI parameters
Hello, is there more detailed documentation somewhere on each of the pipeline parameters? I'm trying to figure out how these parameters, specifically the memory requests, interact with Nextflow. For example, does --chaining_memory request that amount of memory for each Nextflow job, or for everything in total? Also, should I assume all Nextflow jobs run on only one core? If I remember correctly, the previous version of make_chains had a parameter for requesting multiple CPUs; if each job only runs on one core, that will change the partition I run them on. I am trying to align some amphibian genomes that are quite large, so I need to get creative with how I break up the reference and query genomes to maximize efficiency.
This is my current script:
```bash
#!/bin/bash
#SBATCH --job-name=makeChains
#SBATCH --array=4
#SBATCH --cpus-per-task=2
#SBATCH --time=4-00:00:00
#SBATCH --mail-type=all
#SBATCH --mail-user=[email protected]
#SBATCH --output=/scratch/sgable3/Myobatrichid_Chr_Analysis/LASTZ_chains/logs/frogs_makeChains_%A.%a.out
#SBATCH --error=/scratch/sgable3/Myobatrichid_Chr_Analysis/LASTZ_chains/frogs_makeChains_%A.%a.err
#SBATCH --mem=60GB
#SBATCH --export=NONE
#SBATCH -p general   # Partition to submit to (adjust as needed)
#SBATCH -q public

/home/sgable3/make_lastz_chains/make_chains.py \
    --project_dir ${ref_species}_${query_species}_chains \
    --cfs lastz \
    --cluster_executor slurm --cluster_queue htc \
    --seq1_chunk 175000000 --seq2_chunk 50000000 \
    --chaining_memory 30 \
    $ref_species $query_species \
    $genome_dir/${ref_species}.allScaffs.genome.fasta \
    $genome_dir/${query_species}.allScaffs.genome.fasta \
    --job_time_req 03:00:00
```
Hi Simone,
Pretty much everything in this pipeline is single-threaded, as far as I know. We parallelize by splitting the data, not by allocating more cores to a chunk of data.
The memory and other resource parameters are all tuned for individual single-threaded jobs, not for the total.
Having said that, these parameters often have a big buffer. E.g., 95 of the chaining jobs may be fine with less memory, but 5 may need more.
For large amphibians, lastz won't be a problem, since you can break both genomes into more pieces. But chaining needs to run over the entire reference genome, i.e. if you have a chromosome that is gigabases in size, it may require more memory. It is hard to estimate exactly how much, but I hope that, say, 200 GB will be enough. Please give it a try, though you may need a compute node with 500 GB or 1 TB of RAM (which is fairly standard these days). Let me know how this works.
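For a concrete starting point, the sketch below reuses the flags from the script above but bumps the per-job chaining request. This assumes --chaining_memory is given in GB, as the value of 30 in your script suggests, and "highmem" is only a placeholder for whatever your high-memory partition is called:

```bash
# Sketch only: same flags as the script above, but asking for ~200 GB per
# chaining job. --chaining_memory is a per-job request, not a total.
/home/sgable3/make_lastz_chains/make_chains.py \
    --project_dir ${ref_species}_${query_species}_chains \
    --cfs lastz \
    --cluster_executor slurm \
    --cluster_queue highmem \
    --seq1_chunk 175000000 --seq2_chunk 50000000 \
    --chaining_memory 200 \
    $ref_species $query_species \
    $genome_dir/${ref_species}.allScaffs.genome.fasta \
    $genome_dir/${query_species}.allScaffs.genome.fasta \
    --job_time_req 03:00:00
# "highmem" is a placeholder: point --cluster_queue at whichever partition
# has the 500 GB / 1 TB nodes.
```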
- Michael
Thank you! The HPC staff at my institution have actually been having a lot of trouble getting the pipeline to run properly, so I'm going to close this for now and open a new issue.
Hello, the pipeline is now running properly in the custom module, and I'm back to memory issues :) A few more questions:
- Is it worthwhile to filter out smaller contigs prior to aligning? Because frog chromosomes are so large (in addition to the genome sizes), I know a lot of memory will be required. One of our reference assemblies has 1,436 total sequences and an assembly size of 3.3 GB, and the other reference assembly we're using has 3,127 total sequences and an assembly size of 8.6 GB. All assemblies in our dataset are long-read and chromosome-level, so I'm hoping that filtering out smaller contigs would reduce the number of pairwise alignments without losing anything meaningful.
- How much memory does the lastz step typically require? I can confirm that all assemblies are thoroughly soft-masked using a combined de novo and reference-based repeat library, so repeats shouldn't be the issue.
Our institution only has 2 high-memory nodes, and the queue time to use them is quite long, so I'm trying to make sure I'm running everything in the most efficient way possible. :)
Hi Simone, no need to filter out short contigs. They will just create short chains that don't need much memory. E.g., we have aligned human as the reference to the Steller's sea cow assembly, which has >1 million scaffolds with an N50 of 1.4 kb, with no problem. (Also, you likely meant scaffolds rather than contigs, I assume; chains can never span different scaffolds, but they can span different contigs.)
- Lastz does not require much memory. I would need to look up how much memory we allocate, but we have never had an issue there. You can also always reduce the reference/query chunk sizes, which will reduce memory and give you more, but shorter, jobs to run.
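As a rough back-of-the-envelope (a sketch only, using the 8.6 GB assembly size mentioned above), halving a chunk size roughly doubles the number of lastz pieces, while each piece covers less sequence and needs less memory:

```bash
# Rough estimate of how many pieces a ~8.6 Gb assembly is split into at two
# chunk sizes (applies to whichever of --seq1_chunk / --seq2_chunk covers it;
# the real count can differ a bit because of how sequences are partitioned).
genome_size=8600000000

for chunk in 50000000 25000000; do
    echo "chunk=${chunk}: $(( (genome_size + chunk - 1) / chunk )) pieces"
done
# -> 172 pieces at 50 Mb, 344 pieces at 25 Mb
```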
If you see that some of your lastz jobs take forever, I would suggest adding WindowMasker masking on top of the existing soft-masking. Let me know how it goes.
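In case it helps, here is a minimal sketch of the usual two-pass WindowMasker run (WindowMasker ships with the NCBI BLAST+ package); the file names are placeholders, and the resulting lowercase mask would still need to be combined with the existing RepeatMasker soft-masking:

```bash
# Pass 1: build k-mer frequency counts for the (placeholder) assembly.
windowmasker -mk_counts -in ref.fa -out ref.counts

# Pass 2: soft-mask the assembly using those counts, plus DUST for
# low-complexity sequence; masked regions come out in lowercase.
windowmasker -ustat ref.counts -in ref.fa \
    -outfmt fasta -dust true -out ref.windowmasked.fa
```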
Closing due to inactivity. Please feel free to re-open if the problem persists!