wfmash step to speed up
Description of feature
Dear nf-core & pangenome team,
I have a few questions about your great program.
Based on the link (https://github.com/nf-core/pangenome/blob/1.0.0/modules/nf-core/wfmash/main.nf), it appears that wfmash performs all-vs-all alignment on a single node.
wfmash \\
${fasta_gz} \\
$query \\
$query_list \\
--threads $task.cpus \\
$paf_mappings \\
$args > ${prefix}.paf
From my trials, this is indeed the case.
I am trying to speed up the wfmash process by running parallel jobs on multiple nodes (PBSpro). My idea is to run a one-vs-all alignment on each node against the full input genome dataset (120 human pangenomes), and then merge the results into a single PAF file for further analysis.
- Do you have any recommendations for tweaking the wfmash code to achieve this?
- If I run one-vs-all alignments on each node, will the merged paf file be equivalent to an all-vs-all alignment? Theoretically, I assume the final outcome should be the same.
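The equivalence asked about in the second question can be checked empirically on a small dataset. Below is a minimal Python sketch, not part of nf-core/pangenome or wfmash. The helper names (`merge_pafs`, `paf_multiset`, `same_alignments`) are hypothetical, and it assumes that two PAFs count as equivalent when they contain the same records in their 12 mandatory columns, ignoring record order and optional tags.

```python
from collections import Counter
from itertools import chain


def merge_pafs(per_query_pafs):
    """Concatenate the records of several one-vs-all PAFs into one list.

    per_query_pafs: iterable of iterables of PAF lines (e.g. open file handles).
    """
    return [
        line.rstrip("\n")
        for line in chain.from_iterable(per_query_pafs)
        if line.strip()
    ]


def paf_multiset(lines):
    """Count records by their 12 mandatory PAF columns.

    Assumption: optional tags after column 12 are ignored for the comparison.
    """
    return Counter(
        tuple(line.rstrip("\n").split("\t")[:12])
        for line in lines
        if line.strip()
    )


def same_alignments(lines_a, lines_b):
    """True if both PAFs hold the same records, regardless of their order."""
    return paf_multiset(lines_a) == paf_multiset(lines_b)
```

With real files, the inputs would be the merged one-vs-all PAFs on one side and the all-vs-all PAF on the other. A stricter check could compare whole lines instead of only the mandatory columns.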
Looking forward to your insights.
Kind regards,
Taek
Dear @OZTaekOppa,
By default, wfmash indeed uses only one node. However, the --wfmash_chunks parameter (https://nf-co.re/pangenome/1.1.2/parameters/#wfmash_chunks) allows nf-core/pangenome to scale the all-vs-all base-pair-level alignments across the nodes of a cluster. This was also extensively evaluated in https://www.biorxiv.org/content/10.1101/2024.05.13.593871v1.
Just to be clear about wfmash again, when wfmash_chunks > 1:
- wfmash is run in approximate mapping mode, which finds sequence homologies determined by the given wfmash parameters: https://github.com/nf-core/pangenome/blob/af6d1ddca7db6714728c1c14dea4d3eca065e52c/subworkflows/local/pggb.nf#L47
- The resulting PAF is split into chunks of equal alignment problem size; the number of chunks is given by --wfmash_chunks: https://github.com/nf-core/pangenome/blob/af6d1ddca7db6714728c1c14dea4d3eca065e52c/subworkflows/local/pggb.nf#L51
- For each such chunked PAF, we can run wfmash in base-pair-level alignment mode on the nodes of a cluster in parallel: https://github.com/nf-core/pangenome/blob/af6d1ddca7db6714728c1c14dea4d3eca065e52c/subworkflows/local/pggb.nf#L54
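The splitting step can be illustrated with a small Python sketch. This is not the pipeline's actual script: the function name `split_paf_into_chunks` is hypothetical, and using the query span (PAF columns 3 and 4) as a proxy for the alignment problem size is an assumption made only for illustration. Under these assumptions, a greedy strategy assigns each mapping, largest first, to the currently lightest chunk.

```python
import heapq


def split_paf_into_chunks(paf_lines, num_chunks):
    """Distribute PAF mapping records over num_chunks bins of similar total weight.

    Assumption: a record's weight is its query span (query end minus query
    start, PAF columns 4 and 3). The real pipeline may weight records differently.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be >= 1")

    records = []
    for line in paf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        weight = int(fields[3]) - int(fields[2])
        records.append((weight, line))

    # Largest records first; each one goes to the currently lightest chunk.
    records.sort(key=lambda rec: rec[0], reverse=True)
    heap = [(0, idx) for idx in range(num_chunks)]  # (total weight, chunk index)
    heapq.heapify(heap)
    chunks = [[] for _ in range(num_chunks)]
    for weight, line in records:
        total, idx = heapq.heappop(heap)
        chunks[idx].append(line)
        heapq.heappush(heap, (total + weight, idx))
    return chunks
```

Each returned chunk could then be written to its own PAF file and handed to a separate alignment job, which is the role --wfmash_chunks plays in the pipeline.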
I hope this answers your question!
I didn't test it for one vs. all, but it should work out the same way.
This question is also discussed at https://github.com/pangenome/pggb/issues/403.
Hi @subwaystation,
Thank you for your prompt reply. I will get back to you after testing your suggestion.
Cheers,
Taek
Hi @subwaystation,
The current single-node approach requires significant RAM, CPUs, and extended walltime. The HPC team is exploring alternative solutions to run parallel jobs across multiple nodes.
From testing a small dataset, both the all-vs-all and one-vs-all approaches produced the same outcome. Currently, I am working with the team to optimize the partition and PGGB steps for Nextflow.
Cheers,
Taek
I am a little bit confused. As stated above, there is an option to run wfmash directly across several nodes. Did you try it?
Otherwise, I am curious how your plans will turn out :)
I am assuming this is solved. Otherwise, please re-open.