
wfmash step to speed up

Open OZTaekOppa opened this issue 1 year ago • 6 comments

Description of feature

Dear nf-core & pangenome team,

I have a few questions about your great program.

Based on the link (https://github.com/nf-core/pangenome/blob/1.0.0/modules/nf-core/wfmash/main.nf), it appears that wfmash performs all-vs-all alignment on a single node.

wfmash \\
    ${fasta_gz} \\
    $query \\
    $query_list \\
    --threads $task.cpus \\
    $paf_mappings \\
    $args > ${prefix}.paf

From my trials, this is indeed the case.

I am trying to speed up the wfmash process across multiple nodes (PBSpro) by running parallel jobs. My idea is to perform one-vs-all alignments, one per node, for each genome in the full input dataset (120 human genome assemblies), and then merge the results into a single PAF file for further analysis.

  1. Do you have any recommendations for tweaking the wfmash code to achieve this?
  2. If I run one-vs-all alignments on each node, will the merged PAF file be equivalent to an all-vs-all alignment? Theoretically, I assume the final outcome should be the same.
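Roughly, the scheme I have in mind looks like this (a sketch only: the per-genome FASTA layout, paths, and resource requests are placeholders; wfmash is called as `wfmash <target> <query>`, so each job maps one genome against the full set):

```shell
# Sketch of the planned one-vs-all scheme: one PBS job per query genome.
# File layout and resource requests are illustrative placeholders.
mkdir -p paf
for q in genomes/*.fa; do
    name=$(basename "$q" .fa)
    # Map one query genome against the full assembly set on its own node.
    echo "wfmash assemblies.fa.gz $q --threads 16 > paf/$name.paf" \
        | qsub -N "wfmash_$name" -l select=1:ncpus=16
done
# Once all jobs have finished, concatenate the per-query PAFs:
cat paf/*.paf > merged.paf
```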

Looking forward to your insights.

Kind regards,

Taek

OZTaekOppa avatar Aug 01 '24 05:08 OZTaekOppa

Dear @OZTaekOppa,

By default, wfmash indeed only makes use of a single node. However, the parameter --wfmash_chunks (https://nf-co.re/pangenome/1.1.2/parameters/#wfmash_chunks) allows nf-core/pangenome to scale the all-vs-all base-pair-level alignments across the nodes of a cluster. This was also evaluated extensively in https://www.biorxiv.org/content/10.1101/2024.05.13.593871v1.
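For example, a hypothetical invocation might look like the following (input path, haplotype count, chunk count, and profile are placeholders for your setup):

```shell
# Hypothetical nf-core/pangenome run that splits the base-level alignment
# into 20 chunks the scheduler can place on different nodes.
nextflow run nf-core/pangenome \
    -r 1.1.2 \
    -profile singularity \
    --input assemblies.fa.gz \
    --n_haplotypes 120 \
    --wfmash_chunks 20 \
    --outdir results
```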

To be clear about what wfmash does when --wfmash_chunks > 1:

  1. wfmash is run in approximate mapping mode, which finds sequence homologies as determined by the given wfmash parameters: https://github.com/nf-core/pangenome/blob/af6d1ddca7db6714728c1c14dea4d3eca065e52c/subworkflows/local/pggb.nf#L47
  2. The resulting PAF is split into chunks of equal alignment problem size; the number of chunks is given by --wfmash_chunks: https://github.com/nf-core/pangenome/blob/af6d1ddca7db6714728c1c14dea4d3eca065e52c/subworkflows/local/pggb.nf#L51
  3. For each such chunked PAF, wfmash can be run in base-pair-level alignment mode on the nodes of a cluster in parallel: https://github.com/nf-core/pangenome/blob/af6d1ddca7db6714728c1c14dea4d3eca065e52c/subworkflows/local/pggb.nf#L54
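Outside of Nextflow, the three steps above can be sketched roughly as follows. This assumes wfmash's --approx-map and --input-paf options; file names and the chunking command are illustrative stand-ins (the pipeline balances chunks by alignment problem size with a dedicated helper, not a plain line split):

```shell
# Illustrative sketch of the chunked wfmash workflow; paths are placeholders.

# 1. Approximate mapping only (no base-level alignment yet).
wfmash --approx-map --threads 16 assemblies.fa.gz > approx.paf

# 2. Split the mappings into 20 chunks. NOTE: `split` by line count is a
#    crude stand-in for the pipeline's problem-size-balanced splitting.
split -n l/20 -d approx.paf chunk_

# 3. Align each chunk at base-pair level, one cluster job per chunk
#    (shown here as local background jobs for brevity).
for c in chunk_*; do
    wfmash --input-paf "$c" --threads 16 assemblies.fa.gz > "$c.aln.paf" &
done
wait
cat chunk_*.aln.paf > alignments.paf
```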

I hope this answers your question!

subwaystation avatar Aug 01 '24 06:08 subwaystation

I didn't test it for one-vs-all, but it should work out the same way.

subwaystation avatar Aug 01 '24 06:08 subwaystation

This question is also discussed at https://github.com/pangenome/pggb/issues/403.

subwaystation avatar Aug 01 '24 06:08 subwaystation

Hi @subwaystation,

Thank you for your prompt reply. I will get back to you after testing your suggestion.

Cheers,

Taek

OZTaekOppa avatar Aug 01 '24 06:08 OZTaekOppa

Hi @subwaystation,

The current single-node approach requires significant RAM, CPUs, and extended walltime. The HPC team is exploring alternative solutions to run parallel jobs across multiple nodes.

From testing a small dataset, both the all-vs-all and one-vs-all approaches produced the same outcome. Currently, I am working with the team to optimize the partition and PGGB steps for Nextflow.

Cheers,

Taek

OZTaekOppa avatar Aug 16 '24 06:08 OZTaekOppa

I am a little bit confused. There is an option to run wfmash directly across several nodes, as stated above. Did you try it?

Either way, I am curious how your plans will turn out :)

subwaystation avatar Aug 16 '24 07:08 subwaystation

Assuming this is solved; if not, please re-open.

subwaystation avatar Jan 16 '25 08:01 subwaystation