GADMA icon indicating copy to clipboard operation
GADMA copied to clipboard

GADMA slow for 5 population model

Open dmacguigan opened this issue 7 months ago • 42 comments

Hello,

I'm attempting to run a 5 population model using the Moments engine with SMAC_BO_combination Bayesian optimization. GADMA has been running for 50 hours but still reports No models yet. I'm running this on a cluster compute node with 8 cores and 640 GB of RAM.

Is this long runtime normal for a larger model? Or is there something I can do to help speed things up?

Custom model file:

import moments
import numpy as np


def model_func(params, ns):
    """
    Five population demographic history for the species complex
    populations:
    A - Acin Elk 
    B - Acin upper Tennessee 
    C - Aana
    D - Amay Cumberland basin
    E - Amay Big South Fork

    All pops have constant population size.
    Order of splits in this model are based on prior phylogenetic results
    Ancestral population was split (t1 + t2) time ago into two new
    populations: AB and CDE. Next, CDE population was split into C and DE.
    Then AB then split into two populations, A and B. Finally, DE splits
    into two populations, D and E. There is symmetrical migration between pops A, B,
    C, D, and E after the final split. But there is no migration between
      ancestral populations.

    :param nAB: Size of ancestral population AB after earliest split
    :param nDCE: Size of ancestral population DCE after earliest split
    :param nA: Size of extant population A
    :param nB: Size of extant population B
    :param nC: Size of extant population C
    :param nDE: Size of ancestral population DE
    :param nD: Size of extant population D
    :param nE: Size of extant population E
    :params mAB, mAC, mAD, mAE, mBC, mBD, mBE, mCD, mCE, mDE: migration rates between each pair of extant populations
    :param t1: Time between ancestral population split to C - DE split 
    :param t2: Time between C - DE split and A - B split
    :param t3: Time between A - B split and D - E split
    :param t4: Time between D - E split to the present
    
    """

    nAB, nDCE, nA, nB, nC, nDE, nD, nE, mAB, mAC, mAD, mAE, mBC, mBD, mBE, mCD, mCE, mDE, t1, t2, t3, t4 = params
    sts = moments.LinearSystem_1D.steady_state_1D(sum(ns))
    fs = moments.Spectrum(sts)

    # first split: AB - CDE
    fs = fs.split(idx = 0, n0 = ns[0] + ns[3], n1 = ns[1] + ns[2] + ns[4])
    fs.integrate(Npop=[nAB, nDCE], tf=t1)                                 

    # second split: C - DE
    fs = fs.split(idx = 1, n0 = ns[1], n1 = ns[2] + ns[4])
    fs.integrate(Npop=[nAB, nC, nDE], tf=t2)

    # third split: A - B
    fs = fs.split(idx = 0, n0 = ns[0], n1 = ns[3])
    fs.integrate(Npop=[nA, nC, nDE, nB], tf=t3)

    # fourth split: D - E
    fs = fs.split(idx = 2, n0 = ns[2], n1 = ns[4])

    # Full 5x5 migration matrix
    mig_mat = [
        [0,   mAC, mAD, mAB, mAE],
        [mAC, 0,   mCD, mBC, mCE],
        [mAD, mCD, 0,   mBD, mDE],
        [mAB, mBC, mBD, 0,   mBE],
        [mAE, mCE, mDE, mBE, 0  ]
    ]

    fs.integrate(Npop=[nA, nC, nD, nB, nE], tf=t4, m=mig_mat)

    return fs(base)

Parameters file:

Input file: easySFS/output/dadi/A-C-D-B-E.sfs
Output directory: results/model_5pops_01_quick
Custom filename: models/model_5pops_01.py

Num init const: 2
Global optimizer: SMAC_BO_combination
Global maxeval: 200

Number of repeats: 8
Number of processes: 8

Linked SNP's: False

Engine: moments

# NEED TO REFINE THIS
Mutation rate: 2.35e-08

# NEED TO FIND REFERENCE FOR THIS
Time for generation: 2.0

#   Effective length of the sequence used to build the SFS data.
#   This should be used together with the Mutation rate and can be replaced
#   by the Theta0 setting.
#   Default: None
# https://gadma.readthedocs.io/en/latest/faq.html#how-can-i-get-effective-length-of-sequence-l
# Assume total length of sequence that was used for SFS building is equal to Nseq.
# From this data total number of X SNP’s were received.
# But not all of them were used for SFS: some of them (e.g. total number of Y SNP’s) were filtered out.
# Then we should count filtered SNP’s and take L value the following way:
# L = (X - Y) / X * Nseq
# in our case:
# X = 171832 [total number of variable sites in m80p _stats.txt file]
# 16823 SNPs in the m80p unlinked VCF, so 
# Y = 155009
# so L = ((171832 - 155009) / 171832) * 3995963) = 391220
Sequence length: 391220

#    Engine that will draw demographic model plots.
#    Can be moments or demes.
#    Default: moments
Model plot engine: demes

Units of time in drawing: thousand years

Thanks for your help, Dan

dmacguigan avatar May 21 '25 16:05 dmacguigan